# Fault Tolerance Testing

This document describes the test infrastructure for validating Dynamo's fault tolerance mechanisms. The testing framework covers request cancellation, request migration, etcd HA, and hardware fault injection scenarios.

## Overview

Dynamo's fault tolerance test suite is located in `tests/fault_tolerance/` and includes:

| Test Category | Location | Purpose |
|---------------|----------|---------|
| Cancellation | `cancellation/` | Request cancellation during in-flight operations |
| Migration | `migration/` | Request migration when workers fail |
| etcd HA | `etcd_ha/` | etcd failover and recovery |
| Hardware | `hardware/` | GPU and network fault injection |
| Deployment | `deploy/` | End-to-end deployment testing |

## Test Directory Structure

```
tests/fault_tolerance/
├── cancellation/
│   ├── test_vllm.py
│   ├── test_trtllm.py
│   ├── test_sglang.py
│   └── utils.py
├── migration/
│   ├── test_vllm.py
│   ├── test_trtllm.py
│   ├── test_sglang.py
│   └── utils.py
├── etcd_ha/
│   ├── test_vllm.py
│   ├── test_trtllm.py
│   ├── test_sglang.py
│   └── utils.py
├── hardware/
│   └── fault_injection_service/
│       ├── api_service/
│       └── agents/
├── deploy/
│   ├── test_deployment.py
│   ├── scenarios.py
│   ├── base_checker.py
│   └── ...
└── client.py
```

## Request Cancellation Tests

These tests verify that in-flight requests can be properly canceled.

### Running Cancellation Tests

```bash
# Run all cancellation tests
pytest tests/fault_tolerance/cancellation/ -v

# Run for a specific backend
pytest tests/fault_tolerance/cancellation/test_vllm.py -v
```

### Cancellation Test Utilities

The `cancellation/utils.py` module provides:

#### CancellableRequest

Thread-safe request cancellation via TCP socket manipulation:

```python
import time
from threading import Thread

from tests.fault_tolerance.cancellation.utils import CancellableRequest

request = CancellableRequest()

# Send the request in a separate thread
# (send_request is your own function that issues the request)
thread = Thread(target=send_request, args=(request,))
thread.start()

# Cancel after some time
time.sleep(1)
request.cancel()  # Closes the underlying socket
```

#### send_completion_request / send_chat_completion_request

Send cancellable completion requests:

```python
from tests.fault_tolerance.cancellation.utils import (
    send_completion_request,
    send_chat_completion_request,
)

# Non-streaming
response = send_completion_request(
    base_url="http://localhost:8000",
    model="Qwen/Qwen3-0.6B",
    prompt="Hello, world!",
    max_tokens=100,
)

# Streaming with cancellation
responses = send_chat_completion_request(
    base_url="http://localhost:8000",
    model="Qwen/Qwen3-0.6B",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
    cancellable_request=request,
)
```

#### poll_for_pattern

Wait for specific patterns in logs:

```python
from tests.fault_tolerance.cancellation.utils import poll_for_pattern

# Wait for cancellation confirmation
found = poll_for_pattern(
    log_file="/var/log/dynamo/worker.log",
    pattern="Request cancelled",
    timeout=30,
    interval=0.5,
)
```
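Putting these utilities together, the sketch below shows what a minimal end-to-end cancellation test might look like. It reuses only the names shown above; the log path, log pattern, and prompt are illustrative assumptions, so adapt them to your deployment.

```python
import time
from threading import Thread

from tests.fault_tolerance.cancellation.utils import (
    CancellableRequest,
    poll_for_pattern,
    send_chat_completion_request,
)


def test_streaming_cancellation():
    """Start a streaming request, cancel it mid-flight, and check the worker log."""
    request = CancellableRequest()

    # Drive the streaming request from a background thread so the test can cancel it.
    thread = Thread(
        target=send_chat_completion_request,
        kwargs=dict(
            base_url="http://localhost:8000",
            model="Qwen/Qwen3-0.6B",
            messages=[{"role": "user", "content": "Write a long story."}],
            stream=True,
            cancellable_request=request,
        ),
    )
    thread.start()

    # Let a few tokens stream, then cancel by closing the socket.
    # The request thread may end with an error once the socket closes; that is expected.
    time.sleep(1)
    request.cancel()
    thread.join(timeout=10)

    # The worker should log the cancellation (path and pattern are assumptions).
    assert poll_for_pattern(
        log_file="/var/log/dynamo/worker.log",
        pattern="Request cancelled",
        timeout=30,
        interval=0.5,
    )
```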
## Migration Tests

These tests verify that requests migrate to healthy workers when failures occur.

### Running Migration Tests

```bash
# Run all migration tests
pytest tests/fault_tolerance/migration/ -v

# Run for a specific backend
pytest tests/fault_tolerance/migration/test_vllm.py -v
```

### Migration Test Utilities

The `migration/utils.py` module provides:

- A frontend wrapper with configurable request planes
- Long-running request spawning for migration scenarios
- Health check disabling for controlled testing

### Example Migration Test

```python
def test_migration_on_worker_failure():
    # Start a deployment with 2 workers
    deployment = start_deployment(workers=2)

    # Send a long-running request
    request_thread = spawn_long_request(max_tokens=1000)

    # Kill one worker mid-generation
    kill_worker(deployment.workers[0])

    # Verify the request completes on the remaining worker
    response = request_thread.join()
    assert response.status_code == 200
    assert len(response.tokens) > 0
```

## etcd HA Tests

These tests exercise system behavior during etcd failures and recovery.

### Running etcd HA Tests

```bash
pytest tests/fault_tolerance/etcd_ha/ -v
```

### Test Scenarios

- **Leader failover**: The etcd leader node fails and the cluster elects a new leader
- **Network partition**: An etcd node becomes unreachable
- **Recovery**: The system recovers after etcd becomes available again

## Hardware Fault Injection

The fault injection service enables testing under simulated hardware failures.

### Fault Injection Service

Located at `tests/fault_tolerance/hardware/fault_injection_service/`, this FastAPI service orchestrates fault injection:

```bash
# Start the fault injection service
cd tests/fault_tolerance/hardware/fault_injection_service
python -m api_service.main
```

### Supported Fault Types

#### GPU Faults

| Fault Type | Description |
|------------|-------------|
| `XID_ERROR` | Simulate a GPU XID error (various codes) |
| `THROTTLE` | GPU thermal throttling |
| `MEMORY_PRESSURE` | GPU memory exhaustion |
| `OVERHEAT` | GPU overheating condition |
| `COMPUTE_OVERLOAD` | GPU compute saturation |

#### Network Faults

| Fault Type | Description |
|------------|-------------|
| `FRONTEND_WORKER` | Partition between frontend and workers |
| `WORKER_NATS` | Partition between workers and NATS |
| `WORKER_WORKER` | Partition between workers |
| `CUSTOM` | Custom network partition |

### Fault Injection API

#### Inject GPU Fault

```bash
curl -X POST http://localhost:8080/api/v1/faults/gpu/inject \
  -H "Content-Type: application/json" \
  -d '{
    "target_pod": "vllm-worker-0",
    "fault_type": "XID_ERROR",
    "severity": "HIGH"
  }'
```

#### Inject Specific XID Error

```bash
# Inject XID 79 (GPU has fallen off the bus)
curl -X POST http://localhost:8080/api/v1/faults/gpu/inject/xid-79 \
  -H "Content-Type: application/json" \
  -d '{"target_pod": "vllm-worker-0"}'
```

Supported XID codes: 43, 48, 74, 79, 94, 95, 119, 120

#### Inject Network Partition

```bash
curl -X POST http://localhost:8080/api/v1/faults/network/inject \
  -H "Content-Type: application/json" \
  -d '{
    "partition_type": "FRONTEND_WORKER",
    "duration_seconds": 30
  }'
```

#### Recover from Fault

```bash
curl -X POST http://localhost:8080/api/v1/faults/{fault_id}/recover
```

#### List Active Faults

```bash
curl http://localhost:8080/api/v1/faults
```

### GPU Fault Injector Agent

The GPU fault injector runs as a DaemonSet on worker nodes:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gpu-fault-injector
spec:
  selector:
    matchLabels:
      app: gpu-fault-injector
  template:
    metadata:
      labels:
        app: gpu-fault-injector
    spec:
      containers:
        - name: agent
          image: dynamo/gpu-fault-injector:latest
          securityContext:
            privileged: true
          volumeMounts:
            - name: dev
              mountPath: /dev
      volumes:
        - name: dev
          hostPath:
            path: /dev
```

The agent injects fake XID messages via `/dev/kmsg` to trigger NVSentinel detection.
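The exact message format the agent emits is internal to the agent; the sketch below only illustrates the mechanism, assuming the driver's standard `NVRM: Xid` kernel log format and a hypothetical PCI address. Writing to `/dev/kmsg` requires root in a privileged container, which is why the DaemonSet above sets `privileged: true` and mounts `/dev`.

```python
# Illustrative sketch only: how a privileged agent can surface a fake XID error.
# The message text mimics the driver's "NVRM: Xid" dmesg format; the real agent's
# message format and PCI address handling may differ.


def inject_fake_xid(xid: int, pci_bdf: str = "0000:3b:00") -> None:
    """Write a fake NVRM Xid line to the kernel log via /dev/kmsg."""
    # "<3>" is the syslog priority prefix for KERN_ERR; /dev/kmsg treats
    # each write() as one log record.
    message = f"<3>NVRM: Xid (PCI:{pci_bdf}): {xid}, simulated fault for testing\n"
    with open("/dev/kmsg", "w") as kmsg:
        kmsg.write(message)


inject_fake_xid(79)  # Appears in dmesg, where NVSentinel-style watchers can detect it
```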
## Deployment Testing Framework

The `deploy/` directory contains an end-to-end testing framework.

### Test Phases

Tests run through three phases:

| Phase | Description |
|-------|-------------|
| `STANDARD` | Baseline performance under normal conditions |
| `OVERFLOW` | System behavior during fault/overload |
| `RECOVERY` | System recovery after fault resolution |

### Scenario Configuration

Define test scenarios in `scenarios.py`:

```python
from tests.fault_tolerance.deploy.scenarios import Scenario, Load, Failure

scenario = Scenario(
    name="worker_failure_migration",
    backend="vllm",
    load=Load(
        clients=10,
        requests_per_client=100,
        max_tokens=256,
    ),
    failure=Failure(
        type="pod_kill",
        target="vllm-worker-0",
        trigger_after_requests=50,
    ),
)
```

### Running Deployment Tests

```bash
# Run all deployment tests
pytest tests/fault_tolerance/deploy/test_deployment.py -v

# Run a specific scenario
pytest tests/fault_tolerance/deploy/test_deployment.py::test_worker_failure -v
```

### Validation Checkers

The framework includes pluggable validators:

```python
from tests.fault_tolerance.deploy.base_checker import BaseChecker, ValidationContext


class MigrationChecker(BaseChecker):
    def check(self, context: ValidationContext) -> bool:
        # Verify that migrations occurred
        migrations = context.metrics.get("migrations_total", 0)
        return migrations > 0
```

### Results Parsing

Parse test results for analysis:

```python
from tests.fault_tolerance.deploy.parse_results import process_overflow_recovery_test

results = process_overflow_recovery_test(log_dir="/path/to/logs")
print(f"Success rate: {results['success_rate']}")
print(f"P99 latency: {results['p99_latency_ms']}ms")
```

## Client Utilities

The `client.py` module provides shared client functionality:

### Multi-Threaded Load Generation

```python
from tests.fault_tolerance.client import client

# Generate load with multiple concurrent clients
results = client(
    base_url="http://localhost:8000",
    num_clients=10,
    requests_per_client=100,
    model="Qwen/Qwen3-0.6B",
    max_tokens=256,
    log_dir="/tmp/test_logs",
)
```

### Request Options

| Parameter | Description |
|-----------|-------------|
| `base_url` | Frontend URL |
| `num_clients` | Number of concurrent clients |
| `requests_per_client` | Requests per client |
| `model` | Model name |
| `max_tokens` | Max tokens per request |
| `log_dir` | Directory for client logs |
| `endpoint` | `completions` or `chat/completions` |

## Running the Full Test Suite

### Prerequisites

1. Kubernetes cluster with GPU nodes
2. Dynamo deployment
3. etcd cluster (for HA tests)
4. Fault injection service (for hardware tests)

### Environment Setup

```bash
export KUBECONFIG=/path/to/kubeconfig
export DYNAMO_NAMESPACE=dynamo-test
export FRONTEND_URL=http://localhost:8000
```

### Run All Tests

```bash
# Install test dependencies
pip install pytest pytest-asyncio

# Run all fault tolerance tests
pytest tests/fault_tolerance/ -v --tb=short

# Run with specific markers
pytest tests/fault_tolerance/ -v -m "not slow"
```

### Test Markers

| Marker | Description |
|--------|-------------|
| `slow` | Long-running tests (> 5 minutes) |
| `gpu` | Requires GPU resources |
| `k8s` | Requires Kubernetes cluster |
| `etcd_ha` | Requires multi-node etcd |
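Deselecting markers on the command line works (as with `-m "not slow"` above), but environment-dependent markers such as `gpu` can also be skipped automatically at collection time. The `conftest.py` sketch below is a hypothetical addition, not part of the suite, that skips `gpu`-marked tests when `nvidia-smi` is not on the PATH:

```python
# Hypothetical conftest.py addition: auto-skip gpu-marked tests on GPU-less machines.
import shutil

import pytest


def pytest_collection_modifyitems(config, items):
    # Treat a missing nvidia-smi binary as "no GPU available".
    if shutil.which("nvidia-smi") is not None:
        return
    skip_gpu = pytest.mark.skip(reason="GPU not available (nvidia-smi not found)")
    for item in items:
        if "gpu" in item.keywords:
            item.add_marker(skip_gpu)
```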
## Best Practices

### 1. Isolate Test Environments

Run fault tolerance tests in dedicated namespaces:

```bash
kubectl create namespace dynamo-fault-test
```

### 2. Clean Up After Tests

Ensure every injected fault is recovered:

```bash
# List and recover all active faults
curl http://localhost:8080/api/v1/faults | jq -r '.[].id' | \
  xargs -I {} curl -X POST http://localhost:8080/api/v1/faults/{}/recover
```

### 3. Collect Logs

Preserve logs for debugging:

```bash
pytest tests/fault_tolerance/ -v \
  --log-dir=/tmp/fault_test_logs \
  --capture=no
```

### 4. Monitor During Tests

Watch system state while tests run:

```bash
# Terminal 1: Watch pods
watch kubectl get pods -n dynamo-test

# Terminal 2: Watch metrics
watch 'curl -s localhost:8000/metrics | grep -E "(migration|rejection)"'
```

## Related Documentation

- [Request Migration](/dynamo/v-0-9-0/user-guides/fault-tolerance/request-migration) - Migration implementation details
- [Request Cancellation](/dynamo/v-0-9-0/user-guides/fault-tolerance/request-cancellation) - Cancellation implementation
- [Health Checks](/dynamo/v-0-9-0/user-guides/observability-local/health-checks) - Health monitoring
- [Metrics](/dynamo/v-0-9-0/user-guides/observability-local/metrics) - Available metrics for monitoring