Fault Tolerance Testing


This document describes the test infrastructure for validating Dynamo's fault tolerance mechanisms. The testing framework covers request cancellation, request migration, etcd high availability (HA), and hardware fault injection scenarios.

Overview

Dynamo’s fault tolerance test suite is located in tests/fault_tolerance/ and includes:

| Test Category | Location | Purpose |
|---------------|----------|---------|
| Cancellation | `cancellation/` | Request cancellation during in-flight operations |
| Migration | `migration/` | Request migration when workers fail |
| etcd HA | `etcd_ha/` | etcd failover and recovery |
| Hardware | `hardware/` | GPU and network fault injection |
| Deployment | `deploy/` | End-to-end deployment testing |

Test Directory Structure

```text
tests/fault_tolerance/
├── cancellation/
│   ├── test_vllm.py
│   ├── test_trtllm.py
│   ├── test_sglang.py
│   └── utils.py
├── migration/
│   ├── test_vllm.py
│   ├── test_trtllm.py
│   ├── test_sglang.py
│   └── utils.py
├── etcd_ha/
│   ├── test_vllm.py
│   ├── test_trtllm.py
│   ├── test_sglang.py
│   └── utils.py
├── hardware/
│   └── fault_injection_service/
│       ├── api_service/
│       └── agents/
├── deploy/
│   ├── test_deployment.py
│   ├── scenarios.py
│   ├── base_checker.py
│   └── ...
└── client.py
```

Request Cancellation Tests

Test that in-flight requests can be properly canceled.

Running Cancellation Tests

```bash
# Run all cancellation tests
pytest tests/fault_tolerance/cancellation/ -v

# Run for specific backend
pytest tests/fault_tolerance/cancellation/test_vllm.py -v
```

Cancellation Test Utilities

The cancellation/utils.py module provides:

CancellableRequest

Thread-safe request cancellation via TCP socket manipulation:

```python
import time
from threading import Thread

from tests.fault_tolerance.cancellation.utils import CancellableRequest

request = CancellableRequest()

# Send request in separate thread
thread = Thread(target=send_request, args=(request,))
thread.start()

# Cancel after some time
time.sleep(1)
request.cancel()  # Closes underlying socket
```

send_completion_request / send_chat_completion_request

Send cancellable completion requests:

```python
from tests.fault_tolerance.cancellation.utils import (
    send_completion_request,
    send_chat_completion_request,
)

# Non-streaming
response = send_completion_request(
    base_url="http://localhost:8000",
    model="Qwen/Qwen3-0.6B",
    prompt="Hello, world!",
    max_tokens=100,
)

# Streaming with cancellation
responses = send_chat_completion_request(
    base_url="http://localhost:8000",
    model="Qwen/Qwen3-0.6B",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
    cancellable_request=request,
)
```

poll_for_pattern

Wait for specific patterns in logs:

```python
from tests.fault_tolerance.cancellation.utils import poll_for_pattern

# Wait for cancellation confirmation
found = poll_for_pattern(
    log_file="/var/log/dynamo/worker.log",
    pattern="Request cancelled",
    timeout=30,
    interval=0.5,
)
```

Migration Tests

Test that requests migrate to healthy workers when failures occur.

Running Migration Tests

```bash
# Run all migration tests
pytest tests/fault_tolerance/migration/ -v

# Run for specific backend
pytest tests/fault_tolerance/migration/test_vllm.py -v
```

Migration Test Utilities

The migration/utils.py module provides:

  • Frontend wrapper with configurable request planes
  • Long-running request spawning for migration scenarios
  • Health check disabling for controlled testing

Example Migration Test

```python
def test_migration_on_worker_failure():
    # Start deployment with 2 workers
    deployment = start_deployment(workers=2)

    # Send long-running request
    request_thread = spawn_long_request(max_tokens=1000)

    # Kill one worker mid-generation
    kill_worker(deployment.workers[0])

    # Verify request completes on remaining worker
    response = request_thread.join()
    assert response.status_code == 200
    assert len(response.tokens) > 0
```

etcd HA Tests

Test system behavior during etcd failures and recovery.

Running etcd HA Tests

```bash
pytest tests/fault_tolerance/etcd_ha/ -v
```

Test Scenarios

  • Leader failover: etcd leader node fails, cluster elects new leader
  • Network partition: etcd node becomes unreachable
  • Recovery: System recovers after etcd becomes available
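The recovery scenario boils down to polling: after etcd comes back, the test repeatedly probes the system until requests succeed again or a deadline passes. A minimal, stdlib-only sketch of such a polling helper is below; the function name `wait_for_recovery` is illustrative and not part of the test suite.

```python
import time


def wait_for_recovery(check, timeout=60.0, interval=0.5):
    """Poll `check()` until it returns True or `timeout` seconds elapse.

    In an etcd HA test, `check` might issue a completion request against
    the frontend and return True once it gets a 200 response.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(interval)
    return False
```

A real test would pass a `check` callable that exercises the frontend; the helper keeps the timeout logic in one place so failures produce a clear "did not recover within N seconds" signal.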

Hardware Fault Injection

The fault injection service enables testing under simulated hardware failures.

Fault Injection Service

Located at tests/fault_tolerance/hardware/fault_injection_service/, this FastAPI service orchestrates fault injection:

```bash
# Start the fault injection service
cd tests/fault_tolerance/hardware/fault_injection_service
python -m api_service.main
```

Supported Fault Types

GPU Faults

| Fault Type | Description |
|------------|-------------|
| XID_ERROR | Simulate GPU XID error (various codes) |
| THROTTLE | GPU thermal throttling |
| MEMORY_PRESSURE | GPU memory exhaustion |
| OVERHEAT | GPU overheating condition |
| COMPUTE_OVERLOAD | GPU compute saturation |

Network Faults

| Fault Type | Description |
|------------|-------------|
| FRONTEND_WORKER | Partition between frontend and workers |
| WORKER_NATS | Partition between workers and NATS |
| WORKER_WORKER | Partition between workers |
| CUSTOM | Custom network partition |

Fault Injection API

Inject GPU Fault

```bash
curl -X POST http://localhost:8080/api/v1/faults/gpu/inject \
  -H "Content-Type: application/json" \
  -d '{
    "target_pod": "vllm-worker-0",
    "fault_type": "XID_ERROR",
    "severity": "HIGH"
  }'
```

Inject Specific XID Error

```bash
# Inject XID 79 (GPU memory page fault)
curl -X POST http://localhost:8080/api/v1/faults/gpu/inject/xid-79 \
  -H "Content-Type: application/json" \
  -d '{"target_pod": "vllm-worker-0"}'
```

Supported XID codes: 43, 48, 74, 79, 94, 95, 119, 120
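When driving these endpoints from test code rather than curl, it helps to validate the XID code before issuing the request. The helper below is an illustrative sketch, not part of the service's client library; it only builds the request path and JSON body shown in the curl examples above.

```python
import json

# XID codes the fault injection service accepts (from the docs above)
SUPPORTED_XIDS = {43, 48, 74, 79, 94, 95, 119, 120}


def build_xid_injection(target_pod, xid):
    """Return (path, body) for the XID-specific injection endpoint,
    rejecting unsupported XID codes up front."""
    if xid not in SUPPORTED_XIDS:
        raise ValueError(f"unsupported XID code: {xid}")
    path = f"/api/v1/faults/gpu/inject/xid-{xid}"
    body = json.dumps({"target_pod": target_pod})
    return path, body
```

A test could POST the returned body to the returned path with any HTTP client; failing fast on an unsupported code keeps the error local to the test instead of surfacing as a confusing 4xx from the service.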

Inject Network Partition

```bash
curl -X POST http://localhost:8080/api/v1/faults/network/inject \
  -H "Content-Type: application/json" \
  -d '{
    "partition_type": "FRONTEND_WORKER",
    "duration_seconds": 30
  }'
```

Recover from Fault

```bash
curl -X POST http://localhost:8080/api/v1/faults/{fault_id}/recover
```

List Active Faults

```bash
curl http://localhost:8080/api/v1/faults
```

GPU Fault Injector Agent

The GPU fault injector runs as a DaemonSet on worker nodes:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gpu-fault-injector
spec:
  selector:
    matchLabels:
      app: gpu-fault-injector
  template:
    metadata:
      labels:
        app: gpu-fault-injector
    spec:
      containers:
        - name: agent
          image: dynamo/gpu-fault-injector:latest
          securityContext:
            privileged: true
          volumeMounts:
            - name: dev
              mountPath: /dev
      volumes:
        - name: dev
          hostPath:
            path: /dev
```

The agent injects fake XID messages via /dev/kmsg to trigger NVSentinel detection.
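The lines the agent writes mimic the `NVRM: Xid ...` messages the NVIDIA driver emits to the kernel log, so log-watching detectors treat them as real faults. The sketch below is illustrative only: the exact message format real drivers emit varies by driver version and fault, and the PCI address and detail text here are made up.

```python
def format_fake_xid(pci_bdf, xid, detail="simulated fault"):
    """Format a dmesg-style NVRM Xid line (illustrative format only).

    Writing such a line to /dev/kmsg requires root, which is why the
    agent runs as a privileged DaemonSet with /dev mounted.
    """
    return f"NVRM: Xid (PCI:{pci_bdf}): {xid}, {detail}"


# How the agent might emit it (privileged container only; not run here):
# with open("/dev/kmsg", "w") as kmsg:
#     kmsg.write(format_fake_xid("0000:3b:00", 79))
```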

Deployment Testing Framework

The deploy/ directory contains an end-to-end testing framework.

Test Phases

Tests run through three phases:

| Phase | Description |
|-------|-------------|
| STANDARD | Baseline performance under normal conditions |
| OVERFLOW | System behavior during fault/overload |
| RECOVERY | System recovery after fault resolution |
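When analyzing results, each request has to be attributed to one of the three phases based on when it ran relative to the injected fault. A minimal sketch of that bucketing, assuming the fault's start and end timestamps are known (the function and argument names are illustrative, not the framework's API):

```python
def classify_phase(ts, fault_start, fault_end):
    """Map a request timestamp to the test phase it ran in."""
    if ts < fault_start:
        return "STANDARD"   # before the fault: baseline
    if ts < fault_end:
        return "OVERFLOW"   # fault active: degraded behavior expected
    return "RECOVERY"       # fault resolved: system should recover
```

Segmenting per-request metrics this way lets success rate and latency be reported separately per phase, which is what makes a regression in recovery behavior visible.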

Scenario Configuration

Define test scenarios in scenarios.py:

```python
from tests.fault_tolerance.deploy.scenarios import Scenario, Load, Failure

scenario = Scenario(
    name="worker_failure_migration",
    backend="vllm",
    load=Load(
        clients=10,
        requests_per_client=100,
        max_tokens=256,
    ),
    failure=Failure(
        type="pod_kill",
        target="vllm-worker-0",
        trigger_after_requests=50,
    ),
)
```

Running Deployment Tests

```bash
# Run all deployment tests
pytest tests/fault_tolerance/deploy/test_deployment.py -v

# Run specific scenario
pytest tests/fault_tolerance/deploy/test_deployment.py::test_worker_failure -v
```

Validation Checkers

The framework includes pluggable validators:

```python
from tests.fault_tolerance.deploy.base_checker import BaseChecker, ValidationContext

class MigrationChecker(BaseChecker):
    def check(self, context: ValidationContext) -> bool:
        # Verify migrations occurred
        migrations = context.metrics.get("migrations_total", 0)
        return migrations > 0
```

Results Parsing

Parse test results for analysis:

```python
from tests.fault_tolerance.deploy.parse_results import process_overflow_recovery_test

results = process_overflow_recovery_test(log_dir="/path/to/logs")
print(f"Success rate: {results['success_rate']}")
print(f"P99 latency: {results['p99_latency_ms']}ms")
```

Client Utilities

The client.py module provides shared client functionality:

Multi-Threaded Load Generation

```python
from tests.fault_tolerance.client import client

# Generate load with multiple clients
results = client(
    base_url="http://localhost:8000",
    num_clients=10,
    requests_per_client=100,
    model="Qwen/Qwen3-0.6B",
    max_tokens=256,
    log_dir="/tmp/test_logs",
)
```

Request Options

| Parameter | Description |
|-----------|-------------|
| `base_url` | Frontend URL |
| `num_clients` | Number of concurrent clients |
| `requests_per_client` | Requests per client |
| `model` | Model name |
| `max_tokens` | Max tokens per request |
| `log_dir` | Directory for client logs |
| `endpoint` | `completions` or `chat/completions` |

Running the Full Test Suite

Prerequisites

  1. Kubernetes cluster with GPU nodes
  2. Dynamo deployment
  3. etcd cluster (for HA tests)
  4. Fault injection service (for hardware tests)

Environment Setup

```bash
export KUBECONFIG=/path/to/kubeconfig
export DYNAMO_NAMESPACE=dynamo-test
export FRONTEND_URL=http://localhost:8000
```

Run All Tests

```bash
# Install test dependencies
pip install pytest pytest-asyncio

# Run all fault tolerance tests
pytest tests/fault_tolerance/ -v --tb=short

# Run with specific markers
pytest tests/fault_tolerance/ -v -m "not slow"
```

Test Markers

| Marker | Description |
|--------|-------------|
| `slow` | Long-running tests (> 5 minutes) |
| `gpu` | Requires GPU resources |
| `k8s` | Requires Kubernetes cluster |
| `etcd_ha` | Requires multi-node etcd |

Best Practices

1. Isolate Test Environments

Run fault tolerance tests in dedicated namespaces:

```bash
kubectl create namespace dynamo-fault-test
```

2. Clean Up After Tests

Ensure fault injection is recovered:

```bash
# List and recover all active faults
curl http://localhost:8080/api/v1/faults | jq -r '.[].id' | \
  xargs -I {} curl -X POST http://localhost:8080/api/v1/faults/{}/recover
```

3. Collect Logs

Preserve logs for debugging:

```bash
pytest tests/fault_tolerance/ -v \
  --log-dir=/tmp/fault_test_logs \
  --capture=no
```

4. Monitor During Tests

Watch system state during tests:

```bash
# Terminal 1: Watch pods
watch kubectl get pods -n dynamo-test

# Terminal 2: Watch metrics
watch 'curl -s localhost:8000/metrics | grep -E "(migration|rejection)"'
```