Fault Tolerance Testing


This document describes the test infrastructure for validating Dynamo's fault tolerance mechanisms. The testing framework covers request cancellation, request migration, etcd high availability (HA), and hardware fault injection scenarios.

Overview

Dynamo’s fault tolerance test suite is located in tests/fault_tolerance/ and includes:

| Test Category | Location | Purpose |
|---------------|----------|---------|
| Cancellation | `cancellation/` | Request cancellation during in-flight operations |
| Migration | `migration/` | Request migration when workers fail |
| etcd HA | `etcd_ha/` | etcd failover and recovery |
| Hardware | `hardware/` | GPU and network fault injection |
| Deployment | `deploy/` | End-to-end deployment testing |

Test Directory Structure

```text
tests/fault_tolerance/
├── cancellation/
│   ├── test_vllm.py
│   ├── test_trtllm.py
│   ├── test_sglang.py
│   └── utils.py
├── migration/
│   ├── test_vllm.py
│   ├── test_trtllm.py
│   ├── test_sglang.py
│   └── utils.py
├── etcd_ha/
│   ├── test_vllm.py
│   ├── test_trtllm.py
│   ├── test_sglang.py
│   └── utils.py
├── hardware/
│   └── fault_injection_service/
│       ├── api_service/
│       └── agents/
├── deploy/
│   ├── test_deployment.py
│   ├── scenarios.py
│   ├── base_checker.py
│   └── ...
└── client.py
```

Request Cancellation Tests

Test that in-flight requests can be properly canceled.

Running Cancellation Tests

```bash
# Run all cancellation tests
pytest tests/fault_tolerance/cancellation/ -v

# Run for specific backend
pytest tests/fault_tolerance/cancellation/test_vllm.py -v
```

Cancellation Test Utilities

The cancellation/utils.py module provides:

CancellableRequest

Thread-safe request cancellation via TCP socket manipulation:

```python
import time
from threading import Thread

from tests.fault_tolerance.cancellation.utils import CancellableRequest

request = CancellableRequest()

# Send request in separate thread
thread = Thread(target=send_request, args=(request,))
thread.start()

# Cancel after some time
time.sleep(1)
request.cancel()  # Closes underlying socket
```

send_completion_request / send_chat_completion_request

Send cancellable completion requests:

```python
from tests.fault_tolerance.cancellation.utils import (
    send_completion_request,
    send_chat_completion_request,
)

# Non-streaming
response = send_completion_request(
    base_url="http://localhost:8000",
    model="Qwen/Qwen3-0.6B",
    prompt="Hello, world!",
    max_tokens=100,
)

# Streaming with cancellation
responses = send_chat_completion_request(
    base_url="http://localhost:8000",
    model="Qwen/Qwen3-0.6B",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
    cancellable_request=request,
)
```

poll_for_pattern

Wait for specific patterns in logs:

```python
from tests.fault_tolerance.cancellation.utils import poll_for_pattern

# Wait for cancellation confirmation
found = poll_for_pattern(
    log_file="/var/log/dynamo/worker.log",
    pattern="Request cancelled",
    timeout=30,
    interval=0.5,
)
```

Migration Tests

Test that requests migrate to healthy workers when failures occur.

Running Migration Tests

```bash
# Run all migration tests
pytest tests/fault_tolerance/migration/ -v

# Run for specific backend
pytest tests/fault_tolerance/migration/test_vllm.py -v
```

Migration Test Utilities

The migration/utils.py module provides:

  • Frontend wrapper with configurable request planes
  • Long-running request spawning for migration scenarios
  • Health check disabling for controlled testing

Example Migration Test

```python
def test_migration_on_worker_failure():
    # Start deployment with 2 workers
    deployment = start_deployment(workers=2)

    # Send long-running request
    request_thread = spawn_long_request(max_tokens=1000)

    # Kill one worker mid-generation
    kill_worker(deployment.workers[0])

    # Verify request completes on remaining worker
    response = request_thread.join()
    assert response.status_code == 200
    assert len(response.tokens) > 0
```

etcd HA Tests

Test system behavior during etcd failures and recovery.

Running etcd HA Tests

```bash
pytest tests/fault_tolerance/etcd_ha/ -v
```

Test Scenarios

  • Leader failover: etcd leader node fails, cluster elects new leader
  • Network partition: etcd node becomes unreachable
  • Recovery: System recovers after etcd becomes available
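The recovery scenario boils down to polling: after etcd comes back, the test repeatedly probes the system until requests succeed again or a deadline passes. A minimal, stdlib-only sketch of such a polling helper is below; the function name `wait_for_recovery` is illustrative and not part of the test suite.

```python
import time


def wait_for_recovery(check, timeout=60.0, interval=0.5):
    """Poll `check()` until it returns True or `timeout` seconds elapse.

    In an etcd HA test, `check` might issue a completion request against
    the frontend and return True once it gets a 200 response.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(interval)
    return False
```

A real test would pass a `check` callable that exercises the frontend; the helper keeps the timeout logic in one place so failures produce a clear "did not recover within N seconds" signal.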

Hardware Fault Injection

The fault injection service enables testing under simulated hardware failures.

Fault Injection Service

Located at tests/fault_tolerance/hardware/fault_injection_service/, this FastAPI service orchestrates fault injection:

```bash
# Start the fault injection service
cd tests/fault_tolerance/hardware/fault_injection_service
python -m api_service.main
```

Supported Fault Types

GPU Faults

| Fault Type | Description |
|------------|-------------|
| XID_ERROR | Simulate GPU XID error (various codes) |
| THROTTLE | GPU thermal throttling |
| MEMORY_PRESSURE | GPU memory exhaustion |
| OVERHEAT | GPU overheating condition |
| COMPUTE_OVERLOAD | GPU compute saturation |

Network Faults

| Fault Type | Description |
|------------|-------------|
| FRONTEND_WORKER | Partition between frontend and workers |
| WORKER_NATS | Partition between workers and NATS |
| WORKER_WORKER | Partition between workers |
| CUSTOM | Custom network partition |

Fault Injection API

Inject GPU Fault

```bash
curl -X POST http://localhost:8080/api/v1/faults/gpu/inject \
  -H "Content-Type: application/json" \
  -d '{
    "target_pod": "vllm-worker-0",
    "fault_type": "XID_ERROR",
    "severity": "HIGH"
  }'
```

Inject Specific XID Error

```bash
# Inject XID 79 (GPU memory page fault)
curl -X POST http://localhost:8080/api/v1/faults/gpu/inject/xid-79 \
  -H "Content-Type: application/json" \
  -d '{"target_pod": "vllm-worker-0"}'
```

Supported XID codes: 43, 48, 74, 79, 94, 95, 119, 120
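When driving these endpoints from test code rather than curl, it helps to validate the XID code before issuing the request. The helper below is an illustrative sketch, not part of the service's client library; it only builds the request path and JSON body shown in the curl examples above.

```python
import json

# XID codes the fault injection service accepts (from the docs above)
SUPPORTED_XIDS = {43, 48, 74, 79, 94, 95, 119, 120}


def build_xid_injection(target_pod, xid):
    """Return (path, body) for the XID-specific injection endpoint,
    rejecting unsupported XID codes up front."""
    if xid not in SUPPORTED_XIDS:
        raise ValueError(f"unsupported XID code: {xid}")
    path = f"/api/v1/faults/gpu/inject/xid-{xid}"
    body = json.dumps({"target_pod": target_pod})
    return path, body
```

A test could POST the returned body to the returned path with any HTTP client; failing fast on an unsupported code keeps the error local to the test instead of surfacing as a confusing 4xx from the service.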

Inject Network Partition

```bash
curl -X POST http://localhost:8080/api/v1/faults/network/inject \
  -H "Content-Type: application/json" \
  -d '{
    "partition_type": "FRONTEND_WORKER",
    "duration_seconds": 30
  }'
```

Recover from Fault

```bash
curl -X POST http://localhost:8080/api/v1/faults/{fault_id}/recover
```

List Active Faults

```bash
curl http://localhost:8080/api/v1/faults
```

GPU Fault Injector Agent

The GPU fault injector runs as a DaemonSet on worker nodes:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gpu-fault-injector
spec:
  selector:
    matchLabels:
      app: gpu-fault-injector
  template:
    metadata:
      labels:
        app: gpu-fault-injector
    spec:
      containers:
        - name: agent
          image: dynamo/gpu-fault-injector:latest
          securityContext:
            privileged: true
          volumeMounts:
            - name: dev
              mountPath: /dev
      volumes:
        - name: dev
          hostPath:
            path: /dev
```

The agent injects fake XID messages via /dev/kmsg to trigger NVSentinel detection.
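The lines the agent writes mimic the `NVRM: Xid ...` messages the NVIDIA driver emits to the kernel log, so log-watching detectors treat them as real faults. The sketch below is illustrative only: the exact message format real drivers emit varies by driver version and fault, and the PCI address and detail text here are made up.

```python
def format_fake_xid(pci_bdf, xid, detail="simulated fault"):
    """Format a dmesg-style NVRM Xid line (illustrative format only).

    Writing such a line to /dev/kmsg requires root, which is why the
    agent runs as a privileged DaemonSet with /dev mounted.
    """
    return f"NVRM: Xid (PCI:{pci_bdf}): {xid}, {detail}"


# How the agent might emit it (privileged container only; not run here):
# with open("/dev/kmsg", "w") as kmsg:
#     kmsg.write(format_fake_xid("0000:3b:00", 79))
```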

Deployment Testing Framework

The deploy/ directory contains an end-to-end testing framework.

Test Phases

Tests run through three phases:

| Phase | Description |
|-------|-------------|
| STANDARD | Baseline performance under normal conditions |
| OVERFLOW | System behavior during fault/overload |
| RECOVERY | System recovery after fault resolution |
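When analyzing results, each request has to be attributed to one of the three phases based on when it ran relative to the injected fault. A minimal sketch of that bucketing, assuming the fault's start and end timestamps are known (the function and argument names are illustrative, not the framework's API):

```python
def classify_phase(ts, fault_start, fault_end):
    """Map a request timestamp to the test phase it ran in."""
    if ts < fault_start:
        return "STANDARD"   # before the fault: baseline
    if ts < fault_end:
        return "OVERFLOW"   # fault active: degraded behavior expected
    return "RECOVERY"       # fault resolved: system should recover
```

Segmenting per-request metrics this way lets success rate and latency be reported separately per phase, which is what makes a regression in recovery behavior visible.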

Scenario Configuration

Define test scenarios in scenarios.py:

```python
from tests.fault_tolerance.deploy.scenarios import Scenario, Load, Failure

scenario = Scenario(
    name="worker_failure_migration",
    backend="vllm",
    load=Load(
        clients=10,
        requests_per_client=100,
        max_tokens=256,
    ),
    failure=Failure(
        type="pod_kill",
        target="vllm-worker-0",
        trigger_after_requests=50,
    ),
)
```

Running Deployment Tests

```bash
# Run all deployment tests
pytest tests/fault_tolerance/deploy/test_deployment.py -v

# Run specific scenario
pytest tests/fault_tolerance/deploy/test_deployment.py::test_worker_failure -v
```

Validation Checkers

The framework includes pluggable validators:

```python
from tests.fault_tolerance.deploy.base_checker import BaseChecker, ValidationContext

class MigrationChecker(BaseChecker):
    def check(self, context: ValidationContext) -> bool:
        # Verify migrations occurred
        migrations = context.metrics.get("migrations_total", 0)
        return migrations > 0
```

Results Parsing

Parse test results for analysis:

```python
from tests.fault_tolerance.deploy.parse_results import process_overflow_recovery_test

results = process_overflow_recovery_test(log_dir="/path/to/logs")
print(f"Success rate: {results['success_rate']}")
print(f"P99 latency: {results['p99_latency_ms']}ms")
```

Client Utilities

The client.py module provides shared client functionality:

Multi-Threaded Load Generation

```python
from tests.fault_tolerance.client import client

# Generate load with multiple clients
results = client(
    base_url="http://localhost:8000",
    num_clients=10,
    requests_per_client=100,
    model="Qwen/Qwen3-0.6B",
    max_tokens=256,
    log_dir="/tmp/test_logs",
)
```

Request Options

| Parameter | Description |
|-----------|-------------|
| `base_url` | Frontend URL |
| `num_clients` | Number of concurrent clients |
| `requests_per_client` | Requests per client |
| `model` | Model name |
| `max_tokens` | Max tokens per request |
| `log_dir` | Directory for client logs |
| `endpoint` | `completions` or `chat/completions` |

Running the Full Test Suite

Prerequisites

  1. Kubernetes cluster with GPU nodes
  2. Dynamo deployment
  3. etcd cluster (for HA tests)
  4. Fault injection service (for hardware tests)

Environment Setup

```bash
export KUBECONFIG=/path/to/kubeconfig
export DYNAMO_NAMESPACE=dynamo-test
export FRONTEND_URL=http://localhost:8000
```

Run All Tests

```bash
# Install test dependencies
pip install pytest pytest-asyncio

# Run all fault tolerance tests
pytest tests/fault_tolerance/ -v --tb=short

# Run with specific markers
pytest tests/fault_tolerance/ -v -m "not slow"
```

Test Markers

| Marker | Description |
|--------|-------------|
| `slow` | Long-running tests (> 5 minutes) |
| `gpu` | Requires GPU resources |
| `k8s` | Requires Kubernetes cluster |
| `etcd_ha` | Requires multi-node etcd |

Best Practices

1. Isolate Test Environments

Run fault tolerance tests in dedicated namespaces:

```bash
kubectl create namespace dynamo-fault-test
```

2. Clean Up After Tests

Ensure fault injection is recovered:

```bash
# List and recover all active faults
curl http://localhost:8080/api/v1/faults | jq -r '.[].id' | \
  xargs -I {} curl -X POST http://localhost:8080/api/v1/faults/{}/recover
```

3. Collect Logs

Preserve logs for debugging:

```bash
pytest tests/fault_tolerance/ -v \
  --log-dir=/tmp/fault_test_logs \
  --capture=no
```

4. Monitor During Tests

Watch system state during tests:

```bash
# Terminal 1: Watch pods
watch kubectl get pods -n dynamo-test

# Terminal 2: Watch metrics
watch 'curl -s localhost:8000/metrics | grep -E "(migration|rejection)"'
```