---
title: Dynamo KV Smart Router A/B Benchmarking Guide
---
This guide walks you through setting up and running A/B benchmarks to compare Dynamo's KV Smart Router against standard round-robin routing on a Kubernetes cluster.
## Overview
Dynamo's KV Smart Router intelligently routes requests based on KV cache affinity, improving performance for workloads with shared prompt prefixes. This guide helps you:
1. Deploy two identical Dynamo configurations:
   a. A vLLM server for Qwen3-32B with 8 workers (aggregated) **WITHOUT** the KV Smart Router enabled
   b. A vLLM server for Qwen3-32B with 8 workers (aggregated) **WITH** the KV Smart Router enabled
2. Run controlled benchmarks using AIPerf
3. Compare performance metrics to evaluate KV router effectiveness
---
## Prerequisites
### Required Tools
- `kubectl` (configured with cluster access)
- `helm` (v3+)
- HuggingFace account and token (if model downloads are gated)
- Kubernetes cluster with:
  - GPU nodes (H100, H200, or similar)
  - Sufficient GPU capacity (16+ GPUs recommended for this example)
- Dynamo platform installed globally OR ability to install per-namespace
### Knowledge Requirements
- Basic Kubernetes concepts (namespaces, pods, services)
- Familiarity with LLM inference concepts
- Command-line proficiency
---
## Architecture
This guide sets up two parallel deployments, as well as a benchmarking pod that can test each deployment:
```text
┌─────────────────────────────────────┐
│  Deployment A: Router OFF           │
│  Namespace: router-off-test         │
│  ├─ Frontend (Standard Routing)     │
│  └─ 8x Decode Workers (1 GPU each)  │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│  Deployment B: Router ON            │
│  Namespace: router-on-test          │
│  ├─ Frontend (KV Smart Router)      │
│  └─ 8x Decode Workers (1 GPU each)  │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│  Benchmark Pod                      │
│  Namespace: benchmark               │
│  └─ AIPerf + Dataset                │
└─────────────────────────────────────┘
```
**Key Difference:** Deployment B sets `DYN_ROUTER_MODE=kv` on the frontend to enable KV cache-aware routing.
---
## Phase 1: Namespace and Infrastructure Setup
### Step 1.1: Create Namespaces
```bash
# Create namespaces for both deployments
kubectl create namespace router-off-test
kubectl create namespace router-on-test
kubectl create namespace benchmark
```
### Step 1.2: Create HuggingFace Token Secret (optional)
If the model you plan to deploy requires a HuggingFace token to download (Llama-family models do), replace `YOUR_HF_TOKEN` with your actual HuggingFace token:
```bash
# Router-OFF namespace
kubectl create secret generic hf-token-secret \
--from-literal=HF_TOKEN="YOUR_HF_TOKEN" \
-n router-off-test
# Router-ON namespace
kubectl create secret generic hf-token-secret \
--from-literal=HF_TOKEN="YOUR_HF_TOKEN" \
-n router-on-test
```
### Step 1.3: Install Dynamo Platform (Per-Namespace)
If your cluster uses namespace-restricted Dynamo operators, you'll need to install the Dynamo platform in each namespace. Follow the [Dynamo Kubernetes Installation Guide](https://github.com/ai-dynamo/dynamo/blob/v0.8.0/docs/kubernetes/installation-guide.md) to install the platform in both namespaces:
- `router-off-test`
- `router-on-test`
**Key Configuration Notes:**
- If your cluster uses namespace restrictions, ensure `dynamo-operator.namespaceRestriction.enabled=true` is set during installation
- Adjust version tags to match your cluster's available Dynamo versions
- If you encounter operator compatibility issues (e.g., unsupported MPI arguments), consult your cluster administrator or the Dynamo troubleshooting documentation
### Step 1.4: Verify Infrastructure
Wait for operators and infrastructure to be ready:
```bash
# Check router-off-test
kubectl get pods -n router-off-test
# Check router-on-test
kubectl get pods -n router-on-test
```
You should see:
- `dynamo-platform-dynamo-operator-controller-manager` (2/2 Running)
- `dynamo-platform-etcd-0` (1/1 Running)
- `dynamo-platform-nats-0` (2/2 Running)
---
## Phase 2: Deploy Model Serving
### Step 2.1: Create Deployment YAMLs
Create `router-off-deployment.yaml`:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: vllm-agg-no-router
spec:
  services:
    Frontend:
      dynamoNamespace: vllm-agg-no-router
      componentType: frontend
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.5.0
    VllmDecodeWorker:
      envFromSecret: hf-token-secret
      dynamoNamespace: vllm-agg-no-router
      componentType: worker
      replicas: 8
      resources:
        limits:
          gpu: "1"
      extraPodSpec:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
                - matchExpressions:
                    - key: node.kubernetes.io/instance-type
                      operator: In
                      values:
                        - gpu-h200-sxm  # Adjust to your GPU node type
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.5.0
          workingDir: /workspace/examples/backends/vllm
          command:
            - /bin/sh
            - -c
          args:
            - python3 -m dynamo.vllm --model Qwen/Qwen3-32B --quantization fp8
          startupProbe:
            httpGet:
              path: /health
              port: 9090
            initialDelaySeconds: 120
            periodSeconds: 30
            timeoutSeconds: 10
            failureThreshold: 60  # 32 minutes total (120s + 60*30s)
          livenessProbe:
            httpGet:
              path: /live
              port: 9090
            initialDelaySeconds: 300
            periodSeconds: 30
            timeoutSeconds: 10
            failureThreshold: 10
          readinessProbe:
            httpGet:
              path: /live
              port: 9090
            initialDelaySeconds: 300
            periodSeconds: 30
            timeoutSeconds: 10
            failureThreshold: 10
```
Create `router-on-deployment.yaml`:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: vllm-agg-router
spec:
  services:
    Frontend:
      dynamoNamespace: vllm-agg-router
      componentType: frontend
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.5.0
      envs:
        - name: DYN_ROUTER_MODE
          value: kv  # KEY DIFFERENCE: Enable KV Smart Router
    VllmDecodeWorker:
      envFromSecret: hf-token-secret
      dynamoNamespace: vllm-agg-router
      componentType: worker
      replicas: 8
      resources:
        limits:
          gpu: "1"
      extraPodSpec:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
                - matchExpressions:
                    - key: node.kubernetes.io/instance-type
                      operator: In
                      values:
                        - gpu-h200-sxm  # Adjust to your GPU node type
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.5.0
          workingDir: /workspace/examples/backends/vllm
          command:
            - /bin/sh
            - -c
          args:
            - python3 -m dynamo.vllm --model Qwen/Qwen3-32B --quantization fp8
          startupProbe:
            httpGet:
              path: /health
              port: 9090
            initialDelaySeconds: 120
            periodSeconds: 30
            timeoutSeconds: 10
            failureThreshold: 60  # 32 minutes total (120s + 60*30s)
          livenessProbe:
            httpGet:
              path: /live
              port: 9090
            initialDelaySeconds: 300
            periodSeconds: 30
            timeoutSeconds: 10
            failureThreshold: 10
          readinessProbe:
            httpGet:
              path: /live
              port: 9090
            initialDelaySeconds: 300
            periodSeconds: 30
            timeoutSeconds: 10
            failureThreshold: 10
```
### Step 2.2: Deploy Both Configurations
```bash
# Deploy router-OFF
kubectl apply -f router-off-deployment.yaml -n router-off-test
# Deploy router-ON
kubectl apply -f router-on-deployment.yaml -n router-on-test
```
**💡 Optimization Tip:** Each worker will download the model independently (~20 minutes per pod). For faster initialization, add a shared PVC with `ReadWriteMany` access mode to cache the model.
First, create the PVC separately:
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: "your-shared-storage-class"  # e.g., nfs, efs, nebius-shared-fs
  resources:
    requests:
      storage: 100Gi
```
Then reference it in your DynamoGraphDeployment:
```yaml
spec:
  pvcs:
    - create: false
      name: model-cache
      size: "0"
  services:
    VllmDecodeWorker:
      volumeMounts:
        - mountPoint: /root/.cache/huggingface
          name: model-cache
          useAsCompilationCache: false
```
With this configuration, only the first worker downloads the model; others use the cached version, reducing startup time from 20+ minutes to ~2 minutes per pod.
### Step 2.3: Monitor Deployment Progress
```bash
# Watch router-OFF pods
kubectl get pods -n router-off-test -w
# Watch router-ON pods
kubectl get pods -n router-on-test -w
```
Wait for all pods to reach `Running` status and pass readiness probes.
**Expected Timeline:**
- **With shared PVC** (ReadWriteMany): ~5-10 minutes total (first worker downloads, others reuse cache)
- **Without shared PVC**: 20-30 minutes per worker (workers download independently)
- For 8 workers: Budget **1-2 hours** for full deployment (workers start in parallel but are limited by node scheduling)
The startup probe allows 32 minutes per pod (failureThreshold: 60), which accommodates model download and initialization.
### Step 2.4: Verify All Workers Are Healthy
> ⚠️ **CRITICAL CHECKPOINT**: Before running benchmarks, you **MUST** verify equal worker health in both deployments. Unequal worker counts will invalidate your comparison results.
```bash
# Quick health check - both should show "8/8"
echo "Router OFF: $(kubectl get pods -n router-off-test -l nvidia.com/dynamo-component-type=worker --field-selector=status.phase=Running -o json | jq '[.items[] | select(.status.conditions[] | select(.type=="Ready" and .status=="True"))] | length')/8 ready"
echo "Router ON: $(kubectl get pods -n router-on-test -l nvidia.com/dynamo-component-type=worker --field-selector=status.phase=Running -o json | jq '[.items[] | select(.status.conditions[] | select(.type=="Ready" and .status=="True"))] | length')/8 ready"
# Detailed view
kubectl get pods -n router-off-test -l nvidia.com/dynamo-component-type=worker
kubectl get pods -n router-on-test -l nvidia.com/dynamo-component-type=worker
```
**Both must show 8/8 workers in Ready state (1/1 Running).** If workers are not ready:
- Check logs: `kubectl logs <worker-pod-name> -n <namespace>`
- Common issues: model download in progress, startup probe timeout, insufficient GPU resources
**Do not proceed with benchmarks until all 16 workers (8 per deployment) are healthy.**
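If you prefer to script this gate rather than re-running the checks by hand, a minimal Python sketch like the one below can poll both namespaces until each reports 8 Ready workers. It assumes `kubectl` is already configured for the cluster and reuses the worker label selector shown above; the two-hour ceiling is an arbitrary choice you can adjust.

```python
# Optional readiness gate: poll until both deployments report 8 Ready workers.
# Assumes kubectl is configured and uses the same worker label selector as above.
import json
import subprocess
import time

NAMESPACES = ["router-off-test", "router-on-test"]
SELECTOR = "nvidia.com/dynamo-component-type=worker"
EXPECTED = 8


def ready_workers(namespace: str) -> int:
    """Count worker pods whose Ready condition is True."""
    out = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-l", SELECTOR, "-o", "json"],
        capture_output=True, text=True, check=True,
    ).stdout
    pods = json.loads(out)["items"]
    return sum(
        any(c.get("type") == "Ready" and c.get("status") == "True"
            for c in pod.get("status", {}).get("conditions", []))
        for pod in pods
    )


deadline = time.time() + 2 * 60 * 60  # allow up to 2 hours for slow model downloads
while time.time() < deadline:
    counts = {ns: ready_workers(ns) for ns in NAMESPACES}
    print(counts)
    if all(count == EXPECTED for count in counts.values()):
        print("Both deployments report 8/8 Ready workers - safe to benchmark.")
        break
    time.sleep(60)
else:
    raise SystemExit("Timed out waiting for all workers to become Ready.")
```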
---
## Phase 3: Prepare Benchmark Dataset
### Understanding the Mooncake Trace Dataset
For this A/B comparison, we use the **Mooncake Trace Dataset**, published by [Mooncake AI](https://github.com/kvcache-ai/Mooncake). This is a privacy-preserving dataset of real-world LLM inference traffic from production arXiv workloads.
**What's in the dataset?** Each trace entry contains:
- **Timestamp:** When the request arrived (for realistic request timing)
- **Input/output lengths:** Number of tokens in prompts and responses
- **Block hash IDs:** Cryptographic hashes representing KV cache blocks (explained below)
**Sample trace entry:**
```json
{
  "timestamp": 27482,
  "input_length": 6955,
  "output_length": 52,
  "hash_ids": [46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 2353, 2354]
}
```
### Why Mooncake Traces Matter for KV Cache Benchmarking
**The Challenge:** Traditional LLM benchmarks use synthetic or random data, which is often insufficient to capture the effect of real-world optimizations like the KV Smart Router. To properly evaluate this feature, we need realistic traffic patterns with **prefix repetition** - but this creates a privacy problem: how do we measure realistic KV cache hit patterns without exposing actual user conversations?
**Mooncake's Solution: Privacy-Preserving Block Hashes**
Instead of storing actual prompt text, the Mooncake dataset uses cryptographic hashes to represent KV cache blocks. Each hash ID represents a **512-token block**, and the hash includes both the current block and all preceding blocks. This preserves the **pattern of prefix reuse** while completely protecting user privacy.
### How it works - Multi-turn conversation example
```text
Turn 1 (initial request - long document analysis):
  Input:    ~8,000 tokens (e.g., research paper + question)
  Hash IDs: [46][47][48][49][50][51][52][53][54][55][56][57][58][59][60][61]
            └─ 16 blocks × 512 tokens/block = ~8,192 tokens

Turn 2 (follow-up question on same document):
  Input:    Same document + new question (~8,500 tokens)
  Hash IDs: [46][47][48][49][50][51][52][53][54][55][56][57][58][59][60][61][62]
            └──────────── Reuses first 16 blocks (~8,192 tokens) ────────────┘
  ✅ Cache hit: First 8,192 tokens don't need recomputation!

Turn 3 (another follow-up):
  Input:    Same document + different question (~9,000 tokens)
  Hash IDs: [46][47][48][49][50][51][52][53][54][55][56][57][58][59][60][61][62][63]
            └──────────── Reuses first 16 blocks (~8,192 tokens) ────────────┘
```
When requests share the same hash IDs (e.g., blocks 46-61), it means they share those 512-token blocks - indicating **significant prefix overlap** (in this case, 8,192 tokens). The **KV Smart Router** routes requests with matching hash IDs to the same worker, maximizing cache hits and avoiding redundant computation for those shared prefix tokens.
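To make the block-hash mechanics concrete, here is a small illustrative Python sketch (not part of any Dynamo or Mooncake tooling) that measures the shared prefix between two trace entries the way a cache-aware router can: by counting leading hash IDs in common. The 512-token block size and the example hash IDs come from the trace format described above; the function name is ours.

```python
# Illustrative sketch: estimate prefix overlap between two Mooncake trace entries.
# Each hash ID stands for one 512-token KV cache block, as described above.
BLOCK_TOKENS = 512


def shared_prefix_blocks(hash_ids_a: list[int], hash_ids_b: list[int]) -> int:
    """Count how many leading KV cache blocks two requests have in common."""
    shared = 0
    for a, b in zip(hash_ids_a, hash_ids_b):
        if a != b:
            break
        shared += 1
    return shared


# Turn 1 and Turn 2 from the multi-turn example above.
turn_1 = list(range(46, 62))   # blocks 46..61, ~8,192 input tokens
turn_2 = list(range(46, 63))   # same document plus a new question (adds block 62)

blocks = shared_prefix_blocks(turn_1, turn_2)
print(f"Shared prefix: {blocks} blocks ≈ {blocks * BLOCK_TOKENS} tokens")
# A KV cache-aware router sends both requests to the same worker,
# so those ~8,192 prefix tokens are not recomputed.
```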
**Key Dataset Properties:**
- ✅ **Realistic timing:** Request arrival patterns from production workloads
- ✅ **Real prefix patterns:** Up to 50% cache hit ratio ([Mooncake technical report](https://github.com/kvcache-ai/Mooncake))
- ✅ **Privacy-preserving:** No actual text - only hash-based cache block identifiers
- ✅ **Reproducible:** Public dataset enables fair comparisons across different systems
**Why this matters:** With random synthetic data, the KV Smart Router would show no benefit because there's no prefix reuse to exploit. Mooncake traces provide realistic workload patterns that demonstrate the router's real-world performance gains while respecting user privacy.
---
### Download and Prepare the Dataset
```bash
# Download the Mooncake arxiv trace dataset
curl -sL https://raw.githubusercontent.com/kvcache-ai/Mooncake/refs/heads/main/FAST25-release/arxiv-trace/mooncake_trace.jsonl -o mooncake_trace.jsonl
# Trim to 1000 requests for faster benchmarking
head -n 1000 mooncake_trace.jsonl > mooncake_trace_small.jsonl
# Speed up timestamps 4x (reduces benchmark time from ~12 min to ~3 min)
python3 - <<'PY'
import json
with open("mooncake_trace_small.jsonl") as src, open("mooncake_trace_4x.jsonl", "w") as dst:
    for line in src:
        rec = json.loads(line)
        rec["timestamp"] = int(rec["timestamp"] / 4)
        dst.write(json.dumps(rec) + "\n")
PY
echo "Dataset ready: mooncake_trace_4x.jsonl (1000 requests, 4x speed)"
```
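Optionally, you can sanity-check the prepared trace before copying it into the cluster. The short Python sketch below relies only on the `timestamp`, `input_length`, and `output_length` fields shown in the sample entry earlier, and assumes timestamps are in milliseconds (as the sample suggests).

```python
# Optional sanity check of mooncake_trace_4x.jsonl (field names per the sample entry above).
import json

with open("mooncake_trace_4x.jsonl") as f:
    records = [json.loads(line) for line in f]

n = len(records)
avg_in = sum(r["input_length"] for r in records) / n
avg_out = sum(r["output_length"] for r in records) / n
# Assumption: timestamps are milliseconds from the start of the trace.
span_min = (max(r["timestamp"] for r in records) - min(r["timestamp"] for r in records)) / 1000 / 60

print(f"{n} requests, avg input {avg_in:.0f} tokens, avg output {avg_out:.0f} tokens")
print(f"Replay window at 4x speed: ~{span_min:.1f} minutes")
```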
---
## Phase 4: Set Up Benchmark Environment
### Step 4.1: Deploy Benchmark Pod
Create `benchmark-job.yaml`:
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: aiperf-benchmark
  namespace: benchmark
spec:
  backoffLimit: 1
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: benchmark
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.5.0
          command: ["/bin/sh", "-c", "sleep infinity"]
          imagePullPolicy: IfNotPresent
          resources:
            limits:
              nvidia.com/gpu: 0
```
Deploy:
```bash
kubectl apply -f benchmark-job.yaml
```
Wait for pod to be ready:
```bash
kubectl get pods -n benchmark
```
### Step 4.2: Copy Dataset to Benchmark Pod
```bash
POD_NAME=$(kubectl get pods -n benchmark -l job-name=aiperf-benchmark -o jsonpath='{.items[0].metadata.name}')
kubectl -n benchmark cp mooncake_trace_4x.jsonl ${POD_NAME}:/tmp/mooncake_trace_4x.jsonl
```
### Step 4.3: Install AIPerf
```bash
kubectl -n benchmark exec ${POD_NAME} -- bash -lc '. /opt/dynamo/venv/bin/activate && pip install -q aiperf'
```
---
## Phase 5: Run Benchmarks
### Step 5.1: Benchmark Router-OFF (Baseline)
```bash
kubectl -n benchmark exec ${POD_NAME} -- bash -lc '
. /opt/dynamo/venv/bin/activate
aiperf profile \
--model "Qwen/Qwen3-32B" \
--url "http://vllm-agg-no-router-frontend.router-off-test.svc.cluster.local:8000" \
--endpoint-type chat \
--input-file /tmp/mooncake_trace_4x.jsonl \
--custom-dataset-type mooncake_trace \
--tokenizer "Qwen/Qwen3-32B" \
--streaming \
--request-count 1000 \
--fixed-schedule \
--output-artifact-dir /tmp/router_off_results
'
```
This will take 3-5 minutes. The terminal output includes a summary table.
### Step 5.2: Benchmark Router-ON (KV Smart Router)
```bash
kubectl -n benchmark exec ${POD_NAME} -- bash -lc '
. /opt/dynamo/venv/bin/activate
aiperf profile \
--model "Qwen/Qwen3-32B" \
--url "http://vllm-agg-router-frontend.router-on-test.svc.cluster.local:8000" \
--endpoint-type chat \
--input-file /tmp/mooncake_trace_4x.jsonl \
--custom-dataset-type mooncake_trace \
--tokenizer "Qwen/Qwen3-32B" \
--streaming \
--request-count 1000 \
--fixed-schedule \
--output-artifact-dir /tmp/router_on_results
'
```
### Step 5.3: Collect Results
```bash
# Copy results to local machine
kubectl -n benchmark cp ${POD_NAME}:/tmp/router_off_results/profile_export_aiperf.csv ./router_off_results.csv
kubectl -n benchmark cp ${POD_NAME}:/tmp/router_on_results/profile_export_aiperf.csv ./router_on_results.csv
```
---
## Phase 6: Analyze Results
### Key Metrics to Compare
| Metric | Description | What to Look For |
|--------|-------------|------------------|
| **Time to First Token (TTFT)** | Latency until the first token arrives | Lower is better; the KV router can reduce TTFT when prefixes are reused |
| **Inter Token Latency (ITL)** | Average time between tokens | Lower is better; indicates generation speed |
| **Request Latency** | Total end-to-end latency | Lower is better; overall user experience |
| **Output Token Throughput** | Tokens generated per second (system-wide) | Higher is better; system efficiency |
| **Request Throughput** | Requests completed per second | Higher is better; capacity |
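Beyond eyeballing the two summary tables, you can diff the exported CSVs programmatically. The sketch below is a rough starting point: it assumes `profile_export_aiperf.csv` stores one metric per row with the metric name in the first column, which may differ between AIPerf versions, so adjust the metric names and column handling to match your actual files.

```python
# Rough side-by-side dump of the two AIPerf CSV exports.
# Assumption: one metric per row with the metric name in the first column;
# adjust to the actual layout produced by your AIPerf version.
import csv

METRICS = [
    "Time to First Token",
    "Inter Token Latency",
    "Request Latency",
    "Output Token Throughput",
    "Request Throughput",
]


def load(path: str) -> dict[str, list[str]]:
    with open(path, newline="") as f:
        return {row[0]: row[1:] for row in csv.reader(f) if row}


off = load("router_off_results.csv")
on = load("router_on_results.csv")

for name in METRICS:
    off_row = next((v for k, v in off.items() if name in k), None)
    on_row = next((v for k, v in on.items() if name in k), None)
    if off_row and on_row:
        print(f"{name}\n  router OFF: {off_row}\n  router ON:  {on_row}")
```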
### Interpreting Results
**Your Results May Vary**: The improvement from KV Smart Router depends heavily on your workload characteristics:
**Factors that increase KV router benefit:**
- **High prefix overlap** (shared system prompts, templates, document contexts)
- **Long prompts** (>2000 tokens) where caching saves significant compute
- **Multi-turn conversations** with context carryover
- **Batch workloads** with similar queries
**Factors that reduce KV router benefit:**
- **Unique prompts** with no prefix reuse
- **Short prompts** (\<1000 tokens) where routing overhead exceeds benefit
- **Evenly distributed load** where round-robin is already optimal
- **Low request rate** where cache eviction negates benefits
**Expected Performance:**
- **High prefix overlap workloads**: 20-50% TTFT improvement
- **Moderate prefix overlap**: 10-20% improvement
- **Low prefix overlap**: \<5% improvement (may not be worth enabling)
**KV Smart Router is beneficial when:**
- TTFT improvements > 20%
- No significant degradation in other metrics
- Workload demonstrates measurable prefix reuse patterns
**Standard routing is better when:**
- KV router shows \<10% improvement
- Increased latency variance is observed
- Load distribution across workers is more important than cache affinity
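As a rough decision aid, the thresholds above can be applied directly to the average TTFT from both runs. The sketch below uses the numbers from the example comparison that follows; substitute your own measurements.

```python
# Rough decision aid using the TTFT thresholds described above.
# The placeholder values match the example comparison below; replace them with your own results.
ttft_off_ms = 12_764   # baseline (router OFF), avg TTFT
ttft_on_ms = 8_012     # KV Smart Router (router ON), avg TTFT

improvement_pct = (ttft_off_ms - ttft_on_ms) / ttft_off_ms * 100
print(f"TTFT improvement: {improvement_pct:.1f}%")

if improvement_pct > 20:
    print("KV Smart Router looks clearly beneficial for this workload.")
elif improvement_pct >= 10:
    print("Moderate benefit - weigh it against latency variance and load balance.")
else:
    print("Little benefit - standard round-robin routing may be the better default.")
```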
### Example Comparison
From the terminal output, compare the summary tables:
```
Router-OFF (Baseline):
TTFT avg: 12,764 ms p99: 45,898 ms
Request Latency avg: 32,978 ms
Output Token Throughput: 1,614 tokens/sec
Request Throughput: 8.61 req/sec
Router-ON (KV Router):
TTFT avg: 8,012 ms p99: 28,644 ms (37% faster ✅)
Request Latency avg: 28,972 ms (12% faster ✅)
Output Token Throughput: 1,746 tokens/sec (8% higher ✅)
Request Throughput: 9.33 req/sec (8% higher ✅)
```
In this example with all 8 workers healthy, the **KV router significantly outperformed** the baseline:
- **37% faster TTFT** - Users see first token much sooner
- **8% higher throughput** - System processes more requests per second
- **12% lower latency** - Faster end-to-end completion
The Mooncake arXiv trace dataset has sufficient prefix overlap (long input sequences with similar patterns) to benefit from KV cache-aware routing. Workloads with explicit shared prefixes (system prompts, templates) may see even greater improvements.
---
## Phase 7: Cleanup
```bash
# Delete deployments
kubectl delete dynamographdeployment vllm-agg-no-router -n router-off-test
kubectl delete dynamographdeployment vllm-agg-router -n router-on-test
# Delete namespaces (removes all resources)
kubectl delete namespace router-off-test
kubectl delete namespace router-on-test
kubectl delete namespace benchmark
```
---
## Troubleshooting
### Issue: Pods Stuck in Pending
**Cause:** Insufficient GPU resources
**Solution:**
```bash
# Check GPU availability
kubectl describe nodes | grep -A 10 "Allocated resources"
# Reduce worker replicas if needed
kubectl edit dynamographdeployment <deployment-name> -n <namespace>
```
### Issue: ImagePullBackOff Errors
**Cause:** Version mismatch or missing credentials
**Solution:**
```bash
# Check available versions
kubectl get pods -n dynamo-system -o yaml | grep image:
# Update deployment YAML to match cluster version
```
### Issue: Operator Not Processing Deployment
**Cause:** Namespace restrictions
**Solution:**
- Ensure Dynamo platform is Helm-installed in the namespace
- Verify the operator has the `--restrictedNamespace=<namespace>` argument
- Check operator logs: `kubectl logs -n <namespace> deployment/dynamo-platform-dynamo-operator-controller-manager`
### Issue: Workers Not Becoming Ready
**Cause:** Model download failures or probe misconfiguration
**Solution:**
```bash
# Check worker logs
kubectl logs <worker-pod-name> -n <namespace>
# Common issues:
# - Invalid HuggingFace token
# - Network connectivity
# - Insufficient disk space for model
```
### Issue: Workers Restarting in CrashLoopBackOff
**Cause:** Startup probe timeout - workers are killed before finishing initialization
**Symptoms:**
- Pods show "Container main failed startup probe, will be restarted"
- Logs show model still downloading or loading when pod is killed
- Large models (>30GB) take longer than the default 22-minute timeout
**Solution:**
Increase the startup probe `failureThreshold`:
```bash
# Patch the deployment to allow 32 minutes instead of 22
kubectl patch dynamographdeployment <deployment-name> -n <namespace> --type='json' \
-p='[{"op": "replace", "path": "/spec/services/VllmDecodeWorker/extraPodSpec/mainContainer/startupProbe/failureThreshold", "value": 60}]'
```
Or update your YAML before deploying:
```yaml
startupProbe:
  httpGet:
    path: /health
    port: 9090
  initialDelaySeconds: 120
  periodSeconds: 30
  timeoutSeconds: 10
  failureThreshold: 60  # 32 minutes total (120s + 60*30s)
```
**Model Loading Times (approximate):**
- Qwen3-32B: ~20-25 minutes (first download)
- Llama-70B: ~25-30 minutes (first download)
- With cached model on node: ~2-5 minutes
### Issue: Unequal Worker Health
**Cause:** Resource constraints, image pull issues, or configuration errors
**Solution:**
```bash
# Check all worker status
kubectl get pods -n <namespace> -l nvidia.com/dynamo-component-type=worker
# Describe problematic pods
kubectl describe pod <pod-name> -n <namespace>
# Fix issues before benchmarking or results will be skewed
```
---
## Advanced Configuration
### Testing Different Models
Replace `Qwen/Qwen3-32B` with your model in:
- Deployment YAML `args` section
- AIPerf `--model` and `--tokenizer` parameters
### Adjusting Worker Count
Change `replicas: 8` in the deployment YAMLs. Ensure both deployments use the same count for fair comparison.
### Using Custom Datasets
Replace mooncake dataset with your own JSONL file:
- Format: one JSON object per line (JSONL) with a `timestamp` field
- AIPerf supports various formats via `--custom-dataset-type`
### Disaggregated Prefill/Decode
For advanced testing, add separate prefill workers:
```yaml
VllmPrefillWorker:
  componentType: worker
  replicas: 2
  # ... configuration
```
---
## Best Practices
1. **Equal Conditions:** Ensure both deployments have identical worker counts and health before benchmarking
2. **Warm-Up:** Run a small test (100 requests) before the full benchmark to warm up caches
3. **Multiple Runs:** Run benchmarks 3+ times and average results for statistical significance
4. **Monitor Workers:** Watch for any pod restarts or issues during benchmark runs
5. **Document Conditions:** Record cluster state, worker health, and any anomalies
6. **Test Relevant Workloads:** Use datasets that match your actual use case for meaningful results
---
## Conclusion
This guide provides a complete methodology for A/B testing Dynamo's KV Smart Router. The KV router's effectiveness depends heavily on workload characteristics—datasets with high prefix overlap will show the most benefit.
For questions or issues, consult the [Dynamo documentation](https://github.com/ai-dynamo/dynamo) or open an issue on GitHub.
---
## Appendix: Files Reference
- `router-off-deployment.yaml`: Standard routing deployment
- `router-on-deployment.yaml`: KV router enabled deployment
- `benchmark-job.yaml`: AIPerf benchmark pod
- `prepare-dataset.sh`: Dataset preparation script
- Results CSVs: Detailed metrics from AIPerf
- Dynamo repository: [https://github.com/ai-dynamo/dynamo](https://github.com/ai-dynamo/dynamo)