---
title: Dynamo KV Smart Router A/B Benchmarking Guide
---
This guide walks you through setting up and running A/B benchmarks to compare Dynamo's KV Smart Router against standard round-robin routing on a Kubernetes cluster.
## Overview
Dynamo's KV Smart Router intelligently routes requests based on KV cache affinity, improving performance for workloads with shared prompt prefixes. This guide helps you:
1. Deploy two identical Dynamo configurations:
   a. A vLLM server for Qwen3-32B with 8 workers (aggregated) **WITHOUT** the KV Smart Router enabled
   b. A vLLM server for Qwen3-32B with 8 workers (aggregated) **WITH** the KV Smart Router enabled
2. Run controlled benchmarks using AIPerf
3. Compare performance metrics to evaluate KV router effectiveness
---
## Prerequisites
### Required Tools
- `kubectl` (configured with cluster access)
- `helm` (v3+)
- HuggingFace account and token (if model downloads are gated)
- Kubernetes cluster with:
  - GPU nodes (H100, H200, or similar)
  - Sufficient GPU capacity (16+ GPUs recommended for this example)
- Dynamo platform installed globally OR ability to install per-namespace
### Knowledge Requirements
- Basic Kubernetes concepts (namespaces, pods, services)
- Familiarity with LLM inference concepts
- Command-line proficiency
---
## Architecture
This guide sets up two parallel deployments, as well as a benchmarking pod that can test each deployment:
```text
┌─────────────────────────────────────┐
│  Deployment A: Router OFF           │
│  Namespace: router-off-test         │
│  ├─ Frontend (Standard Routing)     │
│  └─ 8x Decode Workers (1 GPU each)  │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│  Deployment B: Router ON            │
│  Namespace: router-on-test          │
│  ├─ Frontend (KV Smart Router)      │
│  └─ 8x Decode Workers (1 GPU each)  │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│  Benchmark Pod                      │
│  Namespace: benchmark               │
│  └─ AIPerf + Dataset                │
└─────────────────────────────────────┘
```
**Key Difference:** Deployment B sets `DYN_ROUTER_MODE=kv` on the frontend to enable KV cache-aware routing.
---
## Phase 1: Namespace and Infrastructure Setup
### Step 1.1: Create Namespaces
```bash
# Create namespaces for both deployments
kubectl create namespace router-off-test
kubectl create namespace router-on-test
kubectl create namespace benchmark
```
### Step 1.2: Create HuggingFace Token Secret (optional)
If the model you plan to deploy requires a HuggingFace token to download (Llama-family models do), replace `YOUR_HF_TOKEN` with your actual HuggingFace token:
```bash
# Router-OFF namespace
kubectl create secret generic hf-token-secret \
--from-literal=HF_TOKEN="YOUR_HF_TOKEN" \
-n router-off-test
# Router-ON namespace
kubectl create secret generic hf-token-secret \
--from-literal=HF_TOKEN="YOUR_HF_TOKEN" \
-n router-on-test
```
### Step 1.3: Install Dynamo Platform (Per-Namespace)
If your cluster uses namespace-restricted Dynamo operators, you'll need to install the Dynamo platform in each namespace. Follow the [Dynamo Kubernetes Installation Guide](https://github.com/ai-dynamo/dynamo/blob/v0.8.0/docs/kubernetes/installation-guide.md) to install the platform in both namespaces:
- `router-off-test`
- `router-on-test`
**Key Configuration Notes:**
- If your cluster uses namespace restrictions, ensure `dynamo-operator.namespaceRestriction.enabled=true` is set during installation
- Adjust version tags to match your cluster's available Dynamo versions
- If you encounter operator compatibility issues (e.g., unsupported MPI arguments), consult your cluster administrator or the Dynamo troubleshooting documentation
### Step 1.4: Verify Infrastructure
Wait for operators and infrastructure to be ready:
```bash
# Check router-off-test
kubectl get pods -n router-off-test
# Check router-on-test
kubectl get pods -n router-on-test
```
You should see:
- `dynamo-platform-dynamo-operator-controller-manager` (2/2 Running)
- `dynamo-platform-etcd-0` (1/1 Running)
- `dynamo-platform-nats-0` (2/2 Running)
---
## Phase 2: Deploy Model Serving
### Step 2.1: Create Deployment YAMLs
Create `router-off-deployment.yaml`:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: vllm-agg-no-router
spec:
  services:
    Frontend:
      dynamoNamespace: vllm-agg-no-router
      componentType: frontend
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.5.0
    VllmDecodeWorker:
      envFromSecret: hf-token-secret
      dynamoNamespace: vllm-agg-no-router
      componentType: worker
      replicas: 8
      resources:
        limits:
          gpu: "1"
      extraPodSpec:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
                - matchExpressions:
                    - key: node.kubernetes.io/instance-type
                      operator: In
                      values:
                        - gpu-h200-sxm  # Adjust to your GPU node type
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.5.0
          workingDir: /workspace/examples/backends/vllm
          command:
            - /bin/sh
            - -c
          args:
            - python3 -m dynamo.vllm --model Qwen/Qwen3-32B --quantization fp8
          startupProbe:
            httpGet:
              path: /health
              port: 9090
            initialDelaySeconds: 120
            periodSeconds: 30
            timeoutSeconds: 10
            failureThreshold: 60  # 32 minutes total (120s + 60*30s)
          livenessProbe:
            httpGet:
              path: /live
              port: 9090
            initialDelaySeconds: 300
            periodSeconds: 30
            timeoutSeconds: 10
            failureThreshold: 10
          readinessProbe:
            httpGet:
              path: /live
              port: 9090
            initialDelaySeconds: 300
            periodSeconds: 30
            timeoutSeconds: 10
            failureThreshold: 10
```
Create `router-on-deployment.yaml`:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: vllm-agg-router
spec:
  services:
    Frontend:
      dynamoNamespace: vllm-agg-router
      componentType: frontend
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.5.0
      envs:
        - name: DYN_ROUTER_MODE
          value: kv  # KEY DIFFERENCE: Enable KV Smart Router
    VllmDecodeWorker:
      envFromSecret: hf-token-secret
      dynamoNamespace: vllm-agg-router
      componentType: worker
      replicas: 8
      resources:
        limits:
          gpu: "1"
      extraPodSpec:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
                - matchExpressions:
                    - key: node.kubernetes.io/instance-type
                      operator: In
                      values:
                        - gpu-h200-sxm  # Adjust to your GPU node type
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.5.0
          workingDir: /workspace/examples/backends/vllm
          command:
            - /bin/sh
            - -c
          args:
            - python3 -m dynamo.vllm --model Qwen/Qwen3-32B --quantization fp8
          startupProbe:
            httpGet:
              path: /health
              port: 9090
            initialDelaySeconds: 120
            periodSeconds: 30
            timeoutSeconds: 10
            failureThreshold: 60  # 32 minutes total (120s + 60*30s)
          livenessProbe:
            httpGet:
              path: /live
              port: 9090
            initialDelaySeconds: 300
            periodSeconds: 30
            timeoutSeconds: 10
            failureThreshold: 10
          readinessProbe:
            httpGet:
              path: /live
              port: 9090
            initialDelaySeconds: 300
            periodSeconds: 30
            timeoutSeconds: 10
            failureThreshold: 10
```
### Step 2.2: Deploy Both Configurations
```bash
# Deploy router-OFF
kubectl apply -f router-off-deployment.yaml -n router-off-test
# Deploy router-ON
kubectl apply -f router-on-deployment.yaml -n router-on-test
```
**💡 Optimization Tip:** Each worker will download the model independently (~20 minutes per pod). For faster initialization, add a shared PVC with `ReadWriteMany` access mode to cache the model.
First, create the PVC separately:
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: "your-shared-storage-class"  # e.g., nfs, efs, nebius-shared-fs
  resources:
    requests:
      storage: 100Gi
```
Then reference it in your DynamoGraphDeployment:
```yaml
spec:
  pvcs:
    - create: false
      name: model-cache
      size: "0"
  services:
    VllmDecodeWorker:
      volumeMounts:
        - mountPoint: /root/.cache/huggingface
          name: model-cache
          useAsCompilationCache: false
```
With this configuration, only the first worker downloads the model; others use the cached version, reducing startup time from 20+ minutes to ~2 minutes per pod.
### Step 2.3: Monitor Deployment Progress
```bash
# Watch router-OFF pods
kubectl get pods -n router-off-test -w
# Watch router-ON pods
kubectl get pods -n router-on-test -w
```
Wait for all pods to reach `Running` status and pass readiness probes.
**Expected Timeline:**
- **With shared PVC** (ReadWriteMany): ~5-10 minutes total (first worker downloads, others reuse cache)
- **Without shared PVC**: 20-30 minutes per worker (workers download independently)
- For 8 workers: Budget **1-2 hours** for full deployment (workers start in parallel but are limited by node scheduling)
The startup probe allows 32 minutes per pod (failureThreshold: 60), which accommodates model download and initialization.
### Step 2.4: Verify All Workers Are Healthy
> ⚠️ **CRITICAL CHECKPOINT**: Before running benchmarks, you **MUST** verify equal worker health in both deployments. Unequal worker counts will invalidate your comparison results.
```bash
# Quick health check - both should show "8/8"
echo "Router OFF: $(kubectl get pods -n router-off-test -l nvidia.com/dynamo-component-type=worker --field-selector=status.phase=Running -o json | jq '[.items[] | select(.status.conditions[] | select(.type=="Ready" and .status=="True"))] | length')/8 ready"
echo "Router ON: $(kubectl get pods -n router-on-test -l nvidia.com/dynamo-component-type=worker --field-selector=status.phase=Running -o json | jq '[.items[] | select(.status.conditions[] | select(.type=="Ready" and .status=="True"))] | length')/8 ready"
# Detailed view
kubectl get pods -n router-off-test -l nvidia.com/dynamo-component-type=worker
kubectl get pods -n router-on-test -l nvidia.com/dynamo-component-type=worker
```
**Both must show 8/8 workers in Ready state (1/1 Running).** If workers are not ready:
- Check logs: `kubectl logs <worker-pod-name> -n <namespace>`
- Common issues: model download in progress, startup probe timeout, insufficient GPU resources
**Do not proceed with benchmarks until all 16 workers (8 per deployment) are healthy.**
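If you prefer to script this gate rather than re-running the checks by hand, a minimal Python sketch like the one below can poll both namespaces until each reports 8 Ready workers. It assumes `kubectl` is already configured for the cluster and reuses the worker label selector shown above; the two-hour ceiling is an arbitrary choice you can adjust.

```python
# Optional readiness gate: poll until both deployments report 8 Ready workers.
# Assumes kubectl is configured and uses the same worker label selector as above.
import json
import subprocess
import time

NAMESPACES = ["router-off-test", "router-on-test"]
SELECTOR = "nvidia.com/dynamo-component-type=worker"
EXPECTED = 8


def ready_workers(namespace: str) -> int:
    """Count worker pods whose Ready condition is True."""
    out = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-l", SELECTOR, "-o", "json"],
        capture_output=True, text=True, check=True,
    ).stdout
    pods = json.loads(out)["items"]
    return sum(
        any(c.get("type") == "Ready" and c.get("status") == "True"
            for c in pod.get("status", {}).get("conditions", []))
        for pod in pods
    )


deadline = time.time() + 2 * 60 * 60  # allow up to 2 hours for slow model downloads
while time.time() < deadline:
    counts = {ns: ready_workers(ns) for ns in NAMESPACES}
    print(counts)
    if all(count == EXPECTED for count in counts.values()):
        print("Both deployments report 8/8 Ready workers - safe to benchmark.")
        break
    time.sleep(60)
else:
    raise SystemExit("Timed out waiting for all workers to become Ready.")
```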
---
## Phase 3: Prepare Benchmark Dataset
### Understanding the Mooncake Trace Dataset
For this A/B comparison, we use the **Mooncake Trace Dataset**, published by [Mooncake AI](https://github.com/kvcache-ai/Mooncake). This is a privacy-preserving dataset of real-world LLM inference traffic from production arXiv workloads.
**What's in the dataset?** Each trace entry contains:
- **Timestamp:** When the request arrived (for realistic request timing)
- **Input/output lengths:** Number of tokens in prompts and responses
- **Block hash IDs:** Cryptographic hashes representing KV cache blocks (explained below)
**Sample trace entry:**
```json
{
  "timestamp": 27482,
  "input_length": 6955,
  "output_length": 52,
  "hash_ids": [46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 2353, 2354]
}
```
### Why Mooncake Traces Matter for KV Cache Benchmarking
**The Challenge:** Traditional LLM benchmarks use synthetic or random data, which is often insufficient to capture the effect of real-world optimizations like the KV Smart Router. To properly evaluate this feature, we need realistic traffic patterns with **prefix repetition** - but this creates a privacy problem: how do we measure realistic KV cache hit patterns without exposing actual user conversations?
**Mooncake's Solution: Privacy-Preserving Block Hashes**
Instead of storing actual prompt text, the Mooncake dataset uses cryptographic hashes to represent KV cache blocks. Each hash ID represents a **512-token block**, and the hash includes both the current block and all preceding blocks. This preserves the **pattern of prefix reuse** while completely protecting user privacy.
### How it works - Multi-turn conversation example
```text
Turn 1 (initial request - long document analysis):
  Input:    ~8,000 tokens (e.g., research paper + question)
  Hash IDs: [46][47][48][49][50][51][52][53][54][55][56][57][58][59][60][61]
            └─ 16 blocks × 512 tokens/block = ~8,192 tokens

Turn 2 (follow-up question on same document):
  Input:    Same document + new question (~8,500 tokens)
  Hash IDs: [46][47][48][49][50][51][52][53][54][55][56][57][58][59][60][61][62]
            └──────────── Reuses first 16 blocks (~8,192 tokens) ────────────┘
  ✅ Cache hit: First 8,192 tokens don't need recomputation!

Turn 3 (another follow-up):
  Input:    Same document + different question (~9,000 tokens)
  Hash IDs: [46][47][48][49][50][51][52][53][54][55][56][57][58][59][60][61][62][63]
            └──────────── Reuses first 16 blocks (~8,192 tokens) ────────────┘
```
When requests share the same hash IDs (e.g., blocks 46-61), it means they share those 512-token blocks - indicating **significant prefix overlap** (in this case, 8,192 tokens). The **KV Smart Router** routes requests with matching hash IDs to the same worker, maximizing cache hits and avoiding redundant computation for those shared prefix tokens.
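To make the block-hash mechanics concrete, here is a small illustrative Python sketch (not part of any Dynamo or Mooncake tooling) that measures the shared prefix between two trace entries the way a cache-aware router can: by counting leading hash IDs in common. The 512-token block size and the example hash IDs come from the trace format described above; the function name is ours.

```python
# Illustrative sketch: estimate prefix overlap between two Mooncake trace entries.
# Each hash ID stands for one 512-token KV cache block, as described above.
BLOCK_TOKENS = 512


def shared_prefix_blocks(hash_ids_a: list[int], hash_ids_b: list[int]) -> int:
    """Count how many leading KV cache blocks two requests have in common."""
    shared = 0
    for a, b in zip(hash_ids_a, hash_ids_b):
        if a != b:
            break
        shared += 1
    return shared


# Turn 1 and Turn 2 from the multi-turn example above.
turn_1 = list(range(46, 62))   # blocks 46..61, ~8,192 input tokens
turn_2 = list(range(46, 63))   # same document plus a new question (adds block 62)

blocks = shared_prefix_blocks(turn_1, turn_2)
print(f"Shared prefix: {blocks} blocks ≈ {blocks * BLOCK_TOKENS} tokens")
# A KV cache-aware router sends both requests to the same worker,
# so those ~8,192 prefix tokens are not recomputed.
```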
**Key Dataset Properties:**
- ✅ **Realistic timing:** Request arrival patterns from production workloads
- ✅ **Real prefix patterns:** Up to 50% cache hit ratio ([Mooncake technical report](https://github.com/kvcache-ai/Mooncake))
- ✅ **Privacy-preserving:** No actual text - only hash-based cache block identifiers
- ✅ **Reproducible:** Public dataset enables fair comparisons across different systems
**Why this matters:** With random synthetic data, the KV Smart Router would show no benefit because there's no prefix reuse to exploit. Mooncake traces provide realistic workload patterns that demonstrate the router's real-world performance gains while respecting user privacy.
---
### Download and Prepare the Dataset
```bash
# Download the Mooncake arxiv trace dataset
curl -sL https://raw.githubusercontent.com/kvcache-ai/Mooncake/refs/heads/main/FAST25-release/arxiv-trace/mooncake_trace.jsonl -o mooncake_trace.jsonl
# Trim to 1000 requests for faster benchmarking
head -n 1000 mooncake_trace.jsonl > mooncake_trace_small.jsonl
# Speed up timestamps 4x (reduces benchmark time from ~12 min to ~3 min)
python3 - <<'PY'
import json
with open("mooncake_trace_small.jsonl") as src, open("mooncake_trace_4x.jsonl", "w") as dst:
    for line in src:
        rec = json.loads(line)
        rec["timestamp"] = int(rec["timestamp"] / 4)
        dst.write(json.dumps(rec) + "\n")
PY
echo "Dataset ready: mooncake_trace_4x.jsonl (1000 requests, 4x speed)"
```
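Optionally, you can sanity-check the prepared trace before copying it into the cluster. The short Python sketch below relies only on the `timestamp`, `input_length`, and `output_length` fields shown in the sample entry earlier, and assumes timestamps are in milliseconds (as the sample suggests).

```python
# Optional sanity check of mooncake_trace_4x.jsonl (field names per the sample entry above).
import json

with open("mooncake_trace_4x.jsonl") as f:
    records = [json.loads(line) for line in f]

n = len(records)
avg_in = sum(r["input_length"] for r in records) / n
avg_out = sum(r["output_length"] for r in records) / n
# Assumption: timestamps are milliseconds from the start of the trace.
span_min = (max(r["timestamp"] for r in records) - min(r["timestamp"] for r in records)) / 1000 / 60

print(f"{n} requests, avg input {avg_in:.0f} tokens, avg output {avg_out:.0f} tokens")
print(f"Replay window at 4x speed: ~{span_min:.1f} minutes")
```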
---
## Phase 4: Set Up Benchmark Environment
### Step 4.1: Deploy Benchmark Pod
Create `benchmark-job.yaml`:
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: aiperf-benchmark
  namespace: benchmark
spec:
  backoffLimit: 1
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: benchmark
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.5.0
          command: ["/bin/sh", "-c", "sleep infinity"]
          imagePullPolicy: IfNotPresent
          resources:
            limits:
              nvidia.com/gpu: 0
```
Deploy:
```bash
kubectl apply -f benchmark-job.yaml
```
Wait for pod to be ready:
```bash
kubectl get pods -n benchmark
```
### Step 4.2: Copy Dataset to Benchmark Pod
```bash
POD_NAME=$(kubectl get pods -n benchmark -l job-name=aiperf-benchmark -o jsonpath='{.items[0].metadata.name}')
kubectl -n benchmark cp mooncake_trace_4x.jsonl ${POD_NAME}:/tmp/mooncake_trace_4x.jsonl
```
### Step 4.3: Install AIPerf
```bash
kubectl -n benchmark exec ${POD_NAME} -- bash -lc '. /opt/dynamo/venv/bin/activate && pip install -q aiperf'
```
---
## Phase 5: Run Benchmarks
### Step 5.1: Benchmark Router-OFF (Baseline)
```bash
kubectl -n benchmark exec ${POD_NAME} -- bash -lc '
. /opt/dynamo/venv/bin/activate
aiperf profile \
--model "Qwen/Qwen3-32B" \
--url "http://vllm-agg-no-router-frontend.router-off-test.svc.cluster.local:8000" \
--endpoint-type chat \
--input-file /tmp/mooncake_trace_4x.jsonl \
--custom-dataset-type mooncake_trace \
--tokenizer "Qwen/Qwen3-32B" \
--streaming \
--request-count 1000 \
--fixed-schedule \
--output-artifact-dir /tmp/router_off_results
'
```
This will take 3-5 minutes. The terminal output includes a summary table.
### Step 5.2: Benchmark Router-ON (KV Smart Router)
```bash
kubectl -n benchmark exec ${POD_NAME} -- bash -lc '
. /opt/dynamo/venv/bin/activate
aiperf profile \
--model "Qwen/Qwen3-32B" \
--url "http://vllm-agg-router-frontend.router-on-test.svc.cluster.local:8000" \
--endpoint-type chat \
--input-file /tmp/mooncake_trace_4x.jsonl \
--custom-dataset-type mooncake_trace \
--tokenizer "Qwen/Qwen3-32B" \
--streaming \
--request-count 1000 \
--fixed-schedule \
--output-artifact-dir /tmp/router_on_results
'
```
### Step 5.3: Collect Results
```bash
# Copy results to local machine
kubectl -n benchmark cp ${POD_NAME}:/tmp/router_off_results/profile_export_aiperf.csv ./router_off_results.csv
kubectl -n benchmark cp ${POD_NAME}:/tmp/router_on_results/profile_export_aiperf.csv ./router_on_results.csv
```
---
## Phase 6: Analyze Results
### Key Metrics to Compare
| Metric | Description | What to Look For |
|--------|-------------|------------------|
| **Time to First Token (TTFT)** | Latency until the first token arrives | Lower is better; the KV router can reduce TTFT when prefixes are reused |
| **Inter Token Latency (ITL)** | Average time between tokens | Lower is better; indicates generation speed |
| **Request Latency** | Total end-to-end latency | Lower is better; overall user experience |
| **Output Token Throughput** | Tokens generated per second (system-wide) | Higher is better; system efficiency |
| **Request Throughput** | Requests completed per second | Higher is better; capacity |
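Beyond eyeballing the two summary tables, you can diff the exported CSVs programmatically. The sketch below is a rough starting point: it assumes `profile_export_aiperf.csv` stores one metric per row with the metric name in the first column, which may differ between AIPerf versions, so adjust the metric names and column handling to match your actual files.

```python
# Rough side-by-side dump of the two AIPerf CSV exports.
# Assumption: one metric per row with the metric name in the first column;
# adjust to the actual layout produced by your AIPerf version.
import csv

METRICS = [
    "Time to First Token",
    "Inter Token Latency",
    "Request Latency",
    "Output Token Throughput",
    "Request Throughput",
]


def load(path: str) -> dict[str, list[str]]:
    with open(path, newline="") as f:
        return {row[0]: row[1:] for row in csv.reader(f) if row}


off = load("router_off_results.csv")
on = load("router_on_results.csv")

for name in METRICS:
    off_row = next((v for k, v in off.items() if name in k), None)
    on_row = next((v for k, v in on.items() if name in k), None)
    if off_row and on_row:
        print(f"{name}\n  router OFF: {off_row}\n  router ON:  {on_row}")
```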
### Interpreting Results
**Your Results May Vary**: The improvement from KV Smart Router depends heavily on your workload characteristics:
**Factors that increase KV router benefit:**
- **High prefix overlap** (shared system prompts, templates, document contexts)
- **Long prompts** (>2000 tokens) where caching saves significant compute
- **Multi-turn conversations** with context carryover
- **Batch workloads** with similar queries
**Factors that reduce KV router benefit:**
- **Unique prompts** with no prefix reuse
- **Short prompts** (\<1000 tokens) where routing overhead exceeds benefit
- **Evenly distributed load** where round-robin is already optimal
- **Low request rate** where cache eviction negates benefits
**Expected Performance:**
- **High prefix overlap workloads**: 20-50% TTFT improvement
- **Moderate prefix overlap**: 10-20% improvement
- **Low prefix overlap**: \<5% improvement (may not be worth enabling)
**KV Smart Router is beneficial when:**
- TTFT improvements > 20%
- No significant degradation in other metrics
- Workload demonstrates measurable prefix reuse patterns
**Standard routing is better when:**
- KV router shows \<10% improvement
- Increased latency variance is observed
- Load distribution across workers is more important than cache affinity
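As a rough decision aid, the thresholds above can be applied directly to the average TTFT from both runs. The sketch below uses the numbers from the example comparison that follows; substitute your own measurements.

```python
# Rough decision aid using the TTFT thresholds described above.
# The placeholder values match the example comparison below; replace them with your own results.
ttft_off_ms = 12_764   # baseline (router OFF), avg TTFT
ttft_on_ms = 8_012     # KV Smart Router (router ON), avg TTFT

improvement_pct = (ttft_off_ms - ttft_on_ms) / ttft_off_ms * 100
print(f"TTFT improvement: {improvement_pct:.1f}%")

if improvement_pct > 20:
    print("KV Smart Router looks clearly beneficial for this workload.")
elif improvement_pct >= 10:
    print("Moderate benefit - weigh it against latency variance and load balance.")
else:
    print("Little benefit - standard round-robin routing may be the better default.")
```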
### Example Comparison
From the terminal output, compare the summary tables:
```
Router-OFF (Baseline):
TTFT avg: 12,764 ms p99: 45,898 ms
Request Latency avg: 32,978 ms
Output Token Throughput: 1,614 tokens/sec
Request Throughput: 8.61 req/sec
Router-ON (KV Router):
TTFT avg: 8,012 ms p99: 28,644 ms (37% faster ✅)
Request Latency avg: 28,972 ms (12% faster ✅)
Output Token Throughput: 1,746 tokens/sec (8% higher ✅)
Request Throughput: 9.33 req/sec (8% higher ✅)
```
In this example with all 8 workers healthy, the **KV router significantly outperformed** the baseline:
- **37% faster TTFT** - Users see first token much sooner
- **8% higher throughput** - System processes more requests per second
- **12% lower latency** - Faster end-to-end completion
The Mooncake arXiv trace dataset has sufficient prefix overlap (long input sequences with similar patterns) to benefit from KV cache-aware routing. Workloads with explicit shared prefixes (system prompts, templates) may see even greater improvements.
---
## Phase 7: Cleanup
```bash
# Delete deployments
kubectl delete dynamographdeployment vllm-agg-no-router -n router-off-test
kubectl delete dynamographdeployment vllm-agg-router -n router-on-test
# Delete namespaces (removes all resources)
kubectl delete namespace router-off-test
kubectl delete namespace router-on-test
kubectl delete namespace benchmark
```
---
## Troubleshooting
### Issue: Pods Stuck in Pending
**Cause:** Insufficient GPU resources
**Solution:**
```bash
# Check GPU availability
kubectl describe nodes | grep -A 10 "Allocated resources"
# Reduce worker replicas if needed
kubectl edit dynamographdeployment <deployment-name> -n <namespace>
```
### Issue: ImagePullBackOff Errors
**Cause:** Version mismatch or missing credentials
**Solution:**
```bash
# Check available versions
kubectl get pods -n dynamo-system -o yaml | grep image:
# Update deployment YAML to match cluster version
```
### Issue: Operator Not Processing Deployment
**Cause:** Namespace restrictions
**Solution:**
- Ensure Dynamo platform is Helm-installed in the namespace
- Verify the operator has the `--restrictedNamespace=<namespace>` argument
- Check operator logs: `kubectl logs -n <namespace> deployment/dynamo-platform-dynamo-operator-controller-manager`
### Issue: Workers Not Becoming Ready
**Cause:** Model download failures or probe misconfiguration
**Solution:**
```bash
# Check worker logs
kubectl logs <worker-pod-name> -n <namespace>
# Common issues:
# - Invalid HuggingFace token
# - Network connectivity
# - Insufficient disk space for model
```
### Issue: Workers Restarting in CrashLoopBackOff
**Cause:** Startup probe timeout - workers are killed before finishing initialization
**Symptoms:**
- Pods show "Container main failed startup probe, will be restarted"
- Logs show model still downloading or loading when pod is killed
- Large models (>30GB) take longer than the default 22-minute timeout
**Solution:**
Increase the startup probe `failureThreshold`:
```bash
# Patch the deployment to allow 32 minutes instead of 22
kubectl patch dynamographdeployment <deployment-name> -n <namespace> --type='json' \
-p='[{"op": "replace", "path": "/spec/services/VllmDecodeWorker/extraPodSpec/mainContainer/startupProbe/failureThreshold", "value": 60}]'
```
Or update your YAML before deploying:
```yaml
startupProbe:
  httpGet:
    path: /health
    port: 9090
  initialDelaySeconds: 120
  periodSeconds: 30
  timeoutSeconds: 10
  failureThreshold: 60  # 32 minutes total (120s + 60*30s)
```
**Model Loading Times (approximate):**
- Qwen3-32B: ~20-25 minutes (first download)
- Llama-70B: ~25-30 minutes (first download)
- With cached model on node: ~2-5 minutes
### Issue: Unequal Worker Health
**Cause:** Resource constraints, image pull issues, or configuration errors
**Solution:**
```bash
# Check all worker status
kubectl get pods -n <namespace> -l nvidia.com/dynamo-component-type=worker
# Describe problematic pods
kubectl describe pod <pod-name> -n <namespace>
# Fix issues before benchmarking or results will be skewed
```
---
## Advanced Configuration
### Testing Different Models
Replace `Qwen/Qwen3-32B` with your model in:
- Deployment YAML `args` section
- AIPerf `--model` and `--tokenizer` parameters
### Adjusting Worker Count
Change `replicas: 8` in the deployment YAMLs. Ensure both deployments use the same count for fair comparison.
### Using Custom Datasets
Replace mooncake dataset with your own JSONL file:
- Format: one JSON object per line (JSONL) with a `timestamp` field
- AIPerf supports various formats via `--custom-dataset-type`
### Disaggregated Prefill/Decode
For advanced testing, add separate prefill workers:
```yaml
VllmPrefillWorker:
  componentType: worker
  replicas: 2
  # ... configuration
```
---
## Best Practices
1. **Equal Conditions:** Ensure both deployments have identical worker counts and health before benchmarking
2. **Warm-Up:** Run a small test (100 requests) before the full benchmark to warm up caches
3. **Multiple Runs:** Run benchmarks 3+ times and average results for statistical significance
4. **Monitor Workers:** Watch for any pod restarts or issues during benchmark runs
5. **Document Conditions:** Record cluster state, worker health, and any anomalies
6. **Test Relevant Workloads:** Use datasets that match your actual use case for meaningful results
---
## Conclusion
This guide provides a complete methodology for A/B testing Dynamo's KV Smart Router. The KV router's effectiveness depends heavily on workload characteristics—datasets with high prefix overlap will show the most benefit.
For questions or issues, consult the [Dynamo documentation](https://github.com/ai-dynamo/dynamo) or open an issue on GitHub.
---
## Appendix: Files Reference
- `router-off-deployment.yaml`: Standard routing deployment
- `router-on-deployment.yaml`: KV router enabled deployment
- `benchmark-job.yaml`: AIPerf benchmark pod
- `prepare-dataset.sh`: Dataset preparation script
- Results CSVs: Detailed metrics from AIPerf
- Dynamo repository: [https://github.com/ai-dynamo/dynamo](https://github.com/ai-dynamo/dynamo)