> For clean Markdown content of this page, append .md to this URL. For the complete documentation index, see https://docs.nvidia.com/dynamo/llms.txt. For full content including API reference and SDK examples, see https://docs.nvidia.com/dynamo/llms-full.txt.

# Kimi-K2.5 Feature-Stack Benchmark

Four configurations run Dynamo + TensorRT-LLM on 6x GB200 nodes (24 GPUs, MNNVL), starting from plain aggregated round-robin serving and adding one feature at a time up to the full disaggregated stack. The full stack delivers roughly **3x the per-GPU throughput** of the baseline while also improving per-user token speed.

Benchmark setup

| Setting | Value |
| ------- | ----- |
| Model | nvidia/Kimi-K2.5-NVFP4 |
| GPUs | 24x GB200 (6 nodes, MNNVL) |
| Runtime | TensorRT-LLM |
| Workload | Mooncake-style agentic coding trace (\~200K-token context, multi-turn), one-hour replay |
| Metrics | tok/s/user, tok/s/GPU, goodput at TTFT 5s / ITL 10ms |
| Held constant | Model, runtime, GPU count, trace, duration, and goodput thresholds across all configurations |

## Results

The disaggregated configuration with KV-aware routing, Eagle3 decoding, and KV offloading achieves the best system throughput and interactivity. Each row is that configuration's chosen operating point on the source Pareto plot. Concurrency differs by row and the values are approximate plot readings, so read them as per-configuration operating points rather than an equal-load sweep:

| Configuration                          | Concurrency | tok/s/user (avg) | tok/s/GPU |
| -------------------------------------- | ----------: | ---------------: | --------: |
| Disagg + Eagle3 + KV routing + offload |          32 |            \~130 |   \~5,400 |
| Agg + Eagle3 + KV routing              |          24 |             \~85 |   \~4,400 |
| Agg + Eagle3 + round-robin             |          24 |             \~95 |   \~4,000 |
| Agg + round-robin (no Eagle3)          |           8 |            \~105 |   \~1,700 |

The full disaggregated stack dominates the throughput-interactivity Pareto frontier in the source plot: roughly **3x the per-GPU throughput** of the plain aggregated baseline with better per-user token speed.

## Compared Configurations

| Role | Configuration | Deploy | Benchmark |
| ---- | ------------- | ------ | --------- |
| Winner | **Disagg + Eagle3 + KV router + offload**: 3x DEP4 prefill + 3x TEP4 decode, concurrency 32 | `deploy.yaml` | `perf.yaml` |
| Comparison | **Agg + Eagle3 + KV router**: 3x TEP8 aggregated, concurrency 24; routing plus speculation | `deploy.yaml` | `perf.yaml` |
| Comparison | **Agg + Eagle3 + round-robin**: 3x TEP8 aggregated, concurrency 24; speculation without KV-aware routing | `deploy.yaml` | `perf.yaml` |
| Baseline | **Agg + round-robin**: 3x TEP8 aggregated, concurrency 8; no speculation, no P/D split | `deploy.yaml` | `perf.yaml` |
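The headline ratios can be rechecked from the approximate plot readings in the Results table. The following is a minimal Python sketch, not part of the recipe. It assumes the numeric suffix in DEP4, TEP4, and TEP8 is the GPU count per worker, which is consistent with each configuration being sized for 24 GPUs. The dictionary layout and helper names are illustrative only.

```python
# Approximate operating points copied from the Results table (plot readings).
# "workers" lists (replicas, GPUs per replica); the GPUs-per-replica values are
# an assumption read off the DEP4 / TEP4 / TEP8 labels.
BASELINE = "Agg + round-robin (no Eagle3)"

CONFIGS = {
    "Disagg + Eagle3 + KV routing + offload": {
        "tok_s_user": 130, "tok_s_gpu": 5400, "workers": [(3, 4), (3, 4)],
    },
    "Agg + Eagle3 + KV routing": {
        "tok_s_user": 85, "tok_s_gpu": 4400, "workers": [(3, 8)],
    },
    "Agg + Eagle3 + round-robin": {
        "tok_s_user": 95, "tok_s_gpu": 4000, "workers": [(3, 8)],
    },
    BASELINE: {
        "tok_s_user": 105, "tok_s_gpu": 1700, "workers": [(3, 8)],
    },
}


def total_gpus(workers):
    """Sum replicas * GPUs-per-replica over all worker groups."""
    return sum(replicas * gpus for replicas, gpus in workers)


def summarize(configs, baseline=BASELINE):
    """Return per-configuration GPU count and ratios relative to the baseline."""
    base = configs[baseline]
    rows = []
    for name, cfg in configs.items():
        gpus = total_gpus(cfg["workers"])
        rows.append({
            "name": name,
            "gpus": gpus,
            "gpu_speedup": cfg["tok_s_gpu"] / base["tok_s_gpu"],
            "user_speedup": cfg["tok_s_user"] / base["tok_s_user"],
            "system_tok_s": cfg["tok_s_gpu"] * gpus,
        })
    return rows


if __name__ == "__main__":
    for row in summarize(CONFIGS):
        print(f'{row["name"]:<40} {row["gpus"]:>3} GPUs  '
              f'{row["gpu_speedup"]:.2f}x tok/s/GPU  '
              f'{row["user_speedup"]:.2f}x tok/s/user  '
              f'{row["system_tok_s"]:,} tok/s total')
```

With these readings, the full stack comes out at about 3.2x the baseline's per-GPU throughput, which matches the "roughly 3x" claim. Because the inputs are plot readings, treat the ratios as approximate.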

## Reproduce

The trace emulates a long-context, KV-reuse-heavy agentic coding workload (\~200K-token context window, multi-turn sessions with restart-splits and a layered prefix-cache model). Generate it following the [dataset instructions in the AIPerf repository](https://github.com/ai-dynamo/aiperf/blob/1ecc2eac988eedc0e3a79b4c2d1063bfc295a014/src/aiperf/dataset/agentic_code_gen/datasets/1k_sessions_200k_ctx/manifest.json), then copy it to `/model-cache/traces/agent_trace_data/dataset.jsonl` on the PVC.

Each configuration's `perf.yaml` runs a warmup pass and then wraps this AIPerf command (concurrency 32 for the disaggregated configuration, 24 for the aggregated Eagle3 configurations, 8 for the baseline):

```bash
aiperf profile -m nvidia/Kimi-K2.5-NVFP4 \
  --tokenizer nvidia/Kimi-K2.5-NVFP4 --tokenizer-trust-remote-code \
  --input-file /model-cache/traces/agent_trace_data/dataset.jsonl \
  --custom-dataset-type mooncake_trace \
  --url http://:8000 \
  --streaming --extra-inputs ignore_eos:true \
  --concurrency <8|24|32> --random-seed 42 \
  --benchmark-duration 3600 --concurrency-ramp-duration 60 \
  --goodput "time_to_first_token:5000 inter_token_latency:10"
```

Deploy one configuration at a time; each is sized for the full 24 GPUs:

```bash
export NAMESPACE=your-namespace

# One-time prep: storage, ComputeDomain, model + Eagle3 head download
kubectl apply -f recipes/kimi-k2.5/model-cache/model-cache.yaml -n ${NAMESPACE}
kubectl apply -f recipes/kimi-k2.5/model-cache/compute-domain.yaml -n ${NAMESPACE}
kubectl apply -f recipes/kimi-k2.5/model-cache/model-download.yaml -n ${NAMESPACE}
kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=3600s

# Pick one configuration from the table above, deploy it, wait for readiness, then apply its perf.yaml.
kubectl apply -f recipes/kimi-k2.5/trtllm//deploy.yaml -n ${NAMESPACE}
kubectl apply -f recipes/kimi-k2.5/trtllm//perf.yaml -n ${NAMESPACE}
```

## Notes

* The manifests ship with a placeholder image tag (`nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:`). Before applying, set a Dynamo TRT-LLM runtime image (v1.1.1 or later) that supports Kimi-K2.5 + Eagle3 in each `deploy.yaml`.
* Your HuggingFace token needs access to both `nvidia/Kimi-K2.5-NVFP4` and the `nvidia/Kimi-K2.5-Thinking-Eagle3` speculative-decoding head.
* If you rename the ComputeDomain CR, mirror the change in every `deploy.yaml` under `extraPodSpec.resourceClaims` and `resources.claims`.
* Source: [recipes/kimi-k2.5](https://github.com/ai-dynamo/dynamo/tree/main/recipes/kimi-k2.5)

## Winning Configuration

The disaggregated Eagle3 + KV router + offload configuration is the winner and is deployable from its assets above. A recommended Recipe may be promoted from this benchmark in a future release; the aggregated configurations exist as benchmark steps and controls.
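The goodput thresholds used in the benchmark (TTFT 5 s, ITL 10 ms) can be illustrated with a small Python sketch. It assumes a request counts toward goodput only when it meets both thresholds, and that goodput is the rate of such requests over the run. AIPerf's exact definition may differ, and the `RequestRecord` type and its fields are hypothetical, not AIPerf's output schema.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    """Hypothetical per-request measurements (not AIPerf's actual schema)."""
    ttft_ms: float  # time to first token, in milliseconds
    itl_ms: float   # mean inter-token latency, in milliseconds


def goodput(records, duration_s, ttft_slo_ms=5000.0, itl_slo_ms=10.0):
    """Requests per second that satisfy both latency thresholds (assumed definition)."""
    good = sum(
        1 for r in records
        if r.ttft_ms <= ttft_slo_ms and r.itl_ms <= itl_slo_ms
    )
    return good / duration_s


def itl_to_tok_s(itl_ms):
    """Steady-state per-user decode rate implied by a given inter-token latency."""
    return 1000.0 / itl_ms


if __name__ == "__main__":
    sample = [
        RequestRecord(ttft_ms=1200, itl_ms=8.5),   # meets both thresholds
        RequestRecord(ttft_ms=6100, itl_ms=7.9),   # TTFT too slow
        RequestRecord(ttft_ms=900, itl_ms=12.0),   # ITL too slow
        RequestRecord(ttft_ms=4800, itl_ms=9.9),   # meets both thresholds
    ]
    print(goodput(sample, duration_s=2.0))  # 2 good requests over 2 s
    print(itl_to_tok_s(10.0))               # decode rate at the ITL threshold
```

Under this reading, the 10 ms ITL threshold corresponds to a steady-state decode rate of 100 tok/s per user, which helps relate the goodput criterion to the tok/s/user column in the results.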