Kimi-K2.5 Feature-Stack Benchmark
Four configurations run Dynamo + TensorRT-LLM on 6x GB200 nodes (24 GPUs, MNNVL), starting from plain aggregated round-robin serving and adding one feature at a time up to the full disaggregated stack. The full stack delivers roughly 3x the per-GPU throughput of the baseline while also improving per-user token speed.
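The per-GPU comparison is simple arithmetic, sketched below under stated assumptions. The 6 nodes and 24 GPUs come from the setup above; the metric (total output tokens per second divided by GPU count) is an assumed definition, and the two totals are placeholder values chosen to make the ratio easy to follow, not measurements from this benchmark.

```python
NODES = 6
GPUS_PER_NODE = 4  # 24 GPUs across 6 nodes, per the setup description


def per_gpu_throughput(total_tokens_per_s: float, num_gpus: int) -> float:
    """Normalize system throughput by the number of GPUs serving it."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    return total_tokens_per_s / num_gpus


def speedup(candidate: float, baseline: float) -> float:
    """Ratio of a candidate's per-GPU throughput to the baseline's."""
    return candidate / baseline


total_gpus = NODES * GPUS_PER_NODE

# Placeholder totals (hypothetical, NOT benchmark results).
baseline_total_tokens_per_s = 2400.0
full_stack_total_tokens_per_s = 7200.0

baseline = per_gpu_throughput(baseline_total_tokens_per_s, total_gpus)
full_stack = per_gpu_throughput(full_stack_total_tokens_per_s, total_gpus)
print(f"{speedup(full_stack, baseline):.1f}x per-GPU throughput")
```

Because every configuration here is sized for the same 24 GPUs, the per-GPU ratio equals the ratio of system totals; the normalization matters mainly when comparing against deployments of a different size.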
Benchmark setup
Results
The disaggregated configuration with KV-aware routing, Eagle3 decoding, and KV offloading achieves the best system throughput and interactivity. Each row is that configuration’s chosen operating point on the source Pareto plot. Concurrency differs by row and the values are approximate plot readings, so treat them as per-configuration operating points rather than an equal-load sweep:
The full disaggregated stack dominates the throughput-interactivity Pareto frontier in the source plot: roughly 3x the per-GPU throughput of the plain aggregated baseline with better per-user token speed.
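To make "dominates the Pareto frontier" concrete, here is a minimal sketch of how a frontier can be computed from operating points when both axes are higher-is-better. The `OperatingPoint` type, its field names, and the four sample points are hypothetical and are not readings from the source plot.

```python
from typing import List, NamedTuple


class OperatingPoint(NamedTuple):
    name: str
    tokens_per_s_per_user: float  # interactivity, higher is better
    tokens_per_s_per_gpu: float   # throughput, higher is better


def dominates(a: OperatingPoint, b: OperatingPoint) -> bool:
    """True if a is at least as good as b on both axes and better on one."""
    at_least_as_good = (
        a.tokens_per_s_per_user >= b.tokens_per_s_per_user
        and a.tokens_per_s_per_gpu >= b.tokens_per_s_per_gpu
    )
    strictly_better = (
        a.tokens_per_s_per_user > b.tokens_per_s_per_user
        or a.tokens_per_s_per_gpu > b.tokens_per_s_per_gpu
    )
    return at_least_as_good and strictly_better


def pareto_frontier(points: List[OperatingPoint]) -> List[OperatingPoint]:
    """Keep every point that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical points for illustration only (NOT benchmark results).
points = [
    OperatingPoint("config-a", 20.0, 100.0),
    OperatingPoint("config-b", 25.0, 150.0),
    OperatingPoint("config-c", 30.0, 300.0),
    OperatingPoint("config-d", 35.0, 80.0),
]
frontier = pareto_frontier(points)
print([p.name for p in frontier])
```

In this toy data, config-c dominates config-a and config-b outright, while config-d stays on the frontier because it trades throughput for higher per-user speed. A configuration "dominates the frontier" when its points displace the other configurations' points in this way.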
Compared Configurations
Reproduce
The trace emulates a long-context, KV-reuse-heavy agentic coding workload (~200k-token context window, multi-turn sessions with restart-splits and a layered prefix-cache model). Generate it following the dataset instructions in the AIPerf repository, then copy it to /model-cache/traces/agent_trace_data/dataset.jsonl on the PVC.
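Before starting a run, it can help to confirm that the copied trace is well-formed JSONL. The sketch below checks only that each non-empty line parses as JSON; it assumes nothing about the trace schema, and `check_jsonl` is a hypothetical helper, not part of AIPerf or Dynamo.

```python
import json
from pathlib import Path
from typing import List, Tuple


def check_jsonl(path: str) -> Tuple[int, List[int]]:
    """Return (number of valid records, line numbers that failed to parse)."""
    valid = 0
    bad_lines: List[int] = []
    with Path(path).open(encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                json.loads(line)
            except json.JSONDecodeError:
                bad_lines.append(lineno)
            else:
                valid += 1
    return valid, bad_lines
```

Calling `check_jsonl("/model-cache/traces/agent_trace_data/dataset.jsonl")` from a pod that mounts the PVC should return a non-zero record count and an empty list of bad lines if the copy completed intact.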
Each configuration’s perf.yaml runs a warmup pass and then wraps this AIPerf command (concurrency 32 for the disaggregated configuration, 24 for the aggregated Eagle3 configurations, 8 for the baseline):
Deploy one configuration at a time, since each is sized for the full 24 GPUs:
Notes
- The manifests ship with a placeholder image tag (nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:<IMAGE_TAG>). Set a Dynamo TRT-LLM runtime image (v1.1.1~) that supports Kimi-K2.5 + Eagle3 in each deploy.yaml before applying.
- Your HuggingFace token needs access to both nvidia/Kimi-K2.5-NVFP4 and the nvidia/Kimi-K2.5-Thinking-Eagle3 speculative-decoding head.
- If you rename the ComputeDomain CR, mirror the change in every deploy.yaml under extraPodSpec.resourceClaims and resources.claims.
- Source: recipes/kimi-k2.5
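A quick pre-flight check can catch a manifest that still carries the placeholder image tag. The sketch below scans a directory tree for files named deploy.yaml and reports lines containing `<IMAGE_TAG>`. The directory layout is an assumption, and `find_placeholders` is a hypothetical helper rather than anything shipped with the recipe.

```python
from pathlib import Path
from typing import List, Tuple

PLACEHOLDER = "<IMAGE_TAG>"


def find_placeholders(root: str) -> List[Tuple[str, int]]:
    """Return (file path, line number) for every unreplaced placeholder."""
    hits: List[Tuple[str, int]] = []
    for manifest in sorted(Path(root).rglob("deploy.yaml")):
        text = manifest.read_text(encoding="utf-8")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if PLACEHOLDER in line:
                hits.append((str(manifest), lineno))
    return hits
```

Run it against your local checkout of the recipe directory; an empty result means every deploy.yaml found under that root has had its image tag set.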
Winning Configuration
The disaggregated Eagle3 + KV router + offload configuration is the winner and is deployable from its assets above. A recommended Recipe may be promoted from this benchmark in a future release; the aggregated configurations exist as benchmark steps and controls.