Kimi-K2.5 Feature-Stack Benchmark

How much throughput and interactivity do we gain as KV routing, Eagle3, disaggregation, and KV offload are composed?

Four configurations run Dynamo + TensorRT-LLM on 6x GB200 nodes (24 GPUs, MNNVL), starting from plain aggregated round-robin serving and adding one feature at a time up to the full disaggregated stack. The full stack delivers roughly 3x the per-GPU throughput of the baseline while also improving per-user token speed.

Benchmark setup

Model: nvidia/Kimi-K2.5-NVFP4
GPUs: 24x GB200 (6 nodes, MNNVL)
Runtime: TensorRT-LLM
Workload: Mooncake-style agentic coding trace (~200K-token context, multi-turn), one-hour replay
Metrics: tok/s/user, tok/s/GPU, goodput at TTFT 5s / ITL 10ms
Held constant: Model, runtime, GPU count, trace, duration, and goodput thresholds across all configurations
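
To make the goodput thresholds concrete, here is a minimal sketch of how a request could be classified against TTFT 5s / ITL 10ms. It assumes a request counts as "good" only when it meets both limits, with inclusive bounds. The input format (one `ttft_ms,itl_ms` pair per line) and the helper name `good_requests` are hypothetical and are not AIPerf's export format.

```shell
# Count requests that meet both goodput thresholds.
# Input (hypothetical format): one request per line, "ttft_ms,itl_ms".
# Assumption: a request is "good" only if TTFT <= 5000 ms AND ITL <= 10 ms.
good_requests() {
  awk -F, -v ttft_max=5000 -v itl_max=10 '
    $1 <= ttft_max && $2 <= itl_max { good++ }
    END { printf "%d/%d\n", good, NR }
  '
}

# Four made-up requests: two pass, one misses TTFT, one misses ITL.
printf '%s\n' "1200,8.5" "6100,7.9" "900,12.3" "4800,9.9" | good_requests   # prints 2/4
```

The sample values are invented for illustration and are not benchmark data.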

Results

The disaggregated configuration with KV-aware routing, Eagle3 decoding, and KV offloading achieves the best system throughput and interactivity. Each row is that configuration's chosen operating point on the source Pareto plot. Concurrency differs by row and the values are approximate plot readings, so treat the rows as per-configuration operating points, not an equal-load sweep:

| Configuration | Concurrency | tok/s/user (avg) | tok/s/GPU |
|---|---|---|---|
| Disagg + Eagle3 + KV routing + offload | 32 | ~130 | ~5,400 |
| Agg + Eagle3 + KV routing | 24 | ~85 | ~4,400 |
| Agg + Eagle3 + round-robin | 24 | ~95 | ~4,000 |
| Agg + round-robin (no Eagle3) | 8 | ~105 | ~1,700 |

The full disaggregated stack dominates the throughput-interactivity Pareto frontier in the source plot: roughly 3x the per-GPU throughput of the plain aggregated baseline with better per-user token speed.
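
The "roughly 3x" figure follows directly from the tok/s/GPU column of the results table. A minimal sketch of the arithmetic, using the approximate plot readings as inputs (the helper name `ratio` is only for illustration):

```shell
# Approximate plot readings from the results table (tok/s/GPU).
baseline=1700
full_stack=5400

# Print the ratio of two readings, rounded to one decimal place.
ratio() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

ratio "$full_stack" "$baseline"   # full stack vs. baseline: prints 3.2
ratio 4400 "$baseline"            # Agg + Eagle3 + KV routing: prints 2.6
ratio 4000 "$baseline"            # Agg + Eagle3 + round-robin: prints 2.4
ratio 130 105                     # per-user token speed, full stack vs. baseline: prints 1.2
```

Because the inputs are plot readings at different concurrencies, the ratios are indicative rather than exact.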

Compared Configurations

| Role | Configuration | Deploy | Benchmark |
|---|---|---|---|
| Winner | Disagg + Eagle3 + KV router + offload: 3x DEP4 prefill + 3x TEP4 decode, concurrency 32 | deploy.yaml | perf.yaml |
| Comparison | Agg + Eagle3 + KV router: 3x TEP8 aggregated, concurrency 24; routing plus speculation | deploy.yaml | perf.yaml |
| Comparison | Agg + Eagle3 + round-robin: 3x TEP8 aggregated, concurrency 24; speculation without KV-aware routing | deploy.yaml | perf.yaml |
| Baseline | Agg + round-robin: 3x TEP8 aggregated, concurrency 8; no speculation, no P/D split | deploy.yaml | perf.yaml |
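
As a quick sanity check, both layouts account for all 24 GPUs. The sketch below assumes the numeric suffix in DEP4, TEP4, and TEP8 is the number of GPUs per worker; the source does not state this explicitly.

```shell
# Assumption: the suffix in DEP4 / TEP4 / TEP8 is GPUs per worker.
gpus() { echo $(( $1 * $2 )); }   # workers x GPUs per worker

disagg=$(( $(gpus 3 4) + $(gpus 3 4) ))   # 3x DEP4 prefill + 3x TEP4 decode
agg=$(gpus 3 8)                            # 3x TEP8 aggregated

echo "disagg=${disagg} agg=${agg}"         # prints disagg=24 agg=24
```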

Reproduce

The trace emulates a long-context, KV-reuse-heavy agentic coding workload (~200K-token context window, multi-turn sessions with restart-splits and a layered prefix-cache model). Generate it by following the dataset instructions in the AIPerf repository, then copy it to /model-cache/traces/agent_trace_data/dataset.jsonl on the PVC.

Each configuration’s perf.yaml runs a warmup pass and then wraps this AIPerf command (concurrency 32 for the disaggregated configuration, 24 for the aggregated Eagle3 configurations, 8 for the baseline):

aiperf profile -m nvidia/Kimi-K2.5-NVFP4 \
  --tokenizer nvidia/Kimi-K2.5-NVFP4 --tokenizer-trust-remote-code \
  --input-file /model-cache/traces/agent_trace_data/dataset.jsonl \
  --custom-dataset-type mooncake_trace \
  --url http://<frontend>:8000 \
  --streaming --extra-inputs ignore_eos:true \
  --concurrency <8|24|32> --random-seed 42 \
  --benchmark-duration 3600 --concurrency-ramp-duration 60 \
  --goodput "time_to_first_token:5000 inter_token_latency:10"

Deploy one configuration at a time — each is sized for the full 24 GPUs:

export NAMESPACE=your-namespace

# One-time prep: storage, ComputeDomain, model + Eagle3 head download
kubectl apply -f recipes/kimi-k2.5/model-cache/model-cache.yaml -n ${NAMESPACE}
kubectl apply -f recipes/kimi-k2.5/model-cache/compute-domain.yaml -n ${NAMESPACE}
kubectl apply -f recipes/kimi-k2.5/model-cache/model-download.yaml -n ${NAMESPACE}
kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=3600s

# Pick one configuration from the table above and deploy it.
kubectl apply -f recipes/kimi-k2.5/trtllm/<configuration>/deploy.yaml -n ${NAMESPACE}

# Once the deployment is ready, apply its perf.yaml.
kubectl apply -f recipes/kimi-k2.5/trtllm/<configuration>/perf.yaml -n ${NAMESPACE}
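
Because each configuration needs all 24 GPUs, switching to another one implies tearing the previous one down first. The sketch below prints, rather than runs, one possible per-configuration sequence. The helper name `config_steps` is hypothetical, and deleting through the same manifests is an assumption about teardown, not a documented step of the recipe.

```shell
# Print (not run) the kubectl steps for one configuration.
# "$1" is the configuration directory name, which the source leaves as a placeholder.
config_steps() {
  dir="recipes/kimi-k2.5/trtllm/$1"
  ns="${NAMESPACE:-your-namespace}"
  echo "kubectl apply -f ${dir}/deploy.yaml -n ${ns}"
  echo "kubectl apply -f ${dir}/perf.yaml -n ${ns}"
  # Assumed teardown, so the next configuration can claim all 24 GPUs.
  echo "kubectl delete -f ${dir}/perf.yaml -n ${ns}"
  echo "kubectl delete -f ${dir}/deploy.yaml -n ${ns}"
}

config_steps "<configuration>"
```

Review the printed commands, wait for the deployment to become ready before applying perf.yaml, and let the benchmark finish before running the delete steps.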

Notes

  • The manifests ship with a placeholder image tag (nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:<IMAGE_TAG>). Before applying, set a Dynamo TRT-LLM runtime image (v1.1.1 or later) that supports Kimi-K2.5 + Eagle3 in each deploy.yaml.
  • Your Hugging Face token needs access to both nvidia/Kimi-K2.5-NVFP4 and the nvidia/Kimi-K2.5-Thinking-Eagle3 speculative-decoding head.
  • If you rename the ComputeDomain CR, mirror the change in every deploy.yaml under extraPodSpec.resourceClaims and resources.claims.
  • Source: recipes/kimi-k2.5

Winning Configuration

The disaggregated Eagle3 + KV router + offload configuration is the winner and is deployable from its assets above. A recommended Recipe may be promoted from this benchmark in a future release; the aggregated configurations exist as benchmark steps and controls.