
# Kimi-K2.5 Feature-Stack Benchmark

Four configurations run Dynamo + TensorRT-LLM on 6x GB200 nodes (24 GPUs, MNNVL), starting from plain aggregated round-robin serving and adding one feature at a time up to the full disaggregated stack. The full stack delivers roughly **3x the per-GPU throughput** of the baseline while also improving per-user token speed.

<p>
  Benchmark setup
</p>

<b>Model</b> nvidia/Kimi-K2.5-NVFP4

<b>GPUs</b> 24x GB200 (6 nodes, MNNVL)

<b>Runtime</b> TensorRT-LLM

<b>Workload</b> Mooncake-style agentic coding trace (\~200K-token context, multi-turn), one-hour replay

<b>Metrics</b> tok/s/user, tok/s/GPU, goodput at TTFT 5s / ITL 10ms

<b>Held constant</b> Model, runtime, GPU count, trace, duration, and goodput thresholds across all configurations
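The goodput thresholds can be read as a per-request filter. The sketch below assumes a request counts toward goodput only when it meets both limits (TTFT at most 5000 ms and ITL at most 10 ms); the `count_good` helper, the inclusive comparison, and the sample latencies are illustrative assumptions, not AIPerf's implementation or measured data.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads "ttft_ms itl_ms" pairs on stdin and prints how many
# requests meet both thresholds (inclusive comparison is an assumption).
count_good() {
  awk -v ttft_max=5000 -v itl_max=10 \
    '$1 <= ttft_max && $2 <= itl_max { good++ } END { print good + 0 }'
}

# Made-up sample latencies, for illustration only.
printf '%s\n' "1200 8.5" "6100 7.0" "900 12.3" "4800 9.9" | count_good
```

With these four sample requests, two pass: one misses the TTFT limit and one misses the ITL limit.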

## Results

The disaggregated configuration with KV-aware routing, Eagle3 decoding, and KV offloading achieves the best system throughput and interactivity. Each row is that configuration's chosen operating point on the source Pareto plot. Concurrency differs by row and the values are approximate plot readings, so this is not an equal-load sweep:

| Configuration                          | Concurrency | tok/s/user (avg) | tok/s/GPU |
| -------------------------------------- | ----------: | ---------------: | --------: |
| Disagg + Eagle3 + KV routing + offload |          32 |            \~130 |   \~5,400 |
| Agg + Eagle3 + KV routing              |          24 |             \~85 |   \~4,400 |
| Agg + Eagle3 + round-robin             |          24 |             \~95 |   \~4,000 |
| Agg + round-robin (no Eagle3)          |           8 |            \~105 |   \~1,700 |

The full disaggregated stack dominates the throughput-interactivity Pareto frontier in the source plot: roughly **3x the per-GPU throughput** of the plain aggregated baseline (\~5,400 vs. \~1,700 tok/s/GPU) with better per-user token speed (\~130 vs. \~105 tok/s/user).
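As a quick check of the headline ratio, the sketch below divides the approximate plot readings from the results table. The `ratio` helper is hypothetical, and because the inputs are plot readings, the outputs are only as precise as those readings.

```bash
#!/usr/bin/env bash
# Approximate plot readings taken from the results table.
baseline_user=105; baseline_gpu=1700   # Agg + round-robin (no Eagle3)
full_user=130;     full_gpu=5400       # Disagg + Eagle3 + KV routing + offload

# ratio A B -> A/B rounded to two decimals (awk handles the floating point).
ratio() { awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f", a / b }'; }

echo "per-GPU throughput gain: $(ratio "$full_gpu" "$baseline_gpu")x"
echo "per-user speed gain:     $(ratio "$full_user" "$baseline_user")x"
```

This prints 3.18x for per-GPU throughput and 1.24x for per-user speed, consistent with the "roughly 3x" claim.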

## Compared Configurations

<table>
  <thead>
    <tr><th>Role</th><th>Configuration</th><th>Deploy</th><th>Benchmark</th></tr>
  </thead>

  <tbody>
    <tr>
      <td>
        <em>Winner</em>
      </td>

      <td>
        <strong>Disagg + Eagle3 + KV router + offload</strong>

        3x DEP4 prefill + 3x TEP4 decode, concurrency 32
      </td>

      <td>
        <a href="https://github.com/ai-dynamo/dynamo/blob/main/recipes/kimi-k2.5/trtllm/disagg-eagle-kv-router/deploy.yaml">deploy.yaml</a>
      </td>

      <td>
        <a href="https://github.com/ai-dynamo/dynamo/blob/main/recipes/kimi-k2.5/trtllm/disagg-eagle-kv-router/perf.yaml">perf.yaml</a>
      </td>
    </tr>

    <tr>
      <td>
        <em>Comparison</em>
      </td>

      <td>
        <strong>Agg + Eagle3 + KV router</strong>

        3x TEP8 aggregated, concurrency 24 — routing plus speculation
      </td>

      <td>
        <a href="https://github.com/ai-dynamo/dynamo/blob/main/recipes/kimi-k2.5/trtllm/agg-eagle-kv-router/deploy.yaml">deploy.yaml</a>
      </td>

      <td>
        <a href="https://github.com/ai-dynamo/dynamo/blob/main/recipes/kimi-k2.5/trtllm/agg-eagle-kv-router/perf.yaml">perf.yaml</a>
      </td>
    </tr>

    <tr>
      <td>
        <em>Comparison</em>
      </td>

      <td>
        <strong>Agg + Eagle3 + round-robin</strong>

        3x TEP8 aggregated, concurrency 24 — speculation without KV-aware routing
      </td>

      <td>
        <a href="https://github.com/ai-dynamo/dynamo/blob/main/recipes/kimi-k2.5/trtllm/agg-eagle-round-robin/deploy.yaml">deploy.yaml</a>
      </td>

      <td>
        <a href="https://github.com/ai-dynamo/dynamo/blob/main/recipes/kimi-k2.5/trtllm/agg-eagle-round-robin/perf.yaml">perf.yaml</a>
      </td>
    </tr>

    <tr>
      <td>
        <em>Baseline</em>
      </td>

      <td>
        <strong>Agg + round-robin</strong>

        3x TEP8 aggregated, concurrency 8 — no speculation, no P/D split
      </td>

      <td>
        <a href="https://github.com/ai-dynamo/dynamo/blob/main/recipes/kimi-k2.5/trtllm/agg-round-robin/deploy.yaml">deploy.yaml</a>
      </td>

      <td>
        <a href="https://github.com/ai-dynamo/dynamo/blob/main/recipes/kimi-k2.5/trtllm/agg-round-robin/perf.yaml">perf.yaml</a>
      </td>
    </tr>
  </tbody>
</table>
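The layouts in the table can be checked against the 24-GPU budget. The sketch below assumes that a label such as `3x TEP8` means three workers spanning eight GPUs each; that reading is an assumption, and the `gpus_for` helper is hypothetical.

```bash
#!/usr/bin/env bash
# gpus_for takes groups written as <replicas>x<gpus_per_worker>
# and prints the total GPU count across all groups.
gpus_for() {
  local total=0 group
  for group in "$@"; do
    local replicas=${group%%x*} per_worker=${group##*x}
    total=$(( total + replicas * per_worker ))
  done
  echo "$total"
}

gpus_for 3x4 3x4   # disaggregated: 3x DEP4 prefill + 3x TEP4 decode
gpus_for 3x8       # aggregated: 3x TEP8
```

Under that reading, both layouts come to 24 GPUs.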

## Reproduce

The trace emulates a long-context, KV-reuse-heavy agentic coding workload (\~200K-token context window, multi-turn sessions with restart-splits and a layered prefix-cache model). Generate it following the [dataset instructions in the AIPerf repository](https://github.com/ai-dynamo/aiperf/blob/1ecc2eac988eedc0e3a79b4c2d1063bfc295a014/src/aiperf/dataset/agentic_code_gen/datasets/1k_sessions_200k_ctx/manifest.json), then copy it to `/model-cache/traces/agent_trace_data/dataset.jsonl` on the PVC.

Each configuration's `perf.yaml` runs a warmup pass and then wraps this AIPerf command (concurrency 32 for the disaggregated configuration, 24 for the aggregated Eagle3 configurations, 8 for the baseline):

```bash
aiperf profile -m nvidia/Kimi-K2.5-NVFP4 \
  --tokenizer nvidia/Kimi-K2.5-NVFP4 --tokenizer-trust-remote-code \
  --input-file /model-cache/traces/agent_trace_data/dataset.jsonl \
  --custom-dataset-type mooncake_trace \
  --url http://<frontend>:8000 \
  --streaming --extra-inputs ignore_eos:true \
  --concurrency <8|24|32> --random-seed 42 \
  --benchmark-duration 3600 --concurrency-ramp-duration 60 \
  --goodput "time_to_first_token:5000 inter_token_latency:10"
```
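The per-configuration concurrency values can be kept in one place. The sketch below maps the recipe directory names (taken from the links in the configuration table) to the concurrency stated above; `concurrency_for` is a hypothetical helper and is not part of the recipes.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the benchmark concurrency for a recipe directory name.
concurrency_for() {
  case "$1" in
    disagg-eagle-kv-router)                    echo 32 ;;
    agg-eagle-kv-router|agg-eagle-round-robin) echo 24 ;;
    agg-round-robin)                           echo 8 ;;
    *) echo "unknown configuration: $1" >&2; return 1 ;;
  esac
}

CONFIG=agg-round-robin
printf -- '--concurrency %s\n' "$(concurrency_for "$CONFIG")"
```

An unknown name returns a non-zero status, so a wrapper script can stop before launching a benchmark with the wrong load.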

Deploy one configuration at a time — each is sized for the full 24 GPUs:

```bash
export NAMESPACE=your-namespace

# One-time prep: storage, ComputeDomain, model + Eagle3 head download
kubectl apply -f recipes/kimi-k2.5/model-cache/model-cache.yaml -n ${NAMESPACE}
kubectl apply -f recipes/kimi-k2.5/model-cache/compute-domain.yaml -n ${NAMESPACE}
kubectl apply -f recipes/kimi-k2.5/model-cache/model-download.yaml -n ${NAMESPACE}
kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=3600s

# Pick one configuration from the table above and deploy it.
kubectl apply -f recipes/kimi-k2.5/trtllm/<configuration>/deploy.yaml -n ${NAMESPACE}

# Wait until the deployment's pods are ready (for example, watch
# `kubectl get pods -n ${NAMESPACE}`), then start the benchmark.
kubectl apply -f recipes/kimi-k2.5/trtllm/<configuration>/perf.yaml -n ${NAMESPACE}
```

## Notes

* The manifests ship with a placeholder image tag (`nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:<IMAGE_TAG>`). Before applying, set a Dynamo TRT-LLM runtime image (v1.1.1 or later) that supports Kimi-K2.5 + Eagle3 in each `deploy.yaml`.
* Your Hugging Face token needs access to both `nvidia/Kimi-K2.5-NVFP4` and the `nvidia/Kimi-K2.5-Thinking-Eagle3` speculative-decoding head.
* If you rename the ComputeDomain CR, mirror the change in every `deploy.yaml` under `extraPodSpec.resourceClaims` and `resources.claims`.
* Source: [recipes/kimi-k2.5](https://github.com/ai-dynamo/dynamo/tree/main/recipes/kimi-k2.5)
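Because the manifests ship with a literal `<IMAGE_TAG>` placeholder, a small preflight check can catch a manifest that has not been edited yet. This is a minimal sketch; `has_placeholder` and `preflight` are hypothetical helpers and are not part of the recipes.

```bash
#!/usr/bin/env bash
# has_placeholder FILE... succeeds if any file still contains the literal <IMAGE_TAG>.
has_placeholder() { grep -q -F '<IMAGE_TAG>' "$@"; }

# Hypothetical preflight: fail while the placeholder is still present.
preflight() {
  if has_placeholder "$@"; then
    echo "set the runtime image tag before applying: $*" >&2
    return 1
  fi
}
```

One way to use it is to chain it with the apply step, for example `preflight recipes/kimi-k2.5/trtllm/<configuration>/deploy.yaml && kubectl apply -f recipes/kimi-k2.5/trtllm/<configuration>/deploy.yaml -n ${NAMESPACE}`.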

## Winning Configuration

The disaggregated Eagle3 + KV router + offload configuration is the winner and is deployable from its assets above. A recommended Recipe may be promoted from this benchmark in a future release; the aggregated configurations exist as benchmark steps and controls.