
# Qwen3-VL Embedding Cache A/B

Two configurations run the same `deploy.yaml` on a single aggregated GB200 worker; the only difference is the `DYN_MULTIMODAL_EMBEDDING_CACHE_GB` environment variable (10 GB for cache ON, 0 for cache OFF). The workload draws 1,000 requests from a pool of 200 images, so about 200 requests carry an image the engine has not seen yet and the remaining 800 (80%) reuse an image it has already encoded. Each cache hit skips the vision encoder on the prefill path. Enabling the cache delivers **+16.4% output throughput and -27.7% average TTFT**.

<p>
  Benchmark setup
</p>

<b>Model</b> Qwen/Qwen3-VL-30B-A3B-Instruct-FP8

<b>GPUs</b> 1x GB200 (one aggregated replica)

<b>Runtime</b> vLLM

<b>Workload</b> 1,000 single-turn multimodal requests, 1 image each from a 200-image pool (80% image reuse), 400 text tokens, concurrency 64

<b>Metrics</b> Output TPS, TTFT, ITL, and request latency

<b>Held constant</b> Model, vLLM runtime, one aggregated GB200 replica, generated dataset, request count, concurrency, and forced 150-token outputs

## Results

Enabling the embedding cache on a single aggregated GB200 replica with the vLLM backend delivers **+16% output throughput, -28% average TTFT, and -13% average request latency**. The figures come from a single representative run, reproduced from the [recipe README](https://github.com/ai-dynamo/dynamo/blob/main/recipes/qwen3-vl-30b/README.md). Delta is the cache-ON value relative to the cache-OFF baseline:

| Metric                   | Cache ON | Cache OFF |  Delta |
| ------------------------ | -------: | --------: | -----: |
| Output TPS (tok/s)       |  3,575.6 |   3,072.3 | +16.4% |
| TTFT avg (ms)            |    526.0 |     727.5 | -27.7% |
| TTFT p50 (ms)            |    356.8 |     510.8 | -30.1% |
| ITL avg (ms)             |     14.1 |      15.5 |  -8.8% |
| Request latency avg (ms) |  2,630.0 |   3,035.7 | -13.4% |
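
The Delta column can be re-derived from the two value columns. The sketch below assumes Delta is `(cache_on - cache_off) / cache_off`, which matches the published rows; `pct_delta` is a hypothetical helper name, not part of the recipe. The ITL row is left out because the rounded table values give roughly -9.0% rather than the published -8.8%, presumably because the published figure was computed from unrounded measurements.

```bash
#!/usr/bin/env bash
# Relative change of the cache-ON value against the cache-OFF baseline, in percent.
# Usage: pct_delta <cache_on> <cache_off>
# pct_delta is an illustrative helper, not something shipped with the recipe.
pct_delta() {
  awk -v on="$1" -v off="$2" 'BEGIN { printf "%+.1f%%\n", (on - off) / off * 100 }'
}

pct_delta 3575.6 3072.3   # Output TPS (tok/s)
pct_delta 526.0  727.5    # TTFT avg (ms)
pct_delta 356.8  510.8    # TTFT p50 (ms)
pct_delta 2630.0 3035.7   # Request latency avg (ms)
```

The same helper can be pointed at the values from your own AIPerf artifacts to check whether a rerun lands near the published deltas.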

## Compared Configurations

<table>
  <thead>
    <tr><th>Role</th><th>Configuration</th><th>Deploy</th><th>Benchmark</th></tr>
  </thead>

  <tbody>
    <tr>
      <td>
        <em>Comparison</em>
      </td>

      <td>
        <strong>Embedding cache ON</strong>

        DYN\_MULTIMODAL\_EMBEDDING\_CACHE\_GB=10 (deploy.yaml default)
      </td>

      <td>
        <a href="https://github.com/ai-dynamo/dynamo/blob/main/recipes/qwen3-vl-30b/vllm/agg-embedding-cache/deploy.yaml">deploy.yaml</a>
      </td>

      <td>
        <a href="https://github.com/ai-dynamo/dynamo/blob/main/recipes/qwen3-vl-30b/vllm/agg-embedding-cache/perf.yaml">perf.yaml</a>
      </td>
    </tr>

    <tr>
      <td>
        <em>Baseline</em>
      </td>

      <td>
        <strong>Embedding cache OFF</strong>

        Set DYN\_MULTIMODAL\_EMBEDDING\_CACHE\_GB=0 in deploy.yaml and CACHE\_MODE=cache\_off in perf.yaml
      </td>

      <td>
        <a href="https://github.com/ai-dynamo/dynamo/blob/main/recipes/qwen3-vl-30b/vllm/agg-embedding-cache/deploy.yaml">deploy.yaml</a>
      </td>

      <td>
        <a href="https://github.com/ai-dynamo/dynamo/blob/main/recipes/qwen3-vl-30b/vllm/agg-embedding-cache/perf.yaml">perf.yaml</a>
      </td>
    </tr>
  </tbody>
</table>

## Reproduce

A dataset-generation job creates the synthetic multimodal dataset (`qwen3_vl_1000req_1img_pool200.jsonl`: 1,000 requests, 1 image per request, 200-image pool, 400 text tokens per request) on the `perf-cache` PVC. The `perf.yaml` then wraps this AIPerf command:

```bash
aiperf profile --model Qwen/Qwen3-VL-30B-A3B-Instruct-FP8 \
  --input-file /perf-cache/datasets/qwen3_vl_1000req_1img_pool200.jsonl \
  --custom-dataset-type single_turn \
  --url http://qwen3-vl-agg-frontend:8000 --streaming \
  --request-count 1000 --concurrency 64 --warmup-request-count 3 \
  --extra-inputs max_tokens:150 --extra-inputs min_tokens:150 \
  --extra-inputs ignore_eos:true
```

Run each configuration in sequence — redeploy with the toggled cache setting between runs:

```bash
export NAMESPACE=your-namespace

# One-time prep: storage, model download, dataset generation
kubectl apply -f recipes/qwen3-vl-30b/model-cache/model-cache.yaml -n ${NAMESPACE}
kubectl apply -f recipes/qwen3-vl-30b/model-cache/model-download.yaml -n ${NAMESPACE}
kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=3600s
kubectl apply -f recipes/qwen3-vl-30b/data-gen/generate-datasets-job.yaml -n ${NAMESPACE}
kubectl wait --for=condition=Complete job/qwen3-vl-30b-generate-datasets -n ${NAMESPACE} --timeout=3600s

# Cache ON (deploy.yaml default), then benchmark
kubectl apply -f recipes/qwen3-vl-30b/vllm/agg-embedding-cache/deploy.yaml -n ${NAMESPACE}
kubectl wait --for=condition=Ready dynamographdeployment/qwen3-vl-agg -n ${NAMESPACE} --timeout=900s
kubectl apply -f recipes/qwen3-vl-30b/vllm/agg-embedding-cache/perf.yaml -n ${NAMESPACE}

# Cache OFF: set DYN_MULTIMODAL_EMBEDDING_CACHE_GB=0 in deploy.yaml and
# CACHE_MODE=cache_off in perf.yaml, then re-apply both.
```

The helper script `recipes/qwen3-vl-30b/vllm/agg-embedding-cache/run-benchmark.sh` automates each run — pass `on` or `off` to run one cache mode per invocation. AIPerf artifacts land under `/perf-cache/artifacts/qwen3_vl_30b_embedding_cache/agg/<cache_mode>`.
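
Because the helper runs one cache mode per invocation, a full A/B pass needs two calls. A minimal wrapper might look like the sketch below. The `run_ab` function and the `RUN_BENCHMARK` override variable are hypothetical names introduced here for illustration. Whether the script reads `NAMESPACE` or handles the redeploy between modes is not stated on this page, so check the script itself before relying on that.

```bash
#!/usr/bin/env bash
# Script path as given in the recipe; RUN_BENCHMARK is a hypothetical override
# for running from a different checkout location.
RUN_BENCHMARK="${RUN_BENCHMARK:-recipes/qwen3-vl-30b/vllm/agg-embedding-cache/run-benchmark.sh}"

# Run the cache-ON pass, then the cache-OFF pass; stop at the first failure
# so a broken ON run is not silently compared against an OFF run.
run_ab() {
  local mode
  for mode in on off; do
    echo "=== embedding cache: ${mode} ==="
    bash "${RUN_BENCHMARK}" "${mode}" || return 1
  done
}
```

Calling `run_ab` from the repository root, after exporting `NAMESPACE` as in the manual steps, would then run both passes back to back.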

## Notes

* Exact cache hit rates cannot be pinned via the dataset because of LRU eviction; shrinking the image pool relative to request count (or growing the cache) raises the hit probability.
* The aggregated embedding cache uses vLLM's native `ec_both` ECConnector role, supported in vLLM 0.17+ with no patches — see [multimodal vLLM docs](https://github.com/ai-dynamo/dynamo/blob/main/docs/features/multimodal/multimodal-vllm.md#embedding-cache).
* Replace the `storageClassName` and `image:` placeholders in the YAML files before running.
* Source: [recipes/qwen3-vl-30b](https://github.com/ai-dynamo/dynamo/tree/main/recipes/qwen3-vl-30b)
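
The first note above can be illustrated with a toy LRU model. This is a sketch under loud assumptions, not a model of the engine's cache: capacity is counted in images rather than gigabytes, images are requested in a fixed cyclic order (the worst case for LRU, and not necessarily the order in the generated dataset), and `lru_hit_rate` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Toy LRU simulation. Usage: lru_hit_rate <requests> <pool_size> <capacity_in_images>
# Assumptions: capacity is a number of images (the real cache is sized in GB),
# and request t references image (t - 1) mod pool_size (cyclic order).
lru_hit_rate() {
  awk -v n="$1" -v pool="$2" -v cap="$3" 'BEGIN {
    hits = 0; size = 0
    for (t = 1; t <= n; t++) {
      img = (t - 1) % pool
      if (img in last) {
        hits++
        last[img] = t                  # refresh recency on a hit
      } else if (cap > 0) {
        if (size >= cap) {             # cache full: evict the least recently used image
          oldest = t; victim = ""
          for (k in last) if (last[k] < oldest) { oldest = last[k]; victim = k }
          delete last[victim]; size--
        }
        last[img] = t; size++
      }
    }
    printf "%.1f%%\n", 100 * hits / n
  }'
}

lru_hit_rate 1000 200 200   # cache holds the whole 200-image pool
lru_hit_rate 1000 200 199   # one slot short of the pool
lru_hit_rate 1000 100 100   # smaller pool, same request count
```

Under these assumptions the three calls print 80.0%, 0.0%, and 90.0%. The first matches the workload's 80% reuse ceiling (1 - 200/1000), the second shows how eviction can erase hits once the cache is smaller than the working set, and the third shows a smaller pool raising the ceiling. Real request orders fall between these extremes.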

## Winning Configuration

The cache-ON configuration wins this comparison and can be deployed from the assets linked above; a recommended Recipe may be promoted from this benchmark in a future release. The cache-OFF configuration is the same manifest with the cache disabled, kept as the benchmark control.