Speculative Decoding with vLLM
Using Speculative Decoding with the vLLM backend.
See also: Speculative Decoding Overview for cross-backend documentation.
Prerequisites
- vLLM container with Eagle3 support
- GPU with at least 16GB VRAM
- Hugging Face access token (for gated models)
Quick Start: Meta-Llama-3.1-8B-Instruct + Eagle3
This guide walks through deploying Meta-Llama-3.1-8B-Instruct with Eagle3 speculative decoding on a single node.
Step 1: Set Up Your Docker Environment
First, initialize a Docker container using the vLLM backend. See the vLLM Quickstart Guide for details.
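As a rough sketch, and assuming the repository's container helper scripts with a vLLM framework flag, the flow looks like the following; consult the vLLM Quickstart Guide for the exact, current commands:

```bash
# Build the vLLM backend image and start an interactive container.
# Script names and the --framework flag are assumptions; follow the
# vLLM Quickstart Guide for the authoritative commands.
./container/build.sh --framework vllm
./container/run.sh -it --framework vllm
```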
Step 2: Get Access to the Llama-3 Model
The Meta-Llama-3.1-8B-Instruct model is gated. Request access to the Meta-Llama-3.1-8B-Instruct repository on Hugging Face.
Approval time varies depending on Hugging Face review traffic.
Once approved, set your access token inside the container:
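For example, export the token as an environment variable that the Hugging Face Hub client reads (the variable name below is one common choice; HUGGING_FACE_HUB_TOKEN also works):

```bash
# Inside the container: allow downloads of gated models from Hugging Face.
export HF_TOKEN=<your_huggingface_access_token>
```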
Step 3: Run Aggregated Speculative Decoding
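Launch aggregated serving with the speculative decoding script referenced in the Configuration section below, running it from the repository root inside the container (the working directory is an assumption):

```bash
# Start the target model with the Eagle3 draft model in aggregated mode.
# The first launch downloads the model weights, which can take a while.
./examples/backends/vllm/launch/agg_spec_decoding.sh
```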
Once the weights finish downloading, the server will be ready for inference requests.
Step 4: Test the Deployment
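Send an OpenAI-compatible chat completion request to the frontend. The port (8000) and the served model name below are assumptions based on common defaults; adjust them to match your deployment:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "messages": [
          {"role": "user", "content": "Explain speculative decoding in one sentence."}
        ],
        "max_tokens": 128,
        "stream": false
      }'
```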
Example Output
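The generated text varies from run to run. A successful request returns an OpenAI-style chat completion object, with the model's answer under choices[0].message.content and token counts under usage.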
Configuration
Speculative decoding in vLLM uses Eagle3 as the draft model. The launch script configures:
- Target model: meta-llama/Meta-Llama-3.1-8B-Instruct
- Draft model: Eagle3 variant
- Aggregated serving mode
See examples/backends/vllm/launch/agg_spec_decoding.sh for the full configuration.
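For orientation, the configuration amounts to pointing vLLM's speculative decoding settings at an Eagle3 draft model. The sketch below shows an equivalent standalone vLLM invocation; the draft-model repository and the num_speculative_tokens value are illustrative assumptions, and the script in the repository remains authoritative:

```bash
# Illustration only: a standalone vLLM equivalent of the launch script's
# speculative decoding setup. Draft repo and token count are assumed values.
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
  --speculative-config '{"method": "eagle3",
                         "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B",
                         "num_speculative_tokens": 3}'
```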
Limitations
- Currently only supports Eagle3 as the draft model
- Requires compatible architectures between the target and draft models