Speculative Decoding with vLLM
Using Speculative Decoding with the vLLM backend.
See also: Speculative Decoding Overview for cross-backend documentation.
Prerequisites
- vLLM container with Eagle3 support
- GPU with at least 16GB VRAM
- Hugging Face access token (for gated models)
Quick Start: Meta-Llama-3.1-8B-Instruct + Eagle3
This guide walks through deploying Meta-Llama-3.1-8B-Instruct with Eagle3 speculative decoding on a single node.
Step 1: Set Up Your Docker Environment
First, initialize a Docker container using the vLLM backend. See the vLLM Quickstart Guide for details.
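As a rough sketch, and assuming the repository's container helper scripts with a vLLM framework flag, the flow looks like the following; consult the vLLM Quickstart Guide for the exact, current commands:

```bash
# Build the vLLM backend image and start an interactive container.
# Script names and the --framework flag are assumptions; follow the
# vLLM Quickstart Guide for the authoritative commands.
./container/build.sh --framework vllm
./container/run.sh -it --framework vllm
```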
Step 2: Get Access to the Llama-3 Model
The Meta-Llama-3.1-8B-Instruct model is gated. Request access to the Meta-Llama-3.1-8B-Instruct repository on Hugging Face.
Approval time varies depending on Hugging Face review traffic.
Once approved, set your access token inside the container:
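For example, export the token as an environment variable that the Hugging Face Hub client reads (the variable name below is one common choice; HUGGING_FACE_HUB_TOKEN also works):

```bash
# Inside the container: allow downloads of gated models from Hugging Face.
export HF_TOKEN=<your_huggingface_access_token>
```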
Step 3: Run Aggregated Speculative Decoding
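Launch aggregated serving with the speculative decoding script referenced in the Configuration section below, running it from the repository root inside the container (the working directory is an assumption):

```bash
# Start the target model with the Eagle3 draft model in aggregated mode.
# The first launch downloads the model weights, which can take a while.
./examples/backends/vllm/launch/agg_spec_decoding.sh
```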
Once the weights finish downloading, the server will be ready for inference requests.
Step 4: Test the Deployment
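Send an OpenAI-compatible chat completion request to the frontend. The port (8000) and the served model name below are assumptions based on common defaults; adjust them to match your deployment:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "messages": [
          {"role": "user", "content": "Explain speculative decoding in one sentence."}
        ],
        "max_tokens": 128,
        "stream": false
      }'
```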
Example Output
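The generated text varies from run to run. A successful request returns an OpenAI-style chat completion object, with the model's answer under choices[0].message.content and token counts under usage.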
Configuration
Speculative decoding in vLLM uses Eagle3 as the draft model. The launch script configures:
- Target model: meta-llama/Meta-Llama-3.1-8B-Instruct
- Draft model: Eagle3 variant
- Aggregated serving mode
See examples/backends/vllm/launch/agg_spec_decoding.sh for the full configuration.
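For orientation, the configuration amounts to pointing vLLM's speculative decoding settings at an Eagle3 draft model. The sketch below shows an equivalent standalone vLLM invocation; the draft-model repository and the num_speculative_tokens value are illustrative assumptions, and the script in the repository remains authoritative:

```bash
# Illustration only: a standalone vLLM equivalent of the launch script's
# speculative decoding setup. Draft repo and token count are assumed values.
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
  --speculative-config '{"method": "eagle3",
                         "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B",
                         "num_speculative_tokens": 3}'
```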
Limitations
- Currently only supports Eagle3 as the draft model
- Requires compatible architectures between the target and draft models