
Running Meta-Llama-3.1-8B-Instruct with Speculative Decoding (Eagle3)


This guide walks through deploying Meta-Llama-3.1-8B-Instruct with aggregated speculative decoding using Eagle3 on a single node. Because the model is only 8B parameters, it runs on any GPU with at least 16GB of VRAM.

Step 1: Set Up Your Docker Environment

First, we’ll initialize a Docker container using the vLLM backend. You can refer to the vLLM Quickstart Guide, or follow the full steps below.

1. Launch Docker Compose

$docker compose -f deploy/docker-compose.yml up -d

2. Build the Container

$./container/build.sh --framework VLLM

3. Run the Container

$./container/run.sh -it --framework VLLM --mount-workspace

Step 2: Get Access to the Llama 3.1 Model

The Meta-Llama-3.1-8B-Instruct model is gated, so you’ll need to request access on Hugging Face. Go to the official Meta-Llama-3.1-8B-Instruct repository and fill out the access form. Approval usually takes around 5 minutes.

Once you have access, generate a Hugging Face access token with permission for gated repositories, then set it inside your container:

$export HUGGING_FACE_HUB_TOKEN="insert_your_token_here"
$export HF_TOKEN=$HUGGING_FACE_HUB_TOKEN
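Both variables carry the same credential; different tools read different names. Before launching the server, you can sanity-check that at least one of them is set. A minimal sketch (the `token_present` helper is illustrative, not part of the tooling):

```python
import os

# Check that the Hugging Face token will be visible to the download step.
# Both variable names are checked because different tools read different ones.
def token_present(env=os.environ):
    return bool(env.get("HF_TOKEN") or env.get("HUGGING_FACE_HUB_TOKEN"))

print(token_present({"HF_TOKEN": "hf_example"}))  # True in a populated environment
print(token_present({}))                          # False: gated downloads will fail
```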

Step 3: Run Aggregated Speculative Decoding

Now that your environment is ready, start the aggregated server with speculative decoding.

$# Requires only one GPU
$cd examples/backends/vllm
$bash launch/agg_spec_decoding.sh

Once the weights finish downloading and serving begins, you’ll be ready to send inference requests to your model.

Step 4: Example Request

To verify your setup, try sending a simple prompt to your model:

$curl http://localhost:8000/v1/chat/completions \
> -H "Content-Type: application/json" \
> -d '{
> "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
> "messages": [
> {"role": "user", "content": "Write a poem about why Sakura trees are beautiful."}
> ],
> "max_tokens": 250
> }'
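The same request can be assembled in Python. The sketch below only builds the JSON body; to actually send it, POST `body` to `http://localhost:8000/v1/chat/completions` (for example with `requests.post`) once the server from Step 3 is up:

```python
import json

# Build the same chat-completions request body as the curl example above.
payload = {
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [
        {"role": "user",
         "content": "Write a poem about why Sakura trees are beautiful."}
    ],
    "max_tokens": 250,
}
body = json.dumps(payload)
print(body)
```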

Example Output

{
  "id": "cmpl-3e87ea5c-010e-4dd2-bcc4-3298ebd845a8",
  "choices": [
    {
      "text": "In cherry blossom’s gentle breeze ... A delicate balance of life and death, as petals fade, and new life breathes.",
      "index": 0,
      "finish_reason": "stop"
    }
  ],
  "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
  "usage": {
    "prompt_tokens": 16,
    "completion_tokens": 250,
    "total_tokens": 266
  }
}
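In client code you would typically extract the generated text and token usage from the response. A short sketch, using a dict shaped like the sample above (field names taken from that sample; real responses may carry additional fields):

```python
# Extract the generated text and token counts from a chat-completions response
# shaped like the sample output above.
sample = {
    "id": "cmpl-3e87ea5c-010e-4dd2-bcc4-3298ebd845a8",
    "choices": [
        {"text": "In cherry blossom's gentle breeze ...",
         "index": 0,
         "finish_reason": "stop"}
    ],
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "usage": {"prompt_tokens": 16, "completion_tokens": 250, "total_tokens": 266},
}

text = sample["choices"][0]["text"]
total_tokens = sample["usage"]["total_tokens"]
print(total_tokens)  # 266
```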
