# Llama 4 Maverick Instruct with Eagle Speculative Decoding on SLURM
This guide demonstrates how to deploy Llama 4 Maverick Instruct with Eagle Speculative Decoding on GB200x4 nodes. We will be following the multi-node deployment instructions to set up the environment for the following scenarios:
- **Aggregated Serving:** Deploy the entire Llama 4 model on a single GB200x4 node for end-to-end serving.
- **Disaggregated Serving:** Distribute the workload across two GB200x4 nodes:
  - One node runs the decode worker.
  - The other node runs the prefill worker.
## Notes
- Make sure `eagle3_one_model: true` is set in the LLM API config inside the `examples/backends/trtllm/engine_configs/llama4/eagle` folder; a quick check is sketched below.
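As a quick sanity check, you can grep the engine configs for the flag. This is a minimal sketch; the exact file names under the `eagle` folder may differ by release:

```bash
# Verify that the Eagle one-model flag is enabled in every engine config
# under the eagle folder (run from the repository root).
grep -rn "eagle3_one_model" examples/backends/trtllm/engine_configs/llama4/eagle/
```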
## Setup
Assuming you have already allocated your nodes via `salloc` and are inside an interactive shell on one of the allocated nodes, set the following environment variables based on your environment:
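The exact set of variables is defined by the multi-node launch scripts; the names below (`IMAGE`, `MOUNTS`, `MODEL_PATH`, `SERVED_MODEL_NAME`) and their values are a sketch, so substitute values for your cluster:

```bash
# Container image with the TensorRT-LLM backend (placeholder tag).
export IMAGE="<your_container_image>"
# Host paths to mount into the container (placeholder mapping).
export MOUNTS="${PWD}:/mnt"
# Model to serve; adjust if you use a local checkpoint path instead.
export MODEL_PATH="nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8"
# Name under which the model is exposed to clients.
export SERVED_MODEL_NAME="nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8"
```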
See this section of the multi-node guide to learn more about the options above.
## Aggregated Serving
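A sketch of launching aggregated serving, assuming the `srun_aggregated.sh` helper from the multi-node guide and an `eagle_agg.yaml` config in the folder noted above (both names are assumptions; use whatever your checkout provides):

```bash
# Point the launcher at the aggregated Eagle engine config (path is illustrative).
export ENGINE_CONFIG="/mnt/engine_configs/llama4/eagle/eagle_agg.yaml"
# Launch the full model on a single GB200x4 node.
./multinode/srun_aggregated.sh
```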
## Disaggregated Serving
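Similarly, a sketch for disaggregated serving, assuming a `srun_disaggregated.sh` helper plus separate prefill and decode engine configs (again, the script and file names are assumptions):

```bash
# One node runs the prefill worker, one runs the decode worker,
# each with its own engine config (file names are illustrative).
export NUM_PREFILL_NODES=1
export PREFILL_ENGINE_CONFIG="/mnt/engine_configs/llama4/eagle/eagle_prefill.yaml"
export NUM_DECODE_NODES=1
export DECODE_ENGINE_CONFIG="/mnt/engine_configs/llama4/eagle/eagle_decode.yaml"
# Launch prefill and decode workers across the two GB200x4 nodes.
./multinode/srun_disaggregated.sh
```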
## Example Request
See here to learn how to send a request to the deployment.
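For reference, a request against an OpenAI-compatible endpoint might look like the following; the frontend host and port (`localhost:8000`) are assumptions, so match them to your deployment, and keep the model name consistent with `SERVED_MODEL_NAME` above:

```bash
# Send a chat completion request to the deployed model.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8",
    "messages": [{"role": "user", "content": "Why is the sky blue?"}],
    "max_tokens": 64
  }'
```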