Example: Multi-node TRTLLM Workers with Dynamo on Slurm for multimodal models
Note: The scripts referenced in this example (such as
`srun_aggregated.sh` and `srun_disaggregated.sh`) can be found in `examples/basics/multinode/trtllm/`.
This guide demonstrates how to deploy large multimodal models that require a multi-node setup. It builds on the general multi-node deployment process described in the main multinode-examples.md guide.
Before you begin, ensure you have completed the initial environment configuration by following the Setup section in that guide.
The following sections provide specific instructions for deploying meta-llama/Llama-4-Maverick-17B-128E-Instruct, including environment variable setup and launch commands. These steps can be adapted for other large multimodal models.
Environment Variable Setup
Assuming you have already allocated your nodes via `salloc` and are
inside an interactive shell on one of the allocated nodes, set the
following environment variables based on your allocation:
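For example, a setup for a four-node GB200 allocation might look like the following. The variable names and values here are illustrative (the exact variables expected by the scripts are defined in the main multinode-examples.md guide); adjust them to your cluster:

```shell
# Illustrative values -- substitute the model path and node/GPU counts
# that match your allocation. Variable names are assumptions based on
# the general multinode setup guide.
export MODEL_PATH="meta-llama/Llama-4-Maverick-17B-128E-Instruct"
export SERVED_MODEL_NAME="meta-llama/Llama-4-Maverick-17B-128E-Instruct"
export NUM_NODES=4            # total nodes in the salloc allocation
export NUM_GPUS_PER_NODE=4    # e.g. 4 for 4xGB200 nodes
```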
Disaggregated Mode
Assuming you have at least four 4xGB200 nodes allocated (two for prefill, two for decode) following the setup above, follow the steps below to launch a disaggregated deployment across the four nodes:
[!Tip] Make sure you have a fresh environment and that the aggregated example above is not still deployed on the same set of nodes.
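With the environment variables from the setup section exported, the launch itself is a single script invocation from the interactive shell on an allocated node (this sketch assumes the script reads your node/GPU configuration from the environment, as described in the main guide):

```shell
# Launch the disaggregated deployment (frontend + prefill + decode srun jobs).
# Run from inside the salloc interactive shell, in the directory containing
# the script (examples/basics/multinode/trtllm/).
./srun_disaggregated.sh
```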
Understanding the Output
- `srun_disaggregated.sh` launches three srun jobs instead of two: one for the frontend, one for the prefill worker, and one for the decode worker.
- The OpenAI frontend will listen for and dynamically discover workers as they register themselves with Dynamo's distributed runtime:
- The TRTLLM worker will consist of N MPI ranks (N=8 for TP8), one rank per GPU on each node, each of which outputs its progress while loading the model. Each log line is prefixed with its rank until the model successfully finishes loading:
- After the model fully finishes loading on all ranks, the worker will register itself, and the OpenAI frontend will detect it, signaled by this output:
- At this point, with the worker fully initialized and detected by the frontend, it is ready for inference.
Example Request
To verify the deployed model is working, send a curl request:
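A request against the OpenAI-compatible frontend might look like the sketch below. The host, port, and image URL are placeholders (adjust them to wherever your frontend is listening); since this is a multimodal model, the example includes an `image_url` content part alongside text:

```shell
# Placeholder host/port and image URL -- adjust to your deployment.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this image in one sentence."},
          {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
        ]
      }
    ],
    "max_tokens": 64
  }'
```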
Cleanup
To clean up background srun processes launched by `srun_aggregated.sh` or
`srun_disaggregated.sh`, you can run:
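A minimal sketch of the cleanup, assuming the srun steps are still running under your interactive shell (note this kills all srun processes owned by your user on that node, so use it only when nothing else is running):

```shell
# Kill the background srun steps started by the launch scripts.
pkill srun
```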
Known Issues
- Loading
`meta-llama/Llama-4-Maverick-17B-128E-Instruct` on 8 nodes of H100s with TP=16 is not possible: Llama 4 Maverick's config sets `"num_attention_heads": 40`, and the TRTLLM engine asserts `self.num_heads % tp_size == 0`, causing the engine to crash on model loading.
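The assertion comes down to simple divisibility: with 40 attention heads, TP=16 fails (40 % 16 == 8) while TP=8 works. A quick sketch of the constraint (not the actual TRTLLM code):

```python
NUM_ATTENTION_HEADS = 40  # from Llama 4 Maverick's config.json

def valid_tp_sizes(num_heads: int, candidates=(1, 2, 4, 8, 16, 32)):
    """Return the TP sizes that satisfy the num_heads % tp_size == 0 check."""
    return [tp for tp in candidates if num_heads % tp == 0]

print(valid_tp_sizes(NUM_ATTENTION_HEADS))  # TP=16 and TP=32 are excluded
```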