Example: Multi-node TRTLLM Workers with Dynamo on Slurm
Note: The scripts referenced in this example (such as
`srun_aggregated.sh` and `srun_disaggregated.sh`) can be found in `examples/basics/multinode/trtllm/`.
To run a single Dynamo+TRTLLM worker that spans multiple nodes (ex: TP16),
the set of nodes needs to be launched together in the same MPI world, such as
via `mpirun` or `srun`. This is true regardless of whether the worker is
aggregated, prefill-only, or decode-only.
In this document we will demonstrate two examples of launching multinode workers
on a slurm cluster with `srun`:
- Deploying an aggregated nvidia/DeepSeek-R1 model as a multi-node TP16/EP16 worker across 4 GB200 nodes
- Deploying a disaggregated nvidia/DeepSeek-R1 model with a multi-node TP16/EP16 prefill worker (4 nodes) and a multi-node TP16/EP16 decode worker (4 nodes) across a total of 8 GB200 nodes.
NOTE: Some of the scripts used in this example, like `start_frontend_services.sh` and
`start_trtllm_worker.sh`, should be translatable to other environments (such as Kubernetes, or
using `mpirun` directly) with relative ease.
Setup
For simplicity, we will make some assumptions about your slurm cluster:
- First, we assume you have access to a slurm cluster with multiple GPU nodes available. For functional testing, most setups should be fine. For performance testing, you should aim to allocate groups of nodes that are performantly inter-connected, such as those in an NVL72 setup.
- Second, we assume this slurm cluster has the Pyxis SPANK plugin setup. In particular, the `srun_aggregated.sh` script in this example will use `srun` arguments like `--container-image`, `--container-mounts`, and `--container-env` that are added to `srun` by Pyxis. If your cluster supports similar container-based plugins, you may be able to modify the script to use those instead.
- Third, we assume you have already built a recent Dynamo+TRTLLM container image as described here. This is the image that can be set to the `IMAGE` environment variable in later steps.
- Fourth, we assume you pre-allocate a group of nodes using `salloc`. We will allocate 8 nodes below as a reference command to have enough capacity to run both examples. If you plan to only run the aggregated example, you will only need 4 nodes. If you customize the configurations to require a different number of nodes, you can adjust the number of allocated nodes accordingly. Pre-allocating nodes is technically not a requirement, but it makes iterating on testing/experimenting easier. Make sure to set your `PARTITION` and `ACCOUNT` according to your slurm cluster setup; see the reference `salloc` command after this list.
- Lastly, we will assume you are inside an interactive shell on one of your allocated nodes, which may be the default behavior after executing the `salloc` command from the previous step, depending on the cluster setup. If not, then you should SSH into one of the allocated nodes.
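For reference, a pre-allocation might look like the sketch below. Only the 8-node count and the `PARTITION`/`ACCOUNT` placeholders come from the text above; the remaining flags are assumptions, so adjust them for your cluster:

```bash
# Set these according to your slurm cluster setup.
export PARTITION="<your_partition>"
export ACCOUNT="<your_account>"

# Allocate 8 nodes (4 is enough if you only plan to run the aggregated example).
salloc \
  --partition "${PARTITION}" \
  --account "${ACCOUNT}" \
  --nodes 8 \
  --time 04:00:00
```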
Environment Variable Setup
This example aims to automate as much of the environment setup as possible, but all slurm clusters and environments are different, and you may need to dive into the scripts to make modifications based on your specific environment.
Assuming you have already allocated your nodes via `salloc` and are
inside an interactive shell on one of the allocated nodes, set the
following environment variables based on your environment:
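The authoritative list of variables lives in the scripts under `examples/basics/multinode/trtllm/`; the sketch below only illustrates the general shape. `IMAGE` is the container image from the setup section, while the other names are hypothetical placeholders that you should check against the scripts:

```bash
# Container image built from the Dynamo+TRTLLM container build described above.
export IMAGE="<your_dynamo_trtllm_image>"

# Hypothetical placeholders -- confirm the exact variable names and values
# expected by srun_aggregated.sh / srun_disaggregated.sh in this directory.
export MOUNTS="/path/on/host:/path/in/container"
export MODEL_PATH="<path_or_hf_id_for_deepseek_r1_weights>"
export SERVED_MODEL_NAME="nvidia/DeepSeek-R1"
```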
Aggregated WideEP
Assuming you have at least 4 nodes allocated following the setup steps above, follow the steps below to launch an aggregated deployment across 4 nodes:
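A minimal sketch of the launch, assuming the environment variables above are set and exported before invoking the script (check `srun_aggregated.sh` itself for the exact variables and arguments it expects):

```bash
cd examples/basics/multinode/trtllm/

# Launches etcd/NATS/frontend on the head node and the 4-node TP16/EP16
# aggregated worker across the allocation.
./srun_aggregated.sh
```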
Disaggregated WideEP
Assuming you have at least 8 nodes allocated (4 for prefill, 4 for decode) following the setup above, follow the steps below to launch a disaggregated deployment across 8 nodes:
[!Tip] Make sure you have a fresh environment and don't still have the aggregated example above deployed on the same set of nodes.
[!Tip] To launch multiple replicas of the configured prefill/decode workers, you can set `NUM_PREFILL_WORKERS` and `NUM_DECODE_WORKERS` respectively (default: 1).
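A minimal sketch of the disaggregated launch, again assuming the environment variables above are set and that `srun_disaggregated.sh` reads them (verify the exact variables and arguments against the script):

```bash
cd examples/basics/multinode/trtllm/

# Optional: number of replicas of the configured prefill/decode workers (default: 1).
export NUM_PREFILL_WORKERS=1
export NUM_DECODE_WORKERS=1

# Launches the frontend services, the 4-node prefill worker, and the 4-node decode worker.
./srun_disaggregated.sh
```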
Understanding the Output
- The `srun_aggregated.sh` script launches two `srun` jobs. The first launches etcd, NATS, and the OpenAI frontend on the head node only, called "node1" in the example output below. The second launches a single TP16 Dynamo+TRTLLM worker spread across 4 nodes, using 4 GPUs on each node.
- The OpenAI frontend will listen for and dynamically discover workers as they register themselves with Dynamo's distributed runtime:
- The TRTLLM worker will consist of N MPI ranks (N=16 for TP16), 1 rank per GPU on each node, each of which will output its progress while loading the model. You can see each rank's output prefixed with the rank at the start of each log line until the model successfully finishes loading:
- After the model fully finishes loading on all ranks, the worker will register itself, and the OpenAI frontend will detect it, signaled by this output:
- At this point, with the worker fully initialized and detected by the frontend, it is now ready for inference.
- `srun_disaggregated.sh` follows a very similar flow, but launches three `srun` jobs instead of two: one for the frontend, one for the prefill worker, and one for the decode worker.
Example Request
To verify the deployed model is working, send a curl request:
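A sketch of such a request against the OpenAI-compatible frontend is shown below. The host, port, and model name are assumptions; adjust them to match where your frontend is listening and the served model name you configured:

```bash
# Run from the head node (or replace localhost with the head node's hostname).
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/DeepSeek-R1",
    "messages": [{"role": "user", "content": "Why is the sky blue?"}],
    "max_tokens": 64
  }'
```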
Cleanup
To clean up background `srun` processes launched by `srun_aggregated.sh` or
`srun_disaggregated.sh`, you can run:
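One possible approach, assuming the jobs were launched from your current interactive shell, is sketched below; `pkill` on `srun` is coarse, so make sure it will not hit unrelated jobs you own:

```bash
# Kill the background srun jobs started by the launch scripts.
pkill -u "$USER" srun

# Alternatively, release the entire allocation once you are done:
# scancel <your_job_id>
```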
Known Issues
- This example has only been tested on a 4xGB200 node setup with 16 GPUs using FP4 weights. In theory, the example should work on alternative setups such as H100 nodes with FP8 weights, but this hasn’t been tested yet.
- WideEP configs in this directory are still being tested. A WideEP specific example with documentation will be added once ready.
- There are known issues where WideEP workers may not cleanly shut down:
  - This may lead to leftover shared memory files in `/dev/shm/moe_*`. For now, you must manually clean these up before deploying again on the same set of nodes; see the sketch after this list.
  - Similarly, there may be GPU memory left in use after killing the `srun` jobs. After cleaning up any leftover shared memory files as described above, the GPU memory may slowly come back. You can run `watch nvidia-smi` to check on this behavior. If you don't free the GPU memory before the next deployment, you may get a CUDA OOM error while loading the model.
  - There is mention of this issue in the relevant TRT-LLM blog here.
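A hedged sketch for clearing those leftover files on every node in the allocation (the glob matches the `/dev/shm/moe_*` paths mentioned above; adjust it if your deployment uses different names):

```bash
# Remove leftover MoE shared-memory files on all allocated nodes.
srun --nodes="${SLURM_NNODES}" --ntasks-per-node=1 \
  bash -c 'rm -f /dev/shm/moe_*'
```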