FastVideo
This guide covers deploying FastVideo text-to-video generation on Dynamo using a custom worker (worker.py) exposed through the /v1/videos endpoint.
Dynamo also supports diffusion through built-in backends: SGLang Diffusion (LLM diffusion, image, video), vLLM-Omni (text-to-image, text-to-video), and TRT-LLM Video Diffusion. See the Diffusion Overview for the full support matrix.
Overview
- Default model: FastVideo/LTX2-Distilled-Diffusers, a distilled variant of the LTX-2 Diffusion Transformer (Lightricks) that reduces inference from 50+ steps to just 5.
- Two-stage pipeline: Stage 1 generates video at the target resolution; Stage 2 refines it with a distilled LoRA for improved fidelity and texture.
- Optimized inference: FP4 quantization and torch.compile are enabled by default for maximum throughput.
- Response format: one complete MP4 payload per request, returned as data[0].b64_json (non-streaming).
- Concurrency: one request at a time per worker (VideoGenerator is not re-entrant). Scale throughput by running multiple workers.
This example is optimized for NVIDIA B200/B300 GPUs (CUDA arch 10.0) with FP4 quantization and flash-attention. It can run on other GPUs (H100, A100, etc.) by passing --disable-optimizations to worker.py, which disables FP4 quantization, torch.compile, and switches the attention backend from FLASH_ATTN to TORCH_SDPA. Expect lower performance but broader compatibility.
Docker Image Build
The local Docker workflow builds a runtime image from the Dockerfile:
- Base image: nvidia/cuda:13.1.1-devel-ubuntu24.04
- Installs FastVideo from GitHub
- Installs Dynamo from the release/1.0.0 branch (for /v1/videos support)
- Compiles a flash-attention fork from source
The first Docker image build can take 20–40+ minutes because FastVideo and CUDA-dependent components are compiled during the build. Subsequent builds are much faster if Docker layer cache is preserved. Compiling flash-attention can use significant RAM — low-memory builders may hit out-of-memory failures. If that happens, lower MAX_JOBS in the Dockerfile to reduce parallel compile memory usage. The flash-attn install notes specifically recommend this on machines with less than 96 GB RAM and many CPU cores.
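For constrained builders, a sketch of the build invocation with a reduced job count; this assumes MAX_JOBS is exposed as a build ARG in the Dockerfile (if it is a plain ENV, edit the Dockerfile directly instead), and the image tag is a placeholder:

```shell
# Cap parallel compile jobs to keep flash-attention compilation within RAM limits.
# MAX_JOBS as a --build-arg is an assumption; adjust to how the Dockerfile defines it.
docker build --build-arg MAX_JOBS=4 -t fastvideo-dynamo:latest .
```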
Warmup Time
On first start, workers download model weights and run compile/warmup steps. Expect roughly 10–20 minutes before the first request is ready (hardware-dependent). After the first successful response, the second request can still take around 35 seconds while runtime caches finish warming up; steady-state performance is typically reached from the third request onward.
When using Kubernetes, mount a shared Hugging Face cache PVC (see Kubernetes Deployment) so model weights are downloaded once and reused across pod restarts.
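To script around the warmup window, a client can poll until the frontend answers before sending the first request. A minimal sketch; the /health route is an assumption, so substitute the frontend's actual readiness endpoint:

```shell
# Wait for the frontend to come up; first start can take 10-20 minutes of warmup.
# Polls every 15 seconds for up to 30 minutes (120 attempts).
for i in $(seq 1 120); do
  if curl -sf http://localhost:8000/health > /dev/null; then
    echo "frontend ready"
    break
  fi
  sleep 15
done
```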
Local Deployment
Prerequisites
For Docker Compose:
- Docker Engine 26.0+
- Docker Compose v2
- NVIDIA Container Toolkit
For host-local script:
- Python environment with Dynamo + FastVideo dependencies installed
- CUDA-compatible GPU runtime available on host
Option 1: Docker Compose
The Compose file builds from the Dockerfile and exposes the API on http://localhost:8000. See the Docker Image Build section for build time expectations.
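Assuming the Compose file sits alongside the Dockerfile in the example directory, bringing the stack up is the standard invocation:

```shell
# Build the runtime image (first build may take 20-40+ minutes) and start in the background.
docker compose up --build -d
# Follow logs while weights download and warmup runs.
docker compose logs -f
```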
Option 2: Host-Local Script
Environment variables:
Example:
--disable-optimizations is a worker.py flag (not a dynamo.frontend flag), so pass it through WORKER_EXTRA_ARGS.
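For example (the launch script name below is a placeholder for this example's host-local run script):

```shell
# worker.py-only flags are forwarded via WORKER_EXTRA_ARGS, not passed to dynamo.frontend.
export WORKER_EXTRA_ARGS="--disable-optimizations"
# ./run.sh   # placeholder: invoke the example's host-local launch script here
echo "worker extra args: ${WORKER_EXTRA_ARGS}"
```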
The script writes logs to:
- .runtime/logs/worker.log
- .runtime/logs/frontend.log
Kubernetes Deployment
Files
Prerequisites
- Dynamo Kubernetes Platform installed
- GPU-enabled Kubernetes cluster
- FastVideo runtime image pushed to your registry
- Optional HF token secret (for gated models)
Create a Hugging Face token secret if needed:
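A minimal sketch using kubectl; the secret name (hf-token-secret), key name (HF_TOKEN), and namespace variable are assumptions, so match whatever agg_user_workload.yaml references:

```shell
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN="${HF_TOKEN}" \
  --namespace "${NAMESPACE}"
```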
Deploy
For clusters with tainted user-workload nodes and private registry pulls:
- Set your pull secret name and image in agg_user_workload.yaml.
- Apply the manifest.
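The apply step is the standard kubectl invocation (the namespace variable is an assumption):

```shell
kubectl apply -f agg_user_workload.yaml --namespace "${NAMESPACE}"
```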
Update Image Quickly
Verify and Access
Test Request
If this is the first request after startup, expect it to take longer while warmup completes. See Warmup Time for details.
Send a request and decode the response:
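A minimal Python sketch, assuming the frontend listens at http://localhost:8000 and that a request body needs only model and prompt (additional nvext parameters are listed in the reference sections later in this guide):

```python
import base64
import json
import urllib.request

URL = "http://localhost:8000/v1/videos"  # frontend address from this guide

def generate_video(prompt: str, url: str = URL) -> dict:
    """POST a text-to-video request; blocks until the MP4 is ready (non-streaming)."""
    body = json.dumps({
        "model": "FastVideo/LTX2-Distilled-Diffusers",
        "prompt": prompt,
    }).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def decode_video(response_json: dict) -> bytes:
    """Extract and decode the single MP4 payload returned in data[0].b64_json."""
    return base64.b64decode(response_json["data"][0]["b64_json"])
```

To save the result, write the decoded bytes to a file, for example open("output.mp4", "wb").write(decode_video(generate_video("A timelapse of clouds over mountains"))).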
Worker Configuration Reference
CLI Flags
Request Parameters (nvext)
Environment Variables
Troubleshooting
Source Code
The example source lives at examples/diffusers/ in the Dynamo repository.
See Also
- vLLM-Omni Text-to-Video — vLLM-Omni video generation via /v1/videos
- vLLM-Omni Text-to-Image — vLLM-Omni image generation
- SGLang Video Generation — SGLang video generation worker
- SGLang Image Diffusion — SGLang image diffusion worker
- TRT-LLM Video Diffusion — TensorRT-LLM video diffusion quick start
- Diffusion Overview — Full backend support matrix