For AI agents: a documentation index is available at the root level at /llms.txt and /llms-full.txt. Append /llms.txt to any URL for a page-level index, or .md for the markdown version of any page.
Digest
  • Getting Started
    • Quickstart
    • Introduction
    • Local Installation
    • Building from Source
    • Kubernetes Deployment
    • Contribution Guide
  • Resources
    • Support Matrix
    • Feature Matrix
    • Release Artifacts
    • Examples
    • Glossary
  • Digest
    • NVIDIA Dynamo Snapshot: Fast Startup for Inference Workloads on Kubernetes
    • DynoSim: Simulating the Pareto Frontier
    • Dynamo Day 0 support for TokenSpeed
    • Multi-Turn Agentic Harnesses
    • Full-Stack Optimizations for Agentic Inference
    • Flash Indexer: Inter-Galactic KV Routing
  • Kubernetes Deployment
  • Feature Guides
    • KV Cache Aware Routing
    • Disaggregated Serving
    • KV Cache Offloading
    • Benchmarking
    • Tool Calling & Reasoning Parsing
    • Fault Tolerance
    • Observability (Local)
    • Inference Simulation
    • Agents
    • LoRA Adapters
    • Multimodal
    • Diffusion
    • Fastokens Tokenizer
  • Backends
    • SGLang
    • TensorRT-LLM
    • vLLM
  • Components
    • Frontend
      • Frontend Guide
      • Configuration Reference
      • Tokenizer
    • Router
    • Planner
    • Profiler
    • KVBM
  • Integrations
  • Design Docs
    • Overall Architecture
    • Architecture Flow
    • Disaggregated Serving
    • Distributed Runtime
  • Documentation
    • Dynamo Docs Guide
NVIDIANVIDIA
Developer-friendly docs for your API
Privacy Policy | Your Privacy Choices | Terms of Service | Accessibility | Corporate Policies | Product Security | Contact

Copyright © 2026, NVIDIA Corporation.

LogoLogoDocumentation
Digest
On this page
  • HTTP & Networking
  • Router
  • AIC Prefill Load Model
  • Fault Tolerance
  • Model Discovery
  • Infrastructure
  • KServe gRPC
  • Monitoring
  • Tokenizer
  • Experimental
  • HTTP Endpoints
  • OpenAI-Compatible
  • Anthropic (Experimental)
  • Infrastructure
  • Endpoint Path Customization
  • Deprecated
  • See Also
ComponentsFrontend

Frontend Configuration Reference

Complete reference for all frontend CLI arguments, environment variables, and HTTP endpoints
||View as Markdown|
Previous

Frontend Guide

Next

Tokenizer

This page documents all configuration options for the Dynamo Frontend (python -m dynamo.frontend).

Every CLI argument has a corresponding environment variable. CLI arguments take precedence over environment variables.

HTTP & Networking

CLI ArgumentEnv VarDefaultDescription
--http-hostDYN_HTTP_HOST0.0.0.0HTTP listen address
--http-portDYN_HTTP_PORT8000HTTP listen port
--tls-cert-pathDYN_TLS_CERT_PATH—TLS certificate path (PEM). Must be paired with --tls-key-path
--tls-key-pathDYN_TLS_KEY_PATH—TLS private key path (PEM). Must be paired with --tls-cert-path

The Rust HTTP server also reads these environment variables (not exposed as CLI args):

Env VarDefaultDescription
DYN_HTTP_BODY_LIMIT_MB192Maximum request body size in MB
DYN_HTTP_GRACEFUL_SHUTDOWN_TIMEOUT_SECS5Graceful shutdown timeout in seconds

Router

CLI ArgumentEnv VarDefaultDescription
--router-modeDYN_ROUTER_MODEround-robinRouting strategy: round-robin, random, kv, direct, least-loaded, device-aware-weighted
--load-aware / --no-load-awareDYN_ROUTER_LOAD_AWAREfalsePreset for KV load-aware routing without cache-reuse signals; implies --router-mode kv
--router-kv-overlap-score-creditDYN_ROUTER_KV_OVERLAP_SCORE_CREDIT1.0Credit multiplier for device-local prefix overlap, from 0.0 to 1.0
--router-prefill-load-scaleDYN_ROUTER_PREFILL_LOAD_SCALE1.0Scale adjusted prompt-side prefill load before adding decode blocks
--router-temperatureDYN_ROUTER_TEMPERATURE0.0Softmax temperature for normalized worker sampling. 0 = deterministic
--router-kv-events / --no-router-kv-eventsDYN_ROUTER_USE_KV_EVENTStrueEnable KV cache state events from workers. Disable for prediction-based routing
--router-ttl-secsDYN_ROUTER_TTL_SECS120.0Block TTL when KV events are disabled
--router-replica-sync / --no-router-replica-syncDYN_ROUTER_REPLICA_SYNCfalseSync state across multiple router instances
--router-snapshot-thresholdDYN_ROUTER_SNAPSHOT_THRESHOLD1000000Messages before triggering a snapshot
--router-reset-states / --no-router-reset-statesDYN_ROUTER_RESET_STATESfalseReset router state on startup. Warning: affects existing replicas
--router-track-active-blocks / --no-router-track-active-blocksDYN_ROUTER_TRACK_ACTIVE_BLOCKStrueTrack blocks used by in-progress requests for load balancing
--router-assume-kv-reuse / --no-router-assume-kv-reuseDYN_ROUTER_ASSUME_KV_REUSEtrueAssume KV cache reuse when tracking active blocks
--router-track-output-blocks / --no-router-track-output-blocksDYN_ROUTER_TRACK_OUTPUT_BLOCKSfalseTrack output blocks with fractional decay during generation
--router-track-prefill-tokens / --no-router-track-prefill-tokensDYN_ROUTER_TRACK_PREFILL_TOKENStrueTrack prompt-side prefill load in worker load accounting
--router-prefill-load-modelDYN_ROUTER_PREFILL_LOAD_MODELnonePrompt-side load model: none for static load, aic for oldest-prefill decay using an AIC prediction
--router-event-threadsDYN_ROUTER_EVENT_THREADS4KV indexer worker threads. >1 enables the concurrent radix tree, including with --no-router-kv-events
--router-queue-thresholdDYN_ROUTER_QUEUE_THRESHOLD16.0Queue threshold fraction of prefill capacity. Priority hints only affect requests waiting in this queue
--router-queue-policyDYN_ROUTER_QUEUE_POLICYfcfsQueue scheduling policy: fcfs (tail TTFT), wspt (avg TTFT), or lcfs (comparison-only reverse ordering)
--decode-fallback / --no-decode-fallbackDYN_DECODE_FALLBACKfalseFall back to aggregated mode when prefill workers unavailable

AIC Prefill Load Model

These options are used only when --router-mode kv is combined with --router-prefill-load-model aic.

CLI ArgumentEnv VarDefaultDescription
--aic-backendDYN_AIC_BACKEND—Backend family to model in AIC, for example vllm or sglang
--aic-systemDYN_AIC_SYSTEM—AIC hardware/system identifier, for example h200_sxm
--aic-model-pathDYN_AIC_MODEL_PATH—Model path or model identifier used for AIC perf lookup
--aic-backend-versionDYN_AIC_BACKEND_VERSIONbackend-specificPinned AIC database version. If omitted, Dynamo uses the backend default
--aic-tp-sizeDYN_AIC_TP_SIZE1Tensor-parallel size to model in AIC
--aic-moe-tp-sizeDYN_AIC_MOE_TP_SIZE—MoE tensor-parallel size for models that require AIC MoE parallelism
--aic-moe-ep-sizeDYN_AIC_MOE_EP_SIZE—MoE expert-parallel size for models that require AIC MoE parallelism
--aic-attention-dp-sizeDYN_AIC_ATTENTION_DP_SIZE—Attention data-parallel size for models that require AIC MoE parallelism

When enabled, the frontend’s embedded KV router predicts one expected prefill duration per admitted request, using the selected worker’s overlap-derived cached prefix. The router then decays only the oldest active prefill request on each worker for prompt-side load accounting.

For MoE models, AIC requires aic_tp_size * aic_attention_dp_size == aic_moe_tp_size * aic_moe_ep_size. For Kimi-style TP-only MoE runs, set --aic-moe-tp-size to the same value as --aic-tp-size, with --aic-moe-ep-size 1 and --aic-attention-dp-size 1.

Fault Tolerance

CLI ArgumentEnv VarDefaultDescription
--migration-limitDYN_MIGRATION_LIMIT0Max request migrations per worker disconnect. 0 = disabled
--active-decode-blocks-thresholdDYN_ACTIVE_DECODE_BLOCKS_THRESHOLD1.0KV cache utilization fraction (0.0–1.0) for busy detection. Pass None to disable
--active-prefill-tokens-thresholdDYN_ACTIVE_PREFILL_TOKENS_THRESHOLD10000000Absolute token count for prefill busy detection. Pass None to disable
--active-prefill-tokens-threshold-fracDYN_ACTIVE_PREFILL_TOKENS_THRESHOLD_FRAC64.0Fraction of max_num_batched_tokens for prefill busy detection. OR logic with absolute threshold. Pass None to disable
--admission-controlDYN_ADMISSION_CONTROLnoneAdmission control mode. token-capacity applies the busy thresholds above; none clears them. Router queueing remains controlled by --router-queue-threshold

Model Discovery

CLI ArgumentEnv VarDefaultDescription
--namespaceDYN_NAMESPACE—Exact namespace for model discovery scoping
--namespace-prefixDYN_NAMESPACE_PREFIX—Namespace prefix for discovery (e.g., ns matches ns, ns-abc123). Takes precedence over --namespace
--model-nameDYN_MODEL_NAME—Override model name string
--model-pathDYN_MODEL_PATH—Path to local model directory (for private/custom models)
--kv-cache-block-sizeDYN_KV_CACHE_BLOCK_SIZE—KV cache block size override

Infrastructure

CLI ArgumentEnv VarDefaultDescription
--discovery-backendDYN_DISCOVERY_BACKENDetcdService discovery: kubernetes, etcd, file, mem
--request-planeDYN_REQUEST_PLANEtcpRequest distribution: tcp (fastest), nats
--event-planeDYN_EVENT_PLANEautoEvent publishing: nats, zmq; defaults to zmq for file/mem discovery and nats for etcd/kubernetes

KServe gRPC

CLI ArgumentEnv VarDefaultDescription
--kserve-grpc-server / --no-kserve-grpc-serverDYN_KSERVE_GRPC_SERVERfalseStart KServe gRPC v2 server
--grpc-metrics-portDYN_GRPC_METRICS_PORT8788HTTP metrics port for gRPC service

See the Frontend Guide for KServe message formats and integration details.

Monitoring

CLI ArgumentEnv VarDefaultDescription
--metrics-prefixDYN_METRICS_PREFIXdynamo_frontendPrefix for frontend Prometheus metrics
--dump-config-toDYN_DUMP_CONFIG_TO—Dump resolved config to file path

Tokenizer

CLI ArgumentEnv VarDefaultDescription
--tokenizerDYN_TOKENIZERdefaultTokenizer: default (HuggingFace) or fastokens (high-performance Rust tokenizer). See Tokenizer

Experimental

CLI ArgumentEnv VarDefaultDescription
--enable-anthropic-apiDYN_ENABLE_ANTHROPIC_APIfalseEnable /v1/messages (Anthropic Messages API)
--dyn-chat-processorDYN_CHAT_PROCESSORdynamoChat processor: dynamo (default), vllm, or sglang. See Parser Configuration for how this combines with the parser flags.
--dyn-debug-perfDYN_DEBUG_PERFfalseLog per-function timing for preprocessing (vllm processor only)
--dyn-preprocess-workersDYN_PREPROCESS_WORKERS0Worker processes for CPU-bound preprocessing. 0 = main event loop (vllm processor only)
-i / --interactiveDYN_INTERACTIVEfalseInteractive text chat mode

HTTP Endpoints

The frontend exposes the following HTTP endpoints:

OpenAI-Compatible

MethodPathDescription
POST/v1/chat/completionsChat completions (streaming and non-streaming)
POST/v1/completionsText completions
POST/v1/embeddingsText embeddings
POST/v1/responsesResponses API
POST/v1/images/generationsImage generation
POST/v1/videos/generationsVideo generation
POST/v1/videos/generations/streamVideo generation (streaming)
GET/v1/modelsList available models

Anthropic (Experimental)

MethodPathDescription
POST/v1/messagesAnthropic Messages API (requires --enable-anthropic-api)
POST/v1/messages/count_tokensToken counting for Anthropic API

Infrastructure

MethodPathDescription
GET/healthHealth check
GET/liveLiveness check
GET/metricsPrometheus metrics
GET/openapi.jsonOpenAPI specification
GET/docsSwagger UI
POST/busy_thresholdSet busy thresholds
GET/busy_thresholdGet current busy thresholds

Endpoint Path Customization

All endpoint paths can be overridden via environment variables:

Env VarDefault Path
DYN_HTTP_SVC_CHAT_PATH_ENV/v1/chat/completions
DYN_HTTP_SVC_CMP_PATH_ENV/v1/completions
DYN_HTTP_SVC_EMB_PATH_ENV/v1/embeddings
DYN_HTTP_SVC_RESPONSES_PATH_ENV/v1/responses
DYN_HTTP_SVC_MODELS_PATH_ENV/v1/models
DYN_HTTP_SVC_ANTHROPIC_PATH_ENV/v1/messages
DYN_HTTP_SVC_HEALTH_PATH_ENV/health
DYN_HTTP_SVC_LIVE_PATH_ENV/live
DYN_HTTP_SVC_METRICS_PATH_ENV/metrics

Deprecated

CLI ArgumentEnv VarDescription
--router-durable-kv-eventsDYN_ROUTER_DURABLE_KV_EVENTSUse event-plane local indexer instead

See Also

  • Frontend Overview — quick start and feature matrix
  • Frontend Guide — KServe gRPC configuration
  • NVIDIA Request Extensions (nvext) — custom request fields
  • Configuration and Tuning — detailed routing configuration
  • Metrics — available Prometheus metrics
  • Fault Tolerance — request migration and rejection