NVIDIA Request Extensions (nvext)
NVIDIA Request Extensions (nvext)
nvext is a top-level JSON object on the request body that provides NVIDIA-specific extensions to the OpenAI-compatible API. nvext fields are consumed by the Dynamo frontend, preprocessor, router, and backend workers to control routing, preprocessing, response metadata, scheduling, and engine-level priority.
Usage
Include nvext as a top-level field alongside standard OpenAI-compatible fields:
Field Reference
Related root-level Dynamo output option:
return_tokens_as_token_ids only changes returned logprob token display. To stop on
token IDs, pass integer IDs in the normal stop array, for example
"stop": [576]. Strings such as "token_id:576" remain literal string stop
sequences and are not parsed as token IDs.
Header overrides
Routing fields can also be set via HTTP headers, which take priority over nvext values:
The unprefixed forms (x-worker-instance-id, x-prefill-instance-id, x-dp-rank,
x-data-parallel-rank, and x-prefill-dp-rank) are compatibility aliases planned for future
deprecation. Use the x-dynamo-* headers for new integrations.
Session identity is header-only. Use the coding-agent headers or Dynamo
session headers described in Session IDs;
nvext does not accept session identity fields.
When session affinity is enabled with --router-session-affinity-ttl-secs, the
router uses X-Dynamo-Session-ID for immutable endpoint- and phase-scoped affinity.
On etcd and shared FileStore, replicas coordinate through a distributed claim while
the request hot path uses a process-local cache. Existing local or shared bindings
override routing headers; the headers above are proposals only when no binding exists.
Memory and Kubernetes discovery do not provide cross-process affinity. See
Configuration and Tuning for
claim lifetime, cache TTL, terminal close, and failure behavior.
For trace sink configuration and JSONL schema details, see Agent Tracing.
Agent Hints
The agent_hints sub-object carries per-request hints that the router uses for scheduling, load balancing, and KV cache optimization.
priority
priority is the cross-layer scheduling hint. Higher values mean “more
important” across Dynamo.
When --router-queue-threshold is set and the queue is active, higher-priority requests are shifted earlier in the router queue. Once dispatched, Dynamo forwards the same semantic priority to the backend engine for queue ordering, preemption, and KV cache eviction. Dynamo normalizes backend-specific polarity internally, including vLLM’s lower-is-higher convention.
For layer-by-layer behavior and backend requirements, see Priority Scheduling.
strict_priority
strict_priority is an unsigned router-only tier for requests waiting in a
router scheduler queue. The queue orders requests by
(strict_priority, configured_policy_key), so FCFS, LCFS, or WSPT still orders
requests within the same tier.
This field does not change backend engine priority, preempt running work, or provide ordering across router replicas. It also does not prevent an eligible new arrival from being admitted directly while other requests are parked.
osl
Expected output sequence length — the estimated number of output tokens the request will generate. The router uses this hint in two ways:
- Output block tracking: When
--router-track-output-blocksis enabled, the router adds placeholder blocks during generation and applies fractional decay based on progress towardosl. - Resource estimation: Helps the router estimate total resource requirements when making routing decisions.
speculative_prefill
When set to true, the system speculatively prefills the predicted next-turn prompt after the current assistant turn completes. This is designed for multi-turn agentic workloads where the next request’s prefix is predictable.
How it works:
- As the assistant response streams, the system accumulates the full response text.
- Once the response finishes, a background task constructs the next-turn prompt by appending the assistant response to the conversation history (with thinking content stripped for non-last turns).
- The constructed prompt is tokenized and sent as a
max_tokens=1request to warm the KV cache on a worker. - When the actual next request arrives, it benefits from the already-warm KV cache, reducing TTFT.
Backend details:
- SGLang: Requires
--enable-priority-schedulingfor queue ordering and--radix-eviction-policy priorityfor priority-based eviction. - vLLM: Requires
--scheduling-policy priority. - TensorRT-LLM: Does not currently support per-request priority.
Response Extensions
When the client requests response metadata via extra_fields, the response includes an nvext object with the requested fields: