NVIDIA Request Extensions (nvext) | NVIDIA Dynamo Documentation

nvext is a top-level JSON object on the request body that provides NVIDIA-specific extensions to the OpenAI-compatible API. nvext fields are consumed by the Dynamo frontend, preprocessor, router, and backend workers to control routing, preprocessing, response metadata, scheduling, and engine-level priority.

Usage

Include nvext as a top-level field alongside standard OpenAI-compatible fields:

1 {
2     "model": "my-model",
3     "messages": [{"role": "user", "content": "Hello"}],
4     "nvext": {
5         "greed_sampling": true,
6         "extra_fields": ["worker_id", "timing"],
7         "agent_hints": {
8             "osl": 1024,
9             "priority": 5,
10             "strict_priority": 1
11         }
12     }
13 }

Field Reference

Field	Type	Default	Consumed By	Description
`greed_sampling`	`bool`	`None`	Preprocessor	Forces greedy sampling regardless of other sampling parameters.
`use_raw_prompt`	`bool`	`None`	Preprocessor	Bypasses the prompt template and passes the prompt directly to the tokenizer.
`annotations`	`string[]`	`None`	Preprocessor	Triggers out-of-band information in the SSE stream via the `event:` field.
`backend_instance_id`	`u64`	`None`	Router	Routes the request to a specific backend instance.
`token_data`	`u32[]`	`None`	Preprocessor	Pre-tokenized prompt tokens. When provided with `backend_instance_id`, tokenization is skipped.
`max_thinking_tokens`	`u32`	`None`	Backend	Maximum thinking tokens allowed (passed through to backends).
`extra_fields`	`string[]`	`None`	Response builder	Fields to include in the response `nvext`. Supported: `"worker_id"`, `"timing"`, `"routed_experts"`, `"engine_data"`, `"stop_reason"`.
`prefill_worker_id`	`u64`	`None`	Router	Routes the request to a specific prefill worker (disaggregated serving).
`decode_worker_id`	`u64`	`None`	Router	Routes the request to a specific decode worker (disaggregated serving).
`agent_context`	object	`None`	Preprocessor	Passive session and trajectory identity for agent traces. See Agent Context below and Agent Tracing.
`agent_hints`	object	`None`	Router	Per-request hints for scheduling and load balancing. See Agent Hints.
`session_control`	object	`None`	Router	Session lifecycle and sticky routing for subagent KV isolation. See Session Control.

Related root-level Dynamo output option:

Field	Type	Default	Consumed By	Description
`return_tokens_as_token_ids`	`bool`	`false`	Response builder	Formats logprob token strings as `token_id:<id>` instead of decoded text.

return_tokens_as_token_ids only changes returned logprob token display. To stop on token IDs, pass integer IDs in the normal stop array, for example "stop": [576]. Strings such as "token_id:576" remain literal string stop sequences and are not parsed as token IDs.

Header Overrides

Routing fields can also be set via HTTP headers, which take priority over nvext values:

Header	Overrides
`x-worker-instance-id`	`backend_instance_id` and `decode_worker_id`
`x-prefill-instance-id`	`prefill_worker_id`

Agent Context

The agent_context sub-object carries passive session and trajectory identity for agentic requests. Dynamo uses this metadata to emit request traces when the agent trace sink is enabled. It does not change routing, scheduling, or cache behavior.

Field	Type	Required	Description
`session_type_id`	`string`	Yes	Reusable profile or agent class label.
`session_id`	`string`	Yes	Top-level agent run/session identifier.
`trajectory_id`	`string`	Yes	One schedulable reasoning/tool trajectory.
`parent_trajectory_id`	`string`	No	Parent trajectory, typically for subagents.

1 {
2     "nvext": {
3         "agent_context": {
4             "session_type_id": "deep_research",
5             "session_id": "research-run-42",
6             "trajectory_id": "research-run-42:researcher",
7             "parent_trajectory_id": "research-run-42:planner"
8         }
9     }
10 }

For identity semantics, trace sink configuration, and JSONL schema details, see Agent Tracing.

Agent Hints

The agent_hints sub-object carries per-request hints that the router uses for scheduling, load balancing, and KV cache optimization.

Field	Type	Default	Description
`priority`	`i32`	`None`	Unified soft request priority. Used for router policy scoring and backend scheduling/eviction.
`strict_priority`	`u32`	`None`	Router pending-queue tier. Higher values always precede lower values. Unset is equivalent to `0`.
`osl`	`u32`	`None`	Expected output sequence length (tokens). Used for output block tracking and resource estimation.
`speculative_prefill`	`bool`	`false`	When `true`, speculatively prefills the predicted next-turn prompt after the current turn completes to warm the KV cache.

`priority`

priority is the cross-layer scheduling hint. Higher values mean “more important” across Dynamo.

When --router-queue-threshold is set and the queue is active, higher-priority requests are shifted earlier in the router queue. Once dispatched, Dynamo forwards the same semantic priority to the backend engine for queue ordering, preemption, and KV cache eviction. Dynamo normalizes backend-specific polarity internally, including vLLM’s lower-is-higher convention.

For layer-by-layer behavior and backend requirements, see Priority Scheduling.

1 {
2     "nvext": {
3         "agent_hints": {
4             "priority": 5
5         }
6     }
7 }

`strict_priority`

strict_priority is an unsigned router-only tier for requests waiting in a router scheduler queue. The queue orders requests by (strict_priority, configured_policy_key), so FCFS, LCFS, or WSPT still orders requests within the same tier.

This field does not change backend engine priority, preempt running work, or provide ordering across router replicas. It also does not prevent an eligible new arrival from being admitted directly while other requests are parked.

1 {
2     "nvext": {
3         "agent_hints": {
4             "strict_priority": 2
5         }
6     }
7 }

`osl`

Expected output sequence length — the estimated number of output tokens the request will generate. The router uses this hint in two ways:

Output block tracking: When --router-track-output-blocks is enabled, the router adds placeholder blocks during generation and applies fractional decay based on progress toward osl.
Resource estimation: Helps the router estimate total resource requirements when making routing decisions.

1 {
2     "nvext": {
3         "agent_hints": {
4             "osl": 1024
5         }
6     }
7 }

`speculative_prefill`

When set to true, the system speculatively prefills the predicted next-turn prompt after the current assistant turn completes. This is designed for multi-turn agentic workloads where the next request’s prefix is predictable.

How it works:

As the assistant response streams, the system accumulates the full response text.
Once the response finishes, a background task constructs the next-turn prompt by appending the assistant response to the conversation history (with thinking content stripped for non-last turns).
The constructed prompt is tokenized and sent as a max_tokens=1 request to warm the KV cache on a worker.
When the actual next request arrives, it benefits from the already-warm KV cache, reducing TTFT.

1 {
2     "nvext": {
3         "agent_hints": {
4             "speculative_prefill": true
5         }
6     }
7 }

Backend details:

SGLang: Requires --enable-priority-scheduling for queue ordering and --radix-eviction-policy priority for priority-based eviction.
vLLM: Requires --scheduling-policy priority.
TensorRT-LLM: Does not currently support per-request priority.

1 {
2     "nvext": {
3         "agent_hints": {
4             "priority": 5
5         }
6     }
7 }

Session Control

session_control enables sticky routing by session_id. Use action: "bind" for router-only sticky affinity without backend engine RPCs. Use action: "open" / "close" for backend streaming-session lifecycle when the engine supports it.

Field	Type	Default	Description
`session_control.session_id`	`string`	—	Unique session identifier. Present on every turn.
`session_control.action`	`string`	omitted	Optional action: `"bind"`, `"open"`, or `"close"`. Omit on intermediate turns.
`session_control.timeout`	`integer`	`300`	Inactivity timeout in seconds. Used with `action: "bind"` and `action: "open"`.

1 {
2     "nvext": {
3         "session_control": {
4             "session_id": "subagent-1",
5             "action": "open",
6             "timeout": 300
7         }
8     }
9 }

Requires --router-mode=kv on the frontend. Router-only sticky routing uses action: "bind" and does not require backend session support. Engine-backed session lifecycle requires backend support; see SGLang for Agentic Workloads for SGLang streaming-session setup details.

Response Extensions

When the client requests response metadata via extra_fields, the response includes an nvext object with the requested fields:

Field	Requested Via	Description
`worker_id`	`extra_fields: ["worker_id"]`	Prefill/decode worker IDs and data parallel ranks that processed the request.
`timing`	`extra_fields: ["timing"]`	Per-request timing information (TTFT, ITL, queue time, etc.).
`routed_experts`	`extra_fields: ["routed_experts"]`	Routed expert capture payload returned by SGLang-backed requests.
`engine_data`	`extra_fields: ["engine_data"]`	Opaque backend-provided engine metadata.
`stop_reason`	`extra_fields: ["stop_reason"]`	Backend-specific matched stop condition, returned under `nvext` because it is not part of the OpenAI completions schema. Dynamo currently serves this as a response-level field for single-choice requests; supporting `n > 1` will require an indexed per-choice shape.
`token_ids`	Automatic (GAIE Stage 1)	Tokenized prompt for reuse in Stage 2 query-only mode.

Example response `nvext`

1 {
2     "nvext": {
3         "worker_id": {
4             "prefill_worker_id": 1,
5             "prefill_dp_rank": 0,
6             "decode_worker_id": 2,
7             "decode_dp_rank": 0
8         },
9         "timing": {
10             "ttft_ms": 45.2,
11             "itl_ms": 12.1
12         }
13     }
14 }

Document	Description
Frontend Guide	KServe gRPC configuration and integration
Configuration and Tuning	Full router configuration and CLI arguments
Agent Tracing	Passive session/trajectory identity, JSONL request traces, and harness tool-event ingestion
Agent Hints	Per-request serving hints for routing, scheduling, and cache behavior
SGLang for Agentic Workloads	SGLang engine flags for priority scheduling, eviction policies, and session control