
Copyright © 2026, NVIDIA Corporation.


Agent Hints

Per-request serving hints for agentic workloads


Agent hints are optional per-request metadata that a harness sends under nvext.agent_hints. Dynamo parses these hints in the frontend and passes them to the router and, where supported, to backend runtimes.

Use hints only to express intent that affects serving: routing, scheduling, and cache behavior. For passive trace identity, use nvext.agent_context instead.

Request Schema

{
  "model": "my-model",
  "messages": [
    { "role": "user", "content": "Continue the report." }
  ],
  "nvext": {
    "agent_hints": {
      "priority": 5,
      "osl": 1024,
      "speculative_prefill": true
    }
  }
}
| Hint | Description |
| --- | --- |
| priority | Unified request priority. Higher values move the request earlier in the router queue and are forwarded to backends that support priority scheduling or eviction. |
| osl | Expected output sequence length in tokens. Used by the router for output block tracking and load-balancing accuracy when --router-track-output-blocks is enabled. |
| speculative_prefill | When true, Dynamo can prefill the predicted next-turn prefix after the current turn completes to warm the KV cache for the next request. |
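
As a minimal sketch of how a harness might assemble this request body in Python, the helper below attaches only the hints that were explicitly set. The function name build_chat_request and its parameters are illustrative assumptions, not part of Dynamo; only the nvext.agent_hints layout and the three hint names come from the schema above.

```python
import json
from typing import Any, Optional


def build_chat_request(
    model: str,
    user_content: str,
    priority: Optional[int] = None,
    osl: Optional[int] = None,
    speculative_prefill: Optional[bool] = None,
) -> dict[str, Any]:
    """Build a chat request body (hypothetical helper, not a Dynamo API)."""
    hints = {
        "priority": priority,
        "osl": osl,
        "speculative_prefill": speculative_prefill,
    }
    # Drop unset hints so the request carries only explicit intent.
    hints = {name: value for name, value in hints.items() if value is not None}

    body: dict[str, Any] = {
        "model": model,
        "messages": [{"role": "user", "content": user_content}],
    }
    # Hints are optional, so omit the nvext block entirely when none are set.
    if hints:
        body["nvext"] = {"agent_hints": hints}
    return body


request_body = build_chat_request(
    "my-model",
    "Continue the report.",
    priority=5,
    osl=1024,
    speculative_prefill=True,
)
print(json.dumps(request_body, indent=2))
```

Because hints are optional, calling the helper without any hint arguments produces a plain request with no nvext block at all.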

Request Flow

The frontend parses nvext.agent_hints, the router uses hints for queueing and worker selection, and supported backends use forwarded hints for engine-level scheduling and cache policy.
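
To make the queueing step concrete, the toy model below reads priority from nvext.agent_hints and dequeues higher values first. It is an illustrative sketch only and not Dynamo's router: the ToyPriorityQueue class, the "id" field on requests, the default priority of 0 for requests without a hint, and first-in-first-out ordering among equal priorities are all assumptions made for the example.

```python
import heapq
import itertools
from typing import Any


class ToyPriorityQueue:
    """Illustrative queue: higher `priority` dequeues first, FIFO among ties."""

    def __init__(self) -> None:
        self._heap: list[tuple[int, int, dict[str, Any]]] = []
        self._arrival = itertools.count()

    def push(self, request: dict[str, Any]) -> None:
        hints = request.get("nvext", {}).get("agent_hints", {})
        priority = hints.get("priority", 0)  # default of 0 is an assumption
        # heapq is a min-heap, so negate priority to pop the highest first.
        # The arrival counter breaks ties and keeps dicts from being compared.
        heapq.heappush(self._heap, (-priority, next(self._arrival), request))

    def pop(self) -> dict[str, Any]:
        return heapq.heappop(self._heap)[2]


queue = ToyPriorityQueue()
queue.push({"id": "a"})  # no hints: falls back to the assumed default
queue.push({"id": "b", "nvext": {"agent_hints": {"priority": 5}}})
queue.push({"id": "c", "nvext": {"agent_hints": {"priority": 5}}})

order = [queue.pop()["id"] for _ in range(3)]
print(order)  # ['b', 'c', 'a']
```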

Backend Support

Backend support is runtime-specific. For SGLang flags and behavior, see SGLang for Agentic Workloads.

| Feature | vLLM | SGLang | TensorRT-LLM |
| --- | --- | --- | --- |
| Priority-aware routing | Yes | Yes | Yes |
| Priority-based cache eviction | Planned | Yes | Planned |
| Speculative prefill | Yes | Yes | Yes |
| Subagent KV isolation with session control | No | Experimental | No |
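
A harness that targets several backends may want to consult this matrix in code. The sketch below simply transcribes the table into a dictionary and filters by status. The SUPPORT constant and the backends_with helper are hypothetical names for illustration, and the data is a snapshot of the table above rather than something Dynamo exposes at runtime.

```python
# Snapshot of the support table above; update it by hand if the table changes.
SUPPORT = {
    "Priority-aware routing": {
        "vLLM": "Yes", "SGLang": "Yes", "TensorRT-LLM": "Yes",
    },
    "Priority-based cache eviction": {
        "vLLM": "Planned", "SGLang": "Yes", "TensorRT-LLM": "Planned",
    },
    "Speculative prefill": {
        "vLLM": "Yes", "SGLang": "Yes", "TensorRT-LLM": "Yes",
    },
    "Subagent KV isolation with session control": {
        "vLLM": "No", "SGLang": "Experimental", "TensorRT-LLM": "No",
    },
}


def backends_with(feature: str, statuses: frozenset = frozenset({"Yes"})) -> list:
    """Return backends whose status for `feature` is in `statuses` (hypothetical helper)."""
    return [
        backend
        for backend, status in SUPPORT[feature].items()
        if status in statuses
    ]


print(backends_with("Priority-based cache eviction"))  # ['SGLang']
```

Whether "Experimental" should count as available is a policy choice for the harness; pass it explicitly in statuses if so.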

Related Request Extensions

agent_hints is separate from agent_context:

  • agent_context is passive identity for traces and joins.
  • agent_hints is active serving intent for routing, scheduling, and cache behavior.

Session-control metadata for SGLang subagent KV isolation lives under nvext.session_control; see NVIDIA Request Extensions.