Agent Context and Tracing

Attach workflow identity to agentic requests

Agent workloads are easier to debug when model calls and tool calls share a common workflow identity. Dynamo agent tracing provides that view without asking the harness to measure serving internals itself.

The harness adds lightweight workflow metadata to each LLM request and can publish tool lifecycle events over a local ZMQ socket. Dynamo then writes a single trace stream that combines harness-provided structure with Dynamo-owned request metrics such as token counts, timing, cache hit rate, queue depth, and worker placement.

This is passive observability. Agent context does not change routing, scheduling, or cache behavior.

Step 1: Enable Dynamo Trace Output

For most local profiling runs, use rotating compressed JSONL:

$export DYN_AGENT_TRACE_SINKS=jsonl_gz
$export DYN_AGENT_TRACE_OUTPUT_PATH=/tmp/dynamo-agent-trace

This writes files like:

/tmp/dynamo-agent-trace.000000.jsonl.gz
/tmp/dynamo-agent-trace.000001.jsonl.gz

To ingest harness tool events, also configure the local ZMQ endpoint that the harness will publish on:

$export DYN_AGENT_TRACE_TOOL_EVENTS_ZMQ_ENDPOINT=tcp://127.0.0.1:20390

Then start any Dynamo OpenAI-compatible backend.

Environment variable reference
| Environment Variable | Required | Default | Description |
| --- | --- | --- | --- |
| DYN_AGENT_TRACE_SINKS | Yes | unset | Enables local trace sinks. Supported values: jsonl, jsonl_gz, stderr, or a comma-separated list such as jsonl_gz,stderr. |
| DYN_AGENT_TRACE_OUTPUT_PATH | If jsonl or jsonl_gz is selected | unset | Local trace output path. For jsonl, this is the literal file path. For jsonl_gz, this is the segment prefix used to derive .jsonl.gz files. |
| DYN_AGENT_TRACE_CAPACITY | No | 1024 | In-process trace bus capacity. |
| DYN_AGENT_TRACE_JSONL_BUFFER_BYTES | No | 1048576 | JSONL writer buffer size. For jsonl_gz, this is the maximum uncompressed batch size before appending a complete gzip member. |
| DYN_AGENT_TRACE_JSONL_FLUSH_INTERVAL_MS | No | 1000 | JSONL periodic flush interval. For jsonl_gz, each flush appends a complete gzip member. |
| DYN_AGENT_TRACE_JSONL_GZ_ROLL_BYTES | No | 268435456 | jsonl_gz segment roll threshold in uncompressed bytes. |
| DYN_AGENT_TRACE_JSONL_GZ_ROLL_LINES | No | unset | Optional jsonl_gz segment roll threshold in records. |
| DYN_AGENT_TRACE_TOOL_EVENTS_ZMQ_ENDPOINT | No | unset | Local ZMQ endpoint for harness tool events. Setting this enables tool event ingestion. |
| DYN_AGENT_TRACE_TOOL_EVENTS_ZMQ_TOPIC | No | unset | Optional ZMQ topic filter for harness tool events. |

DYN_AGENT_TRACE_SINKS is the local output enable switch. Setting DYN_AGENT_TRACE_OUTPUT_PATH alone does not enable tracing. Setting only the ZMQ endpoint enables tool ingestion but does not create local files unless a sink is also configured.

Step 2: Add Context to LLM Calls

Each harness LLM call should include nvext.agent_context:

{
  "model": "my-model",
  "messages": [
    { "role": "user", "content": "Research Dynamo agent tracing." }
  ],
  "nvext": {
    "agent_context": {
      "workflow_type_id": "deep_research",
      "workflow_id": "research-run-42",
      "program_id": "research-run-42:researcher",
      "parent_program_id": "research-run-42:planner"
    }
  }
}

When using the OpenAI Python client, pass Dynamo’s extension fields through extra_body and set x-request-id through extra_headers:

import uuid


def instrument_llm_request(kwargs, agent_context):
    # Merge agent_context into extra_body.nvext without clobbering other
    # nvext fields the caller may have set.
    body = dict(kwargs.get("extra_body") or {})
    nvext = dict(body.get("nvext") or {})
    nvext["agent_context"] = dict(agent_context)
    body["nvext"] = nvext

    # Default x-request-id only when the caller has not set one.
    headers = dict(kwargs.get("extra_headers") or {})
    headers.setdefault("x-request-id", str(uuid.uuid4()))

    out = dict(kwargs)
    out["extra_body"] = body
    out["extra_headers"] = headers
    return out

x-request-id is the harness’s logical LLM-call ID. Dynamo copies it into request.x_request_id; it is separate from Dynamo’s internal request ID.
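
For example, the helper composes with an ordinary chat-completions call. This sketch repeats the helper so it stands alone; the commented-out client call is illustrative, not executed here:

```python
import uuid


def instrument_llm_request(kwargs, agent_context):
    # Merge agent_context into extra_body.nvext and default x-request-id.
    body = dict(kwargs.get("extra_body") or {})
    nvext = dict(body.get("nvext") or {})
    nvext["agent_context"] = dict(agent_context)
    body["nvext"] = nvext

    headers = dict(kwargs.get("extra_headers") or {})
    headers.setdefault("x-request-id", str(uuid.uuid4()))

    out = dict(kwargs)
    out["extra_body"] = body
    out["extra_headers"] = headers
    return out


kwargs = instrument_llm_request(
    {"model": "my-model", "messages": [{"role": "user", "content": "hi"}]},
    {
        "workflow_type_id": "deep_research",
        "workflow_id": "research-run-42",
        "program_id": "research-run-42:researcher",
    },
)
# client.chat.completions.create(**kwargs)  # hypothetical OpenAI client call
```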

| Field | Required | Meaning |
| --- | --- | --- |
| workflow_type_id | Yes | Reusable workload/profile class, such as deep_research or coding_agent. |
| workflow_id | Yes | Top-level run identifier. |
| program_id | Yes | One schedulable reasoning/tool trajectory. |
| parent_program_id | No | Parent program for subagents. |
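
These IDs can be derived mechanically. The sketch below follows the `<workflow_id>:<role>` naming used in this page's examples; the helper name and the naming convention are illustrative, not required by Dynamo:

```python
from typing import Optional


def make_agent_context(workflow_type_id: str, workflow_id: str, role: str,
                       parent_role: Optional[str] = None) -> dict:
    # Follows the "<workflow_id>:<role>" convention from the examples above.
    ctx = {
        "workflow_type_id": workflow_type_id,
        "workflow_id": workflow_id,
        "program_id": f"{workflow_id}:{role}",
    }
    if parent_role is not None:
        ctx["parent_program_id"] = f"{workflow_id}:{parent_role}"
    return ctx


ctx = make_agent_context("deep_research", "research-run-42", "researcher", "planner")
```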

Step 3: Send Tool Events to Dynamo

Harnesses bind a long-lived local ZMQ PUB socket and publish tool lifecycle records on the configured endpoint. Dynamo accepts tool_start, tool_end, and tool_error records from the harness and writes them to the same trace stream as LLM request records.

The ZMQ wire format is:

[topic, seq_be_u64, msgpack(AgentTraceRecord)]
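
The framing can be exercised without a live socket. This sketch shows the three-part frame, with a byte placeholder standing in for the msgpack-encoded record so the framing itself needs no third-party dependency:

```python
import struct


def frame(topic: bytes, seq: int, payload: bytes) -> list:
    # [topic, seq_be_u64, payload]; the real payload is msgpack(AgentTraceRecord).
    return [topic, struct.pack(">Q", seq), payload]


def unframe(parts: list) -> tuple:
    topic, seq_raw, payload = parts
    return topic, struct.unpack(">Q", seq_raw)[0], payload


parts = frame(b"tools", 7, b"<msgpack bytes>")
topic, seq, payload = unframe(parts)
```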

Use the same producer pattern as the KV event publishers in vLLM and SGLang: a bounded queue, a background publisher thread, monotonically increasing sequence numbers, and a PUB socket with a high-water mark. Plain ZMQ PUB/SUB is best-effort for early frames, so a terminal tool record should be self-contained with started_at_unix_ms, ended_at_unix_ms, and duration_ms. Keep tool_start for live/in-flight status, but do not require it to reconstruct completed spans.

Publisher Ownership

Most framework integrations should create one exporter per harness or runtime instance. In-process systems, such as callback or middleware integrations, can emit records directly into the root queued publisher.

If a harness runs tools or subagents in child processes, do not let each child bind the same ZMQ endpoint. Keep the root process as the only network publisher and forward child records to it over the framework event bus, a multiprocessing queue, or a local collector. The child should forward the same normalized AgentTraceRecord; the parent handles ZMQ framing and sequence numbers.

in-process callbacks / tool wrappers
-> root queued publisher -> ZMQ PUB -> Dynamo relay
child process tools / subagents
-> process queue or event bus -> root queued publisher -> ZMQ PUB -> Dynamo relay
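
The fan-in above can be sketched with an in-process queue standing in for the multiprocessing queue or event bus; RootPublisher here is a stub for the real queued publisher, not a Dynamo API:

```python
import queue


class RootPublisher:
    # Stub for the root queued publisher, which owns ZMQ framing and
    # sequence numbers in the real harness.
    def __init__(self):
        self.published = []

    def publish(self, record: dict):
        self.published.append(record)


def forward_child_records(child_queue, root: RootPublisher):
    # Drain normalized AgentTraceRecords from children until the sentinel.
    # In a real harness this runs on a thread reading a multiprocessing.Queue.
    while True:
        record = child_queue.get()
        if record is None:
            break
        root.publish(record)


q = queue.Queue()
q.put({"event_type": "tool_end", "tool": {"tool_call_id": "call-abc"}})
q.put(None)  # sentinel: child is done
root = RootPublisher()
forward_child_records(q, root)
```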

A compact publisher implementation is included below for harness authors who need a reference.

Compact Python publisher
import atexit
import queue
import struct
import threading

import msgpack
import zmq


class ZmqToolEventPublisher:
    def __init__(self, endpoint: str, topic: str = ""):
        self.topic = topic.encode("utf-8")
        self.seq = 0
        self.queue = queue.Queue(maxsize=100_000)
        self.socket = zmq.Context.instance().socket(zmq.PUB)
        self.socket.set_hwm(100_000)
        self.socket.bind(endpoint)
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()
        atexit.register(self.shutdown)

    def publish(self, record: dict):
        # Serialize on the caller's thread; put_nowait raises queue.Full
        # under sustained backlog, so callers may catch and drop.
        payload = msgpack.packb(record, use_bin_type=True)
        self.queue.put_nowait(payload)

    def _run(self):
        # The background thread owns the socket and the sequence counter.
        while True:
            payload = self.queue.get()
            if payload is None:
                break
            seq = self.seq
            self.seq += 1
            self.socket.send_multipart([self.topic, struct.pack(">Q", seq), payload])
            self.queue.task_done()

    def shutdown(self):
        self.queue.put_nowait(None)
        self.thread.join(timeout=1.0)
        self.socket.close(linger=0)

The record must include agent_context. Tool events should use the same workflow_type_id, workflow_id, and program_id as the surrounding LLM calls; include parent_program_id for subagent tools when it is available. Dynamo uses these fields to group request and tool records into the same workflow/program lanes.

{
  "schema": "dynamo.agent.trace.v1",
  "event_type": "tool_end",
  "event_time_unix_ms": 1777312801500,
  "event_source": "harness",
  "agent_context": {
    "workflow_type_id": "deep_research",
    "workflow_id": "research-run-42",
    "program_id": "research-run-42:researcher"
  },
  "tool": {
    "tool_call_id": "call-abc",
    "tool_class": "web_search",
    "status": "succeeded",
    "started_at_unix_ms": 1777312801080,
    "ended_at_unix_ms": 1777312801500,
    "duration_ms": 420.0
  }
}
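
A terminal record like the one above can be assembled by a small helper that derives duration_ms from the timestamps; the helper name is illustrative:

```python
def make_tool_end(agent_context: dict, tool_call_id: str, tool_class: str,
                  status: str, started_at_unix_ms: int, ended_at_unix_ms: int) -> dict:
    # Terminal records are self-contained: they carry start, end, and
    # duration so completed spans survive a lost tool_start frame.
    return {
        "schema": "dynamo.agent.trace.v1",
        "event_type": "tool_end",
        "event_time_unix_ms": ended_at_unix_ms,
        "event_source": "harness",
        "agent_context": dict(agent_context),
        "tool": {
            "tool_call_id": tool_call_id,
            "tool_class": tool_class,
            "status": status,
            "started_at_unix_ms": started_at_unix_ms,
            "ended_at_unix_ms": ended_at_unix_ms,
            "duration_ms": float(ended_at_unix_ms - started_at_unix_ms),
        },
    }


record = make_tool_end(
    {"workflow_type_id": "deep_research", "workflow_id": "research-run-42",
     "program_id": "research-run-42:researcher"},
    "call-abc", "web_search", "succeeded", 1777312801080, 1777312801500,
)
```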

The runtime event-plane hop is internal to Dynamo. Harnesses should publish to the ZMQ endpoint, not directly to Dynamo’s event plane.

Step 4: Inspect the Trace

Read compressed trace records directly:

$gzip -cd "${DYN_AGENT_TRACE_OUTPUT_PATH}".*.jsonl.gz | jq .

Each line is a recorder envelope:

{ "timestamp": 1234, "event": { "schema": "dynamo.agent.trace.v1" } }
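
For quick offline analysis, the envelopes can also be read back in Python. This sketch decompresses segments and orders events by event_time_unix_ms, assuming only the envelope shape shown above:

```python
import gzip
import json


def read_trace(paths):
    # Each JSONL line is a recorder envelope: {"timestamp": ..., "event": {...}}.
    # jsonl_gz segments are concatenated gzip members, which gzip.open
    # decompresses transparently.
    records = []
    for path in paths:
        with gzip.open(path, "rt") as f:
            for line in f:
                records.append(json.loads(line)["event"])
    # Order by event time, not by line order: tool records can land late.
    records.sort(key=lambda r: r.get("event_time_unix_ms", 0))
    return records
```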

Convert traces to Chrome Trace JSON for Perfetto UI:

$uv run --no-project python benchmarks/agent_trace/convert_to_perfetto.py \
> "${DYN_AGENT_TRACE_OUTPUT_PATH}".*.jsonl.gz \
> --output "${DYN_AGENT_TRACE_OUTPUT_PATH}.perfetto.json"

Open ${DYN_AGENT_TRACE_OUTPUT_PATH}.perfetto.json in Perfetto UI. Each LLM request becomes a timeline slice grouped by workflow and program lane. Tool terminal records become tool slices on adjacent tool tracks. The converter prefers explicit started_at_unix_ms/ended_at_unix_ms, falls back to duration_ms, then pairs with the matching tool_start record when present.

Useful converter flags:

| Flag | Meaning |
| --- | --- |
| --include-markers | Emit first-token instant markers. |
| --no-stages | Show request slices without prefill/decode stage slices. |
| --separate-stage-tracks | Place prefill/decode stages on adjacent tracks for debugging timeline nesting. |

Harness Integration Patterns

An existing harness does not need to import Dynamo packages or link against Dynamo runtime APIs. Framework integrations should use this shape:

  • Add a small helper module that stores the current agent_context in a context variable.
  • Wrap each agent run with that context so LLM calls and tool records share the same workflow_id and program_id.
  • Call one helper before each OpenAI-compatible LLM request to merge extra_body.nvext.agent_context and set x-request-id.
  • For LangGraph/LangChain-style in-process runtimes, implement callbacks or middleware that emit directly to the root publisher.
  • Emit tool_start and a terminal tool_end or tool_error wherever the harness executes model-requested tools. Include started_at_unix_ms, ended_at_unix_ms, and duration_ms on terminal records so completed spans survive best-effort PUB/SUB startup loss.
  • Propagate context through thread pools, subprocesses, and subagent launches when those paths can make LLM calls or emit tool records.
  • Register a queued ZMQ publisher at process startup when tool tracing is enabled.
  • If tools or subagents run in subprocesses, forward normalized tool records back to the root publisher instead of binding another ZMQ endpoint.

You do not need custom code in every tool implementation when existing tool calls already pass through shared harness code. Add explicit hooks only for paths that bypass that flow, such as direct OpenAI calls inside a tool, background executor work that loses context variables, or subagent launches that need parent_program_id.
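
The context-variable helper from the first bullet can be as small as this sketch; the names are illustrative and not part of any Dynamo package:

```python
import contextlib
import contextvars
from typing import Optional

# Holds the active agent_context for the current task or thread of execution.
_agent_context = contextvars.ContextVar("agent_context", default=None)


@contextlib.contextmanager
def agent_run(workflow_type_id: str, workflow_id: str, program_id: str,
              parent_program_id: Optional[str] = None):
    # Wrap one agent run so LLM calls and tool records made inside it
    # observe the same workflow_id / program_id.
    ctx = {
        "workflow_type_id": workflow_type_id,
        "workflow_id": workflow_id,
        "program_id": program_id,
    }
    if parent_program_id is not None:
        ctx["parent_program_id"] = parent_program_id
    token = _agent_context.set(ctx)
    try:
        yield ctx
    finally:
        _agent_context.reset(token)


def current_agent_context():
    return _agent_context.get()
```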

That keeps the harness dependency boundary simple:

harness code knows:
- workflow_id / program_id
- x-request-id
- tool start/end/error
- local ZMQ endpoint
Dynamo code knows:
- request timing
- token counts
- cache metrics
- worker placement
- trace sinks

End-to-End Example with ms-agent

The ms-agent integration currently lives on Ishan’s fork. Install the fork in editable mode:

$git clone https://github.com/ishandhanani/ms-agent.git
$cd ms-agent
$git checkout idhanani/dynamo-agent-trace
$
$uv venv .venv
$source .venv/bin/activate
$uv pip install -r requirements/research.txt
$uv pip install -e .

Start Dynamo with trace sinks and the tool-event relay enabled:

$export DYN_AGENT_TRACE_SINKS=jsonl_gz
$export DYN_AGENT_TRACE_OUTPUT_PATH=/tmp/dynamo-ms-agent-trace
$export DYN_AGENT_TRACE_TOOL_EVENTS_ZMQ_ENDPOINT=tcp://127.0.0.1:20390
$
$# Start a Dynamo OpenAI-compatible backend in this environment.

Point ms-agent at the Dynamo frontend from a second shell:

$cd ms-agent
$source .venv/bin/activate
$
$export OPENAI_BASE_URL=http://127.0.0.1:8000/v1
$export OPENAI_API_KEY=unused
$export DYN_AGENT_WORKFLOW_TYPE_ID=ms_agent
$export DYN_AGENT_WORKFLOW_ID=ms-agent-$(date +%s)
$export DYN_AGENT_TOOL_EVENTS_ZMQ_ENDPOINT=tcp://127.0.0.1:20390

Use DYN_AGENT_TRACE_* variables for the Dynamo runtime and DYN_AGENT_* variables for the ms-agent harness process.

The fork automatically attaches nvext.agent_context and x-request-id to ms-agent OpenAI-compatible LLM calls while an agent context is active. When DYN_AGENT_TOOL_EVENTS_ZMQ_ENDPOINT is set, the ms-agent CLI also binds a ZMQ PUB socket and publishes tool lifecycle records to Dynamo’s tool-event relay. Shared tool execution paths publish directly to that root publisher; agent_tools subprocesses forward normalized tool records back to the root process, so subprocess isolation remains enabled without each child binding the endpoint. Python entrypoints that do not use the CLI lazily initialize the same publisher on the first tool event.

For DeepResearch v2, keep the normal ms-agent setup: configure OPENAI_BASE_URL, OPENAI_API_KEY, search keys such as EXA_API_KEY, and the model names in projects/deep_research/v2/*.yaml. Then run the workflow from the fork root:

--trust_remote_code true is security-sensitive. Use it only with trusted repositories and configs.

$PYTHONPATH=. uv run --active --no-sync python ms_agent/cli/cli.py run \
> --config projects/deep_research/v2/researcher.yaml \
> --query "Write your research question here" \
> --trust_remote_code true \
> --output_dir output/deep_research/runs

The CLI path captures Dynamo LLM request records through the forked ms-agent OpenAI wrappers and publishes tool events from shared ms-agent tool execution paths.

Record Semantics

Dynamo emits request_end after the response stream completes or is dropped. Nullable fields are omitted when the serving path did not record them.

{
  "schema": "dynamo.agent.trace.v1",
  "event_type": "request_end",
  "event_time_unix_ms": 1777312801000,
  "event_source": "dynamo",
  "agent_context": {
    "workflow_type_id": "deep_research",
    "workflow_id": "research-run-42",
    "program_id": "research-run-42:researcher",
    "parent_program_id": "research-run-42:planner"
  },
  "request": {
    "request_id": "dynamo-request-id",
    "x_request_id": "llm-call-42",
    "model": "my-model",
    "input_tokens": 4096,
    "output_tokens": 512,
    "cached_tokens": 3584,
    "request_received_ms": 1777312800000,
    "prefill_wait_time_ms": 12.1,
    "prefill_time_ms": 70.3,
    "ttft_ms": 82.4,
    "total_time_ms": 1000.1,
    "avg_itl_ms": 1.8,
    "kv_hit_rate": 0.875,
    "kv_transfer_estimated_latency_ms": 4.2,
    "queue_depth": 3,
    "worker": {
      "prefill_worker_id": 0,
      "prefill_dp_rank": 0,
      "decode_worker_id": 1,
      "decode_dp_rank": 0
    }
  }
}

Request records capture Dynamo-owned serving metrics:

| Field | Meaning |
| --- | --- |
| request_id | Dynamo request ID for the LLM call. |
| x_request_id | Caller-provided logical request ID when present. |
| model | Requested model name. |
| input_tokens | Prompt/input token count when known. |
| output_tokens | Final output token count when known. |
| cached_tokens | Prompt tokens served from prefix/KV cache when known. |
| request_received_ms | Request receive time in Unix epoch milliseconds. |
| prefill_wait_time_ms | Time from request receipt to prefill start. |
| prefill_time_ms | Time from prefill start to first token. |
| ttft_ms | Time from request receipt to first token. |
| total_time_ms | Time from request receipt to request completion. |
| avg_itl_ms | Average inter-token latency after first token. |
| kv_hit_rate | Effective KV-cache hit rate observed by the router. |
| kv_transfer_estimated_latency_ms | Upper-bound estimated disaggregated KV transfer latency. |
| queue_depth | Router queue depth observed when routing the request. |
| worker | Prefill/decode worker IDs and DP ranks when recorded. |
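
The timing and cache fields compose by definition (ttft_ms spans prefill_wait_time_ms plus prefill_time_ms, and kv_hit_rate tracks cached_tokens over input_tokens), which makes quick sanity checks possible. This sketch replays those relations against the example record's values; each field is still measured independently by the serving path, so treat these as approximate checks:

```python
# Values from the example request_end record above.
request = {
    "input_tokens": 4096, "cached_tokens": 3584,
    "prefill_wait_time_ms": 12.1, "prefill_time_ms": 70.3,
    "ttft_ms": 82.4, "kv_hit_rate": 0.875,
}

# ttft is receipt -> first token, i.e. wait before prefill plus prefill itself.
ttft_gap = abs(request["prefill_wait_time_ms"] + request["prefill_time_ms"]
               - request["ttft_ms"])

# kv_hit_rate should track the cached share of the prompt.
hit_rate_gap = abs(request["cached_tokens"] / request["input_tokens"]
                   - request["kv_hit_rate"])
```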

Trace records do not include prompt/response content, sampling parameters, finish reason, or error status. Use the audit sink for request/response payload capture and OpenTelemetry export for span-based observability.

Consistency Model

Trace output is best-effort profiling data, not durable audit data. Dynamo writes LLM request records and harness tool records into the same trace stream, but it does not commit them transactionally.

Delayed tool records are expected. Each normalized record carries event_time_unix_ms, and offline consumers should order records by event time rather than by JSONL line order. The Perfetto converter does this before rendering request and tool slices.

The trace file does not prove completeness. Records can be absent if Dynamo exits before sink workers drain, if the trace bus or sink lags and drops records, or if the ZMQ/event-plane path drops a harness event.

Current Scope

  • Agent context is passive metadata.
  • Agent request trace emission is currently wired for /v1/chat/completions.
  • Supported sinks are jsonl, jsonl_gz, and stderr.
  • Tool events enter through the Dynamo-owned ZMQ relay.
  • Dynamo does not expose a separate direct event-plane ingress path for harness tool events.
  • Future scheduler/profiler consumers should read the normalized trace bus.