DynoSim Sweeps
A DynoSim sweep runs many simulated trials across candidate topologies, router settings, and timing-model inputs, then ranks the results against SLA constraints and GPU budget. Use sweeps when a single DynoSim run is not enough and you want to search the design space before validating on real GPUs.
The current Python API is dynamo.profiler.utils.replay_optimize. The docs use “DynoSim sweep” as the product term while keeping the existing implementation name for now.
What It Answers
A sweep answers a concrete deployment question:
- given a fixed GPU budget
- for a workload with prefix overlap
- and latency SLAs that still permit meaningful throughput
which topology, worker split, and router settings produce the best simulated result?
For disaggregated deployments, the search can cover:
- tensor-parallel shape for prefill and decode workers
- prefill and decode worker counts
- KV-router overlap credit
- prompt-load scaling
- throughput, TTFT, ITL, or end-to-end latency objectives
This is a heuristic search over simulated states, not an exact optimizer over every feasible configuration.
How It Works
Each candidate state is evaluated by the DynoSim run harness. The optimizer records the metrics from each run, filters candidates that violate SLA or GPU-budget constraints, and returns the best feasible state plus the full evaluated table for analysis.
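The filter-then-rank step described above can be sketched in plain Python. This is an illustration only: `CandidateResult`, `split_and_rank`, and the sample numbers are hypothetical and do not come from the DynoSim codebase. The real optimizer returns DataFrames, and the ranking objective here (throughput) is just one of the supported choices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateResult:
    """Hypothetical record of one simulated run; not the real DynoSim schema."""
    name: str
    total_gpus_used: int
    output_throughput_tok_s: float
    mean_ttft_ms: float


def split_and_rank(results, gpu_budget, max_ttft_ms):
    """Split results into feasible/infeasible and rank feasible ones by throughput."""
    feasible, infeasible = [], []
    for r in results:
        ok = r.total_gpus_used <= gpu_budget and r.mean_ttft_ms <= max_ttft_ms
        (feasible if ok else infeasible).append(r)
    feasible.sort(key=lambda r: r.output_throughput_tok_s, reverse=True)
    best = feasible[0] if feasible else None
    return best, feasible, infeasible


# Made-up metrics, used only to exercise the filter.
results = [
    CandidateResult("a", 16, 900.0, 1200.0),
    CandidateResult("b", 16, 1100.0, 2500.0),  # violates the TTFT bound
    CandidateResult("c", 24, 1500.0, 800.0),   # exceeds the GPU budget
    CandidateResult("d", 12, 700.0, 900.0),
]
best, feasible, infeasible = split_and_rank(results, gpu_budget=16, max_ttft_ms=2000.0)
print(best.name, [r.name for r in feasible], [r.name for r in infeasible])
```

Keeping the infeasible list rather than discarding it mirrors the idea of returning the full evaluated table alongside the winner, so near misses stay visible for analysis.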
The descent is budget-focused: each step prunes to near-budget-edge states so the sweep ends up at a TP/worker shape that actually consumes the available GPU budget, rather than at a pure throughput-per-GPU point. Aggregated sweeps collapse the TP and worker dimensions into (tp, workers) but otherwise follow the same idea.
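To make "near-budget-edge" concrete, the sketch below enumerates disaggregated shapes and keeps only those whose simulated footprint sits at or just under the budget. It is a brute-force enumeration, not the heuristic descent the optimizer performs, and the function names, the `tp_choices` tuple, and the `slack` parameter are assumptions introduced for illustration.

```python
from itertools import product


def gpus_used(prefill_tp, prefill_workers, decode_tp, decode_workers):
    """Simulated GPU footprint of one disaggregated state."""
    return prefill_tp * prefill_workers + decode_tp * decode_workers


def near_budget_states(total_gpus, tp_choices=(1, 2, 4, 8), slack=0):
    """Enumerate shapes whose footprint is within `slack` GPUs of the budget.

    Each state is (prefill_tp, prefill_workers, decode_tp, decode_workers).
    """
    states = []
    for prefill_tp, decode_tp in product(tp_choices, repeat=2):
        for prefill_workers in range(1, total_gpus // prefill_tp + 1):
            for decode_workers in range(1, total_gpus // decode_tp + 1):
                used = gpus_used(prefill_tp, prefill_workers, decode_tp, decode_workers)
                if total_gpus - slack <= used <= total_gpus:
                    states.append((prefill_tp, prefill_workers, decode_tp, decode_workers))
    return states


states = near_budget_states(16)
print(len(states), (4, 3, 1, 4) in states)
```

With `slack=0`, every surviving state consumes the full budget. A small positive `slack` admits shapes that leave a few GPUs idle.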
Spec Shape
The public API takes a single ReplayOptimizeSpec composed of:
Field names use lowerCamelCase to align with DynamoGraphDeploymentRequest concepts. Method names stay snake_case to match Pydantic convention.
Prerequisites
Run from the repository root.
Use the project virtual environment:
If the runtime _core bindings are not importable yet, build them first:
The example uses AIC-backed timing by default:
- AIC enumerates dense TP candidates
- AIC-backed engine timing is used for candidate configs
Install aiconfigurator into the project environment:
If a regular install fails to load usable perf data, reinstall from a source checkout that has real systems data materialized:
If DynoSim sweep setup fails with AIC errors about missing perf databases or parse failures such as KeyError: 'gemm_dtype', inspect the installed files under:
If those files begin with version https://git-lfs.github.com/spec/v1, you have Git LFS pointer stubs instead of real perf tables. Install aiconfigurator from a checkout or wheel that includes the real LFS materialized payloads in systems/.
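A quick way to check for pointer stubs is to read the first bytes of each file and compare them against the Git LFS pointer header. The helper names below are hypothetical, and the temporary files exist only to demonstrate the check. Point `find_lfs_pointers` at the installed systems directory in practice.

```python
import tempfile
from pathlib import Path

LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if the file starts with the Git LFS pointer header."""
    with open(path, "rb") as fh:
        return fh.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX


def find_lfs_pointers(root):
    """Recursively list files under `root` that are LFS pointer stubs."""
    return sorted(p for p in Path(root).rglob("*") if p.is_file() and is_lfs_pointer(p))


# Demonstration with made-up files: one stub and one file with real content.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "stub.txt").write_bytes(LFS_POINTER_PREFIX + b"\noid sha256:0000\nsize 123\n")
    (root / "real.txt").write_bytes(b"materialized perf data\n")
    stubs = [p.name for p in find_lfs_pointers(root)]

print(stubs)
```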
When running directly from a source checkout, expose the in-repo Python components and runtime bindings:
If the sweep uses multiple worker processes, prefer a real script file over a heredoc. On macOS, ProcessPoolExecutor child workers need a stable module path, and the driver module must guard its entry behind if __name__ == "__main__":.
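A minimal script layout that satisfies both requirements might look like the following. `evaluate_candidate` is a placeholder for whatever work each worker performs, not a DynoSim function. The key points are that the worker function lives at module top level, so child processes can import it, and that the pool is only created under the `__main__` guard.

```python
from concurrent.futures import ProcessPoolExecutor


def evaluate_candidate(worker_count):
    """Placeholder for one simulated trial; defined at module top level."""
    return worker_count, worker_count * 2


def main():
    # The pool is created only when the file is executed as a script,
    # so child processes that re-import this module do not spawn pools.
    with ProcessPoolExecutor(max_workers=2) as pool:
        results = list(pool.map(evaluate_candidate, [1, 2, 3, 4]))
    print(results)


if __name__ == "__main__":
    main()
```

Save this as a real `.py` file and run it with the interpreter. Under the spawn start method, which macOS uses by default, child processes re-import the driver module, which is why code piped in from a heredoc tends to fail.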
For KV-router logs, this filter keeps the run readable without hiding useful info output:
Run The Example
The canonical starting point is the checked-in driver script:
The default example searches a synthetic disaggregated KV-router workload using AIC-backed candidate timing. It prints the best feasible state and a table of top feasible configurations.
The example uses:
- model: Qwen/Qwen3-32B
- backend: vllm
- GPU SKU: h200_sxm
- total simulated GPUs: 16
- router mode: kv_router
- synthetic workload: isl=32768, osl=256, requestCount=5000, concurrency=200, sharedPrefixRatio=0.5, numPrefixGroups=50
The GPU budget is a simulated search constraint. You do not need 16 real GPUs locally to run the search.
The base engine args stay conservative:
- block_size=512
- enable_prefix_caching=True
- explicit worker_type for prefill versus decode
The example intentionally omits num_gpu_blocks; AIC-backed DynoSim estimates capacity for each candidate TP shape unless a base input explicitly pins it.
This setup does not force scheduler-specific bottlenecks such as:
- enable_chunked_prefill
- a small max_num_seqs
- a pinned max_num_batched_tokens
Only add those when the experiment is specifically about scheduler limits.
Run Against A Trace
To run against a Mooncake-style trace instead of the synthetic workload:
For a public starting point, download the FAST’25 toolagent trace:
Then run:
In trace mode:
- traceFile points at the Mooncake-style JSONL input
- arrivalSpeedupRatio compresses or stretches the trace arrival process
- synthetic-only knobs such as isl, osl, requestCount, concurrency, sharedPrefixRatio, and numPrefixGroups are ignored
Important notes for the public toolagent trace:
- the dataset uses Mooncake-style hash_ids with 512 tokens per block
- the underlying run_trace_replay(...) API defaults trace_block_size to 512
- WorkloadSpec does not yet expose a separate traceBlockSize field
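Because the block size is not configurable through WorkloadSpec, it can be worth confirming that a trace actually matches the 512-token assumption before replaying it. The sketch below checks one record. The field names `input_length` and `hash_ids` follow the commonly published Mooncake trace layout and are an assumption here, and the sample record is made up.

```python
import json
import math

BLOCK_SIZE = 512  # tokens per block, per the notes above


def block_count_matches(record, block_size=BLOCK_SIZE):
    """Check that len(hash_ids) equals ceil(input_length / block_size)."""
    expected = math.ceil(record["input_length"] / block_size)
    return len(record["hash_ids"]) == expected


# Made-up JSONL line in the assumed Mooncake-style layout.
line = '{"timestamp": 0, "input_length": 1300, "output_length": 20, "hash_ids": [0, 1, 2]}'
record = json.loads(line)
print(block_count_matches(record))       # 1300 tokens -> 3 blocks of 512
print(block_count_matches(record, 256))  # would need 6 blocks of 256
```

If a trace hashed with a different block size fails this check on most records, replaying it with the default would misreport prefix overlap.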
Customize A Sweep
Treat the example driver as a starting point, not a frozen harness. Modify it as needed for your search:
- change the WorkloadSpec shape or switch to a trace source with traceFile
- add SLA bounds on SLASpec, such as ttft, itl, e2eLatency, or their p95 variants
- change RouterSpec.overlapCredits within the valid 0.0 to 1.0 range
- change RouterSpec.prefillLoadScales when you want to weigh TTFT/prompt-side load more or less heavily
- print different columns from result.evaluated_df or result.feasible_df
- persist the tables to CSV or Parquet for downstream analysis
Useful axes to vary:
- HardwareSpec.totalGpus
- RouterSpec.overlapCredits
- RouterSpec.prefillLoadScales
- WorkloadSpec.sharedPrefixRatio
- WorkloadSpec.numPrefixGroups
- base prefill/decode engine args
If you want to compare routing strategies directly, use RouterSpec(mode="both") instead of the default KV-router-only search.
Outputs
The optimizer returns a DenseReplayOptimizationResult with:
- best_feasible: best visited state that satisfies all configured SLA and GPU-budget constraints
- best_infeasible: best visited state that misses at least one SLA bound or the budget
- evaluated_df: all visited states
- feasible_df: only feasible states
Common columns to inspect:
- topology: prefill_tp, decode_tp, prefill_workers, decode_workers
- routing: router_mode, overlap_score_credit, prefill_load_scale
- budget: total_gpus_used. This is the simulated GPU footprint of the candidate state, not a count of GPUs actually allocated on the machine running the search.
- throughput: output_throughput_tok_s
- cache behavior: prefix_cache_reused_ratio
- latency: mean_ttft_ms, mean_tpot_ms, mean_e2e_latency_ms
The report DataFrame still uses the Rust DynoSim runner’s metric keys (mean_ttft_ms, mean_tpot_ms, mean_e2e_latency_ms) even though the input SLASpec uses DGDR-style camelCase names (ttft, itl, e2eLatency). SLASpec carries an internal translation map.
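The naming gap can be bridged with a small lookup when checking report rows by hand. The mapping below is inferred from the order in which the two sets of names are listed above and is an assumption, not a copy of the map inside SLASpec. The bounds are made up for illustration.

```python
# Assumed correspondence between SLASpec field names and report columns.
SLA_TO_REPORT_COLUMN = {
    "ttft": "mean_ttft_ms",
    "itl": "mean_tpot_ms",
    "e2eLatency": "mean_e2e_latency_ms",
}


def violated_bounds(row, sla_bounds):
    """Return the SLA field names whose reported metric exceeds its bound.

    A bound of None means the SLA is not configured and is skipped.
    """
    return [
        name
        for name, bound in sla_bounds.items()
        if bound is not None and row[SLA_TO_REPORT_COLUMN[name]] > bound
    ]


row = {"mean_ttft_ms": 42800.0, "mean_tpot_ms": 35.0, "mean_e2e_latency_ms": 51900.0}
print(violated_bounds(row, {"ttft": 60000.0, "itl": 30.0, "e2eLatency": None}))
```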
In local testing, the default synthetic setup produced a non-trivial mean-E2E winner around:
- prefill_tp=4, decode_tp=1, prefill_workers=3, decode_workers=4, overlap_score_credit=0.5, prefill_load_scale=1.0
- output_throughput_tok_s ~= 970, prefix_cache_reused_ratio ~= 0.5, mean_ttft_ms ~= 42800, mean_tpot_ms ~= 35, mean_e2e_latency_ms ~= 51900
Treat those as sanity-check ranges, not fixed assertions.
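Two quick consistency checks on the reported winner can be done by hand. The footprint should match the 16-GPU budget, and the mean end-to-end latency should land close to TTFT plus one TPOT per output token. The second relation is a rough approximation assumed here for sanity checking, not a formula taken from the DynoSim runner.

```python
# Values copied from the reported local result above.
state = {"prefill_tp": 4, "prefill_workers": 3, "decode_tp": 1, "decode_workers": 4}
mean_ttft_ms, mean_tpot_ms, mean_e2e_latency_ms = 42800, 35, 51900
osl = 256  # output sequence length of the synthetic workload

# Footprint: 4 * 3 prefill GPUs plus 1 * 4 decode GPUs.
gpus = (
    state["prefill_tp"] * state["prefill_workers"]
    + state["decode_tp"] * state["decode_workers"]
)

# Rough estimate: time to first token plus per-token decode time.
approx_e2e_ms = mean_ttft_ms + osl * mean_tpot_ms
rel_gap = abs(approx_e2e_ms - mean_e2e_latency_ms) / mean_e2e_latency_ms

print(gpus, approx_e2e_ms, round(rel_gap, 4))
```

The estimate comes out at 51760 ms, within about 0.3% of the reported 51900 ms, and the footprint is exactly 16 GPUs. A result that fails either check by a wide margin suggests a setup problem rather than a routing effect.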
Relationship To DynoSim Runs
A DynoSim run answers “how does this one configuration perform?” A DynoSim sweep answers “which configuration should I try next?”
For final validation, take feasible candidates into a live Mocker deployment or a real-GPU AIPerf benchmark. DynoSim is designed to narrow the search space before cluster validation, not to replace it.