Deploying Your First Model
End-to-end tutorial for deploying Qwen/Qwen3-0.6B on Kubernetes using Dynamo’s recommended
DynamoGraphDeploymentRequest (DGDR) workflow — from zero to your first inference response.
This guide assumes you have already completed the platform installation and that the Dynamo operator and CRDs are running in your cluster.
What is a DynamoGraphDeploymentRequest?
A DynamoGraphDeploymentRequest (DGDR) is Dynamo’s deploy-by-intent API. You describe what
you want to run and your performance targets; Dynamo’s profiler determines the optimal
configuration automatically, then creates the live deployment for you.
For a deeper comparison, see Understanding Dynamo’s Custom Resources.
Prerequisites
Before starting, confirm:
- Platform installed: `kubectl get pods -n ${NAMESPACE}` shows operator pods `Running`
- CRDs present: `kubectl get crd | grep dynamo` shows `dynamographdeploymentrequests.nvidia.com`
- `kubectl` and `helm` available in your shell
Set these variables once — they are referenced throughout the guide:
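For example (the values below are illustrative; `NAMESPACE` and `RELEASE_VERSION` are the names referenced later in this guide, and `HF_TOKEN` is optional for this public model):

```shell
# Namespace where the Dynamo platform is installed
export NAMESPACE=dynamo
# Dynamo release/image tag to deploy (illustrative value; use your installed version)
export RELEASE_VERSION=0.5.0
# Optional: HuggingFace token to avoid rate limiting
export HF_TOKEN="hf_xxxxxxxx"
```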
Qwen/Qwen3-0.6B is a public model. A HuggingFace token is not strictly required to download
it, but is recommended to avoid rate limiting.
Step 1: Configure Namespace and Secrets
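A minimal sketch of this step, assuming the secret is named `hf-token-secret` with key `HF_TOKEN` (check your installation guide for the names your operator expects):

```shell
# Create the namespace if it does not already exist
kubectl create namespace ${NAMESPACE} --dry-run=client -o yaml | kubectl apply -f -

# Store the HuggingFace token as a Kubernetes secret
# (secret/key names here are assumptions; match what your deployment references)
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN=${HF_TOKEN} \
  -n ${NAMESPACE}
```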
Verify the secret was created:
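Assuming the secret name `hf-token-secret` from the sketch above:

```shell
kubectl get secret hf-token-secret -n ${NAMESPACE}
```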
Step 2: Create the DynamoGraphDeploymentRequest
Save the following as qwen3-first-model.yaml:
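The original manifest is not reproduced here; the sketch below shows the general shape of a DGDR, and its field names and image are assumptions. Consult the DGDR API Reference for the authoritative schema:

```yaml
# Sketch only: field names and values below are assumptions, not the official spec.
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
  name: qwen3-first-model
spec:
  model: Qwen/Qwen3-0.6B
  backend: vllm                                   # assumed backend choice
  image: my-registry/runtime:${RELEASE_VERSION}   # placeholder image; substituted by envsubst
  profilingConfig:
    searchStrategy: rapid                         # fast sweep (see Step 3)
  autoApply: true                                 # deploy automatically once profiling completes
```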
Apply it (uses envsubst to substitute the RELEASE_VERSION shell variable into the YAML):
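```shell
envsubst < qwen3-first-model.yaml | kubectl apply -f -
```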
Field reference
For the full spec reference, see the DGDR API Reference and Profiler Guide.
If you are using a namespace-scoped operator with GPU discovery disabled, you must also provide explicit hardware info or the DGDR will be rejected at admission:
See the installation guide for details.
Step 3: Monitor Profiling Progress
Profiling is the automated step where Dynamo sweeps across candidate configurations (parallelism, batching, scheduling strategies) to find the one that best meets your SLA and hardware — so you don’t have to tune it manually.
Watch the DGDR status in real time:
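Assuming the resource name `qwen3-first-model` from the manifest above:

```shell
kubectl get dgdr qwen3-first-model -n ${NAMESPACE} -w
```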
The PHASE column progresses from Pending, through profiling, to Ready and finally Deployed.
`Deployed` is the terminal success state when `autoApply: true` (the default).
If you set `autoApply: false`, the phase stops at `Ready` — profiling is complete and the
generated DGD spec is stored in `.status`, but no deployment is created automatically.
To inspect and deploy it manually:
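A sketch using the status path cited later in this guide (the DGDR name is assumed):

```shell
# Extract the generated DGD spec from the DGDR status
kubectl get dgdr qwen3-first-model -n ${NAMESPACE} \
  -o jsonpath='{.status.profilingResults.selectedConfig}' > generated-dgd.yaml

# Review it, then create the deployment yourself
kubectl apply -f generated-dgd.yaml -n ${NAMESPACE}
```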
For a full status summary and events:
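```shell
kubectl describe dgdr qwen3-first-model -n ${NAMESPACE}
```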
To follow the profiling job logs:
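The profiling job name varies per DGDR, so list jobs first:

```shell
kubectl get jobs -n ${NAMESPACE}
kubectl logs -f job/<profiling-job-name> -n ${NAMESPACE}
```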
With `searchStrategy: rapid`, profiling typically completes in under 15 minutes on a single GPU.

Step 4: Verify the Deployment
Once the DGDR reaches Deployed, the DynamoGraphDeployment has been created automatically.
Check that everything is running:
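```shell
kubectl get dynamographdeployments -n ${NAMESPACE}
kubectl get pods -n ${NAMESPACE}
```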
Wait until pods are ready:
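```shell
kubectl wait --for=condition=Ready pods --all -n ${NAMESPACE} --timeout=600s
```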
Find the frontend service name:
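```shell
kubectl get svc -n ${NAMESPACE} | grep -i frontend
```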
Step 5: Send Your First Request
Port-forward to the frontend and send an inference request:
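A sketch assuming the frontend listens on port 8000 and exposes an OpenAI-compatible endpoint (substitute the service name found in the previous step):

```shell
# Forward local port 8000 to the frontend service
kubectl port-forward svc/<frontend-service> 8000:8000 -n ${NAMESPACE} &

# Send an OpenAI-compatible chat completion request
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 32
  }'
```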
A successful response looks like:
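Roughly this OpenAI-style shape (the id, content, and exact fields are illustrative):

```json
{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "model": "Qwen/Qwen3-0.6B",
  "choices": [
    {
      "index": 0,
      "message": { "role": "assistant", "content": "Hello! ..." },
      "finish_reason": "stop"
    }
  ]
}
```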
Your first model is now live.
Cleanup
To remove the deployment and profiling artifacts:
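A sketch assuming the DGD inherited the DGDR's name (verify with `kubectl get dynamographdeployments` first); note that, as described below, the DGD must be deleted separately:

```shell
# Delete the DGDR (removes the request and profiling artifacts)
kubectl delete dgdr qwen3-first-model -n ${NAMESPACE}

# The DGD it created persists independently and must be deleted explicitly
kubectl delete dynamographdeployment qwen3-first-model -n ${NAMESPACE}
```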
Deleting a DGDR does not delete the DynamoGraphDeployment it created. The DGD persists
independently so it can continue serving traffic.
Troubleshooting
DGDR stuck in Pending
Common causes: no available GPU nodes, image pull failure (check image tag; NGC credentials are
optional but may be needed if you hit rate limits pulling from public NGC), missing hardware
config for a namespace-scoped operator.
GPU node taints are a frequent cause of pods staying Pending. Many clusters (including
GKE by default and most shared/HPC environments) taint GPU nodes with
nvidia.com/gpu:NoSchedule so that only GPU-aware workloads land on them. If the profiling
job pod is stuck with a 0/N nodes are available: … node(s) had untolerated taint event,
add a toleration to your DGDR via overrides.profilingJob. The operator and profiler
automatically forward it to every candidate and deployed pod:
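A sketch of such a toleration; the exact placement under `overrides.profilingJob` is an assumption, so check the DGDR API Reference:

```yaml
# Sketch: toleration forwarded from the DGDR to profiling and deployed pods
spec:
  overrides:
    profilingJob:
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
```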
Profiling job fails
Pods not starting after profiling
Model not responding after port-forward
Next Steps
- Tune for production SLAs: Add `sla` (TTFT, ITL) and `workload` (ISL, OSL) targets to your DGDR so the profiler optimizes for your specific traffic. See the Profiler Guide for the full configuration reference and picking modes. For ready-to-use YAML — including SLA targets, private models, MoE, and overrides — see DGDR Examples.
- Scale the deployment: Autoscaling guide
- SLA-aware autoscaling: Enable the Planner via `features.planner` in the DGDR — see the Planner Guide.
- Inspect the generated config: Set `autoApply: false` and extract the DGD spec with `kubectl get dgdr <name> -o jsonpath='{.status.profilingResults.selectedConfig}'` before deploying.
- Direct control: Creating Deployments — write your own `DynamoGraphDeployment` spec for full customization.
- Monitor performance: Observability
- Try specific backends: vLLM, SGLang, TensorRT-LLM