---
title: Deploying Your First Model
---
# Deploying Your First Model
End-to-end tutorial for deploying `Qwen/Qwen3-0.6B` on Kubernetes using Dynamo's recommended
`DynamoGraphDeploymentRequest` (DGDR) workflow — from zero to your first inference response.
This guide assumes you have already completed the
[platform installation](/dynamo/kubernetes-deployment/deployment-guide/detailed-installation-guide) and that the Dynamo operator and CRDs are
running in your cluster.
## What is a DynamoGraphDeploymentRequest?
A `DynamoGraphDeploymentRequest` (DGDR) is Dynamo's **deploy-by-intent** API. You describe what
you want to run and your performance targets; Dynamo's profiler determines the optimal
configuration automatically, then creates the live deployment for you.
| | DGDR (this guide) | DGD (manual) |
|---|---|---|
| **You provide** | Model + optional SLA targets | Full deployment spec |
| **Profiling** | Automated | You bring your own config |
| **Best for** | Getting started, SLA-driven deployments | Fine-grained control |
For a deeper comparison, see [Understanding Dynamo's Custom Resources](/dynamo/kubernetes-deployment/deployment-guide#understanding-dynamos-custom-resources).
## Prerequisites
Before starting, confirm:
- Platform installed: `kubectl get pods -n ${NAMESPACE}` shows operator pods `Running`
- CRDs present: `kubectl get crd | grep dynamo` shows `dynamographdeploymentrequests.nvidia.com`
- `kubectl` and `helm` available in your shell
Set these variables once — they are referenced throughout the guide:
```bash
export NAMESPACE=dynamo-system # namespace where the platform is installed
export RELEASE_VERSION=1.x.x # match the installed platform version (e.g. 1.0.0)
export HF_TOKEN= # HuggingFace token
```
`Qwen/Qwen3-0.6B` is a public model. A HuggingFace token is not strictly required to download
it, but is recommended to avoid rate limiting.
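Since the rest of the guide assumes these variables are set, a quick guard in your shell can catch a stale session early. This `check_env` helper is a hypothetical convenience (not part of Dynamo) that fails fast if `NAMESPACE` is unset or `RELEASE_VERSION` was left at the `1.x.x` placeholder:

```shell
# Hypothetical sanity check -- fails fast if the variables above were not set.
check_env() {
  if [ -z "${NAMESPACE}" ]; then
    echo "NAMESPACE is not set" >&2
    return 1
  fi
  case "${RELEASE_VERSION}" in
    ""|1.x.x)
      echo "RELEASE_VERSION must be a real version, e.g. 1.0.0" >&2
      return 1
      ;;
  esac
  echo "env ok: NAMESPACE=${NAMESPACE} RELEASE_VERSION=${RELEASE_VERSION}"
}
```

Run `check_env` before each step if you switch terminals mid-guide.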
## Step 1: Configure Namespace and Secrets
```bash
# Create the namespace (idempotent — safe to run even if it already exists)
kubectl create namespace ${NAMESPACE} --dry-run=client -o yaml | kubectl apply -f -
# Create the HuggingFace token secret for model download
kubectl create secret generic hf-token-secret \
--from-literal=HF_TOKEN="${HF_TOKEN}" \
-n ${NAMESPACE}
```
Verify the secret was created:
```bash
kubectl get secret hf-token-secret -n ${NAMESPACE}
```
## Step 2: Create the DynamoGraphDeploymentRequest
Save the following as `qwen3-first-model.yaml`:
```yaml
apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeploymentRequest
metadata:
name: qwen3-first-model
spec:
# Model to profile and deploy
model: Qwen/Qwen3-0.6B
# Container image for the profiling job — must match your installed platform version.
# This is the same dynamo-frontend image used by the deployed inference service.
image: "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:${RELEASE_VERSION}"
```
Apply it (uses `envsubst` to substitute the `RELEASE_VERSION` shell variable into the YAML):
```bash
envsubst < qwen3-first-model.yaml | kubectl apply -f - -n ${NAMESPACE}
```
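`envsubst` ships with GNU gettext and may be missing on minimal systems. If it is unavailable, a plain `sed` substitution handles this one variable; the helper below is a sketch of that fallback (it replaces only the literal `${RELEASE_VERSION}` token):

```shell
# Fallback when envsubst is not installed: substitute ${RELEASE_VERSION} with sed.
substitute_version() {
  sed "s|\${RELEASE_VERSION}|${RELEASE_VERSION}|g"
}
# Usage:
#   substitute_version < qwen3-first-model.yaml | kubectl apply -f - -n ${NAMESPACE}
```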
### Field reference
| Field | Required | Default | Purpose |
|---|---|---|---|
| `model` | Yes | — | HuggingFace model ID (e.g. `Qwen/Qwen3-0.6B`) |
| `image` | No | — | Container image for the profiling job (`dynamo-frontend`) |
| `backend` | No | `auto` | Inference engine (`auto`, `vllm`, `sglang`, `trtllm`) |
| `searchStrategy` | No | `rapid` | Profiling depth — `rapid` (~30s, AIC simulation) or `thorough` (2–4h, real GPUs) |
| `autoApply` | No | `true` | Automatically create and start the deployment after profiling |
| `sla` | No | — | Target latency (TTFT, ITL in ms) for profiler optimization |
| `workload` | No | — | Expected traffic shape (ISL, OSL, request rate) |
| `hardware` | No | auto-detected | GPU SKU and count override; required when GPU discovery is disabled. When not set, the auto-discovered GPU count is capped at 32 — set `hardware.totalGpus` explicitly to use more. |
For the full spec reference, see the [DGDR API Reference](/dynamo/additional-resources/api-reference-k-8-s) and
[Profiler Guide](/dynamo/components/profiler/profiler-guide).
If you are using a **namespace-scoped operator** with GPU discovery disabled, you must also
provide explicit hardware info or the DGDR will be rejected at admission:
```yaml
spec:
...
hardware:
numGpusPerNode: 1
gpuSku: "H100-SXM5-80GB"
vramMb: 81920
```
See the [installation guide](/dynamo/kubernetes-deployment/deployment-guide/detailed-installation-guide#gpu-discovery-for-dynamographdeploymentrequests-with-namespace-scoped-operators)
for details.
## Step 3: Monitor Profiling Progress
Profiling is the automated step where Dynamo sweeps across candidate configurations (parallelism, batching, scheduling strategies) to find the one that best meets your SLA on your hardware — so you don't have to tune it manually.
Watch the DGDR status in real time:
```bash
kubectl get dynamographdeploymentrequest qwen3-first-model -n ${NAMESPACE} -w
```
The `PHASE` column progresses through:
| Phase | What is happening |
|---|---|
| `Pending` (condition: `DiscoveringHardware`) | Spec validated; operator is discovering GPU hardware and preparing the profiling job |
| `Profiling` | Profiling job is running (AIC simulation or real-GPU sweep) |
| `Ready` | Profiling complete; optimal config stored in `.status`. Terminal state when `autoApply: false` |
| `Deploying` | Creating the `DynamoGraphDeployment` (only when `autoApply: true`) |
| `Deployed` | DGD is running and healthy |
| `Failed` | Unrecoverable error — check events for details |
`Deployed` is the success terminal state when `autoApply: true` (the default).
If you set `autoApply: false`, the phase stops at `Ready` — profiling is complete and the
generated DGD spec is stored in `.status`, but no deployment is created automatically.
To inspect and deploy it manually:
```bash
# View the generated DGD spec
kubectl get dynamographdeploymentrequest qwen3-first-model -n ${NAMESPACE} \
-o jsonpath='{.status.profilingResults.selectedConfig}' | python3 -m json.tool
# Save it and apply
kubectl get dynamographdeploymentrequest qwen3-first-model -n ${NAMESPACE} \
-o jsonpath='{.status.profilingResults.selectedConfig}' > generated-dgd.yaml
kubectl apply -f generated-dgd.yaml -n ${NAMESPACE}
```
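For scripting, the phase table above can be encoded in a small helper. This is a sketch (the function names are ours, and it assumes the phase is exposed at `.status.phase`): `Ready` is terminal only when `autoApply: false`, so the classifier takes the `autoApply` setting as a second argument.

```shell
# Classify a DGDR phase as terminal or not (phase names from the table above).
# Usage: is_terminal_phase <phase> <autoApply: true|false>
is_terminal_phase() {
  case "$1" in
    Deployed|Failed) return 0 ;;
    Ready)
      # Terminal only when autoApply is false.
      if [ "$2" = "false" ]; then return 0; else return 1; fi
      ;;
    *) return 1 ;;
  esac
}

# Example polling loop (requires cluster access; not run here):
wait_for_dgdr() {
  while true; do
    phase=$(kubectl get dynamographdeploymentrequest "$1" -n "${NAMESPACE}" \
      -o jsonpath='{.status.phase}')
    if is_terminal_phase "$phase" "${2:-true}"; then
      echo "$phase"
      return
    fi
    sleep 10
  done
}
```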
For a full status summary and events:
```bash
kubectl describe dynamographdeploymentrequest qwen3-first-model -n ${NAMESPACE}
```
To follow the profiling job logs:
```bash
# Find the profiling pod
kubectl get pods -n ${NAMESPACE} -l nvidia.com/dgdr-name=qwen3-first-model
# Stream its logs
kubectl logs -f <profiling-pod-name> -n ${NAMESPACE}
```
With `searchStrategy: rapid`, the AIC simulation itself takes about 30 seconds, and the end-to-end profiling job typically completes in under 15 minutes.
## Step 4: Verify the Deployment
Once the DGDR reaches `Deployed`, the `DynamoGraphDeployment` has been created automatically.
Check that everything is running:
```bash
# See the auto-created DGD
kubectl get dynamographdeployment -n ${NAMESPACE}
# Confirm all pods are Running
kubectl get pods -n ${NAMESPACE}
```
Wait until pods are ready:
```bash
kubectl wait --for=condition=ready pod \
-l nvidia.com/dynamo-deployment=qwen3-first-model \
-n ${NAMESPACE} \
--timeout=600s
```
Find the frontend service name:
```bash
kubectl get svc -n ${NAMESPACE} | grep frontend
```
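To avoid copy-pasting the name into Step 5 by hand, you can capture it in a variable. The filter below is a small sketch that picks the first service whose name contains `frontend` from `kubectl get svc -o name` output (it assumes exactly one such service in the namespace):

```shell
# Pick the first "frontend" service name from `kubectl get svc -o name` output.
pick_frontend_svc() {
  grep frontend | head -n 1 | sed 's|^service/||'
}
# Usage:
#   FRONTEND_SVC=$(kubectl get svc -n ${NAMESPACE} -o name | pick_frontend_svc)
#   kubectl port-forward "svc/${FRONTEND_SVC}" 8000:8000 -n ${NAMESPACE} &
```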
## Step 5: Send Your First Request
Port-forward to the frontend and send an inference request:
```bash
# Start port-forward (replace with the name from Step 4)
kubectl port-forward svc/<frontend-service-name> 8000:8000 -n ${NAMESPACE} &
# Confirm the model is available
curl http://localhost:8000/v1/models
# Send a chat completion request
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-0.6B",
"messages": [{"role": "user", "content": "What is NVIDIA Dynamo?"}],
"max_tokens": 200
}'
```
A successful response looks like:
```json
{
"id": "chatcmpl-...",
"object": "chat.completion",
"model": "Qwen/Qwen3-0.6B",
"choices": [{
"message": {
"role": "assistant",
"content": "NVIDIA Dynamo is a high-performance inference framework..."
}
}]
}
```
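For scripting, you may want just the assistant text rather than the full JSON body. This small filter uses `python3` (avoiding a `jq` dependency) and follows the field path shown in the response above:

```shell
# Print choices[0].message.content from an OpenAI-style chat completion response.
extract_content() {
  python3 -c 'import json,sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
}
# Usage (after the port-forward above):
#   curl -s http://localhost:8000/v1/chat/completions ... | extract_content
```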
Your first model is now live.
## Cleanup
To remove the DGDR and its profiling artifacts:
```bash
kubectl delete dynamographdeploymentrequest qwen3-first-model -n ${NAMESPACE}
```
Deleting a DGDR does **not** delete the `DynamoGraphDeployment` it created. The DGD persists
independently so it can continue serving traffic.
## Troubleshooting
**DGDR stuck in `Pending`**
```bash
kubectl describe dynamographdeploymentrequest qwen3-first-model -n ${NAMESPACE}
# Check the Events section at the bottom
```
Common causes: no available GPU nodes, image pull failure (check image tag; NGC credentials are
optional but may be needed if you hit rate limits pulling from public NGC), missing `hardware`
config for a namespace-scoped operator.
**GPU node taints** are a frequent cause of pods staying `Pending`. Many clusters (including
GKE by default and most shared/HPC environments) taint GPU nodes with
`nvidia.com/gpu:NoSchedule` so that only GPU-aware workloads land on them. If the profiling
job pod is stuck with a `0/N nodes are available: … node(s) had untolerated taint` event,
add a toleration to your DGDR via `overrides.profilingJob`. The operator and profiler
automatically forward it to every candidate and deployed pod:
```yaml
spec:
...
overrides:
profilingJob:
template:
spec:
containers: [] # required placeholder; leave empty to inherit defaults
tolerations:
- key: nvidia.com/gpu
operator: Exists
effect: NoSchedule
```
**Profiling job fails**
```bash
kubectl get pods -n ${NAMESPACE} -l nvidia.com/dgdr-name=qwen3-first-model
kubectl logs <profiling-pod-name> -n ${NAMESPACE}
# If the pod has already exited:
kubectl logs <profiling-pod-name> -n ${NAMESPACE} --previous
```
**Pods not starting after profiling**
```bash
kubectl describe pod <pod-name> -n ${NAMESPACE}
# Look for ImagePullBackOff, OOMKilled, or Insufficient resources
```
**Model not responding after port-forward**
```bash
# Check frontend is ready
kubectl get pods -n ${NAMESPACE} | grep frontend
# Check frontend logs
kubectl logs <frontend-pod-name> -n ${NAMESPACE}
```
## Next Steps
- **Tune for production SLAs**: Add `sla` (TTFT, ITL) and `workload` (ISL, OSL) targets to
your DGDR so the profiler optimizes for your specific traffic. See the
[Profiler Guide](/dynamo/components/profiler/profiler-guide) for the full configuration
reference and picking modes. For ready-to-use YAML — including SLA targets, private models,
MoE, and overrides — see [DGDR Examples](/dynamo/components/profiler/profiler-examples).
- **Scale the deployment**: [Autoscaling guide](/dynamo/kubernetes-deployment/deployment-guide/autoscaling)
- **SLA-aware autoscaling**: Enable the Planner via `features.planner` in the DGDR —
see the [Planner Guide](/dynamo/components/planner/planner-guide).
- **Inspect the generated config**: Set `autoApply: false` and extract the DGD spec with
`kubectl get dgdr -o jsonpath='{.status.profilingResults.selectedConfig}'`
before deploying.
- **Direct control**: [Creating Deployments](/dynamo/additional-resources/creating-deployments) — write your own
`DynamoGraphDeployment` spec for full customization.
- **Monitor performance**: [Observability](/dynamo/kubernetes-deployment/observability-k-8-s/metrics)
- **Try specific backends**: [vLLM](/dynamo/backends/v-llm),
[SGLang](/dynamo/backends/sg-lang), [TensorRT-LLM](/dynamo/backends/tensor-rt-llm)