API Reference (K8s)
⚠️ Important: This documentation is automatically generated from source code. Do not edit this file directly.
API Reference
Packages
nvidia.com/v1alpha1
Package v1alpha1 contains API Schema definitions for the nvidia.com v1alpha1 API group.
This package defines the DynamoGraphDeploymentRequest (DGDR) custom resource, which provides a high-level, SLA-driven interface for deploying machine learning models on Dynamo.
Resource Types
- DynamoComponentDeployment
- DynamoGraphDeployment
- DynamoGraphDeploymentRequest
- DynamoGraphDeploymentScalingAdapter
- DynamoModel
Autoscaling
Deprecated: This field is deprecated and ignored. Use DynamoGraphDeploymentScalingAdapter with HPA, KEDA, or Planner for autoscaling instead. See docs/kubernetes/autoscaling.md for migration guidance. This field will be removed in a future API version.
Appears in:
ComponentKind
Underlying type: string
ComponentKind represents the type of underlying Kubernetes resource.
Validation:
- Enum: [PodClique PodCliqueScalingGroup Deployment LeaderWorkerSet]
Appears in:
ConfigMapKeySelector
ConfigMapKeySelector selects a specific key from a ConfigMap. Used to reference external configuration data stored in ConfigMaps.
Appears in:
DeploymentOverridesSpec
DeploymentOverridesSpec allows users to customize metadata for auto-created DynamoGraphDeployments. When autoApply is enabled, these overrides are applied to the generated DGD resource.
Appears in:
DeploymentStatus
DeploymentStatus tracks the state of an auto-created DynamoGraphDeployment. This status is populated when autoApply is enabled and a DGD is created.
Appears in:
DynamoComponentDeployment
DynamoComponentDeployment is the Schema for the dynamocomponentdeployments API
DynamoComponentDeploymentSharedSpec
Appears in:
DynamoComponentDeploymentSpec
DynamoComponentDeploymentSpec defines the desired state of DynamoComponentDeployment
Appears in:
DynamoGraphDeployment
DynamoGraphDeployment is the Schema for the dynamographdeployments API.
DynamoGraphDeploymentRequest
DynamoGraphDeploymentRequest is the Schema for the dynamographdeploymentrequests API. It serves as the primary interface for users to request model deployments with specific performance and resource constraints, enabling SLA-driven deployments.
Lifecycle:
- Initial → Pending: Validates spec and prepares for profiling
- Pending → Profiling: Creates and runs profiling job (online or AIC)
- Profiling → Ready/Deploying: Generates DGD spec after profiling completes
- Deploying → Ready: When autoApply=true, monitors DGD until Ready
- Ready: Terminal state when DGD is operational or spec is available
- DeploymentDeleted: Terminal state when auto-created DGD is manually deleted
The spec becomes immutable once profiling starts. Users must delete and recreate the DGDR to modify configuration after this point.
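For orientation, a minimal manifest sketch is shown below. Apart from `autoApply`, which this document references, the placement and names of `spec` fields are assumptions; the authoritative schema is DynamoGraphDeploymentRequestSpec.

```yaml
# Hedged sketch of a DynamoGraphDeploymentRequest that auto-applies the generated DGD.
# Only autoApply is referenced elsewhere in this document; its placement and the
# remaining spec fields are assumptions, defined by DynamoGraphDeploymentRequestSpec.
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
  name: my-model-request   # hypothetical name
spec:
  autoApply: true          # when true, the controller creates and monitors a DGD
  # ...model and profiling configuration per DynamoGraphDeploymentRequestSpec...
```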
DynamoGraphDeploymentRequestSpec
DynamoGraphDeploymentRequestSpec defines the desired state of a DynamoGraphDeploymentRequest. This CRD serves as the primary interface for users to request model deployments with specific performance constraints and resource requirements, enabling SLA-driven deployments.
Appears in:
DynamoGraphDeploymentRequestStatus
DynamoGraphDeploymentRequestStatus represents the observed state of a DynamoGraphDeploymentRequest. The controller updates this status as the DGDR progresses through its lifecycle.
Appears in:
DynamoGraphDeploymentScalingAdapter
DynamoGraphDeploymentScalingAdapter provides a scaling interface for individual services within a DynamoGraphDeployment. It implements the Kubernetes scale subresource, enabling integration with HPA, KEDA, and custom autoscalers.
The adapter acts as an intermediary between autoscalers and the DGD, ensuring that only the adapter controller modifies the DGD’s service replicas. This prevents conflicts when multiple autoscaling mechanisms are in play.
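Because the adapter exposes the standard scale subresource, a stock HorizontalPodAutoscaler can target it directly. A hedged sketch (the adapter name, replica bounds, and metric choice are illustrative):

```yaml
# Hedged sketch: an HPA driving a DynamoGraphDeploymentScalingAdapter through
# the scale subresource. Adapter name, bounds, and metric are illustrative.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: frontend-hpa
spec:
  scaleTargetRef:
    apiVersion: nvidia.com/v1alpha1
    kind: DynamoGraphDeploymentScalingAdapter
    name: frontend-adapter            # hypothetical adapter name
  minReplicas: 1
  maxReplicas: 8
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
```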
DynamoGraphDeploymentScalingAdapterSpec
DynamoGraphDeploymentScalingAdapterSpec defines the desired state of DynamoGraphDeploymentScalingAdapter
Appears in:
DynamoGraphDeploymentScalingAdapterStatus
DynamoGraphDeploymentScalingAdapterStatus defines the observed state of DynamoGraphDeploymentScalingAdapter
Appears in:
DynamoGraphDeploymentServiceRef
DynamoGraphDeploymentServiceRef identifies a specific service within a DynamoGraphDeployment
Appears in:
DynamoGraphDeploymentSpec
DynamoGraphDeploymentSpec defines the desired state of DynamoGraphDeployment.
Appears in:
DynamoGraphDeploymentStatus
DynamoGraphDeploymentStatus defines the observed state of DynamoGraphDeployment.
Appears in:
DynamoModel
DynamoModel is the Schema for the dynamo models API
DynamoModelSpec
DynamoModelSpec defines the desired state of DynamoModel
Appears in:
DynamoModelStatus
DynamoModelStatus defines the observed state of DynamoModel
Appears in:
EndpointInfo
EndpointInfo represents a single endpoint (pod) serving the model
Appears in:
ExtraPodMetadata
Appears in:
ExtraPodSpec
Appears in:
IngressSpec
Appears in:
IngressTLSSpec
Appears in:
ModelReference
ModelReference identifies a model served by this component
Appears in:
ModelSource
ModelSource defines the source location of a model
Appears in:
MultinodeSpec
Appears in:
PVC
Appears in:
ProfilingConfigSpec
ProfilingConfigSpec defines configuration for the profiling process. This structure maps directly to the profile_sla.py config format. See benchmarks/profiler/utils/profiler_argparse.py for the complete schema.
Appears in:
ResourceItem
Appears in:
Resources
Resources defines requested and limits for a component, including CPU, memory, GPUs/devices, and any runtime-specific resources.
Appears in:
Restart
Appears in:
RestartPhase
Underlying type: string
Appears in:
RestartStatus
RestartStatus contains the status of the restart of the graph deployment.
Appears in:
RestartStrategy
Appears in:
RestartStrategyType
Underlying type: string
Appears in:
ScalingAdapter
ScalingAdapter configures whether a service uses the DynamoGraphDeploymentScalingAdapter for replica management. When enabled, the DGDSA owns the replicas field and external autoscalers (HPA, KEDA, Planner) can control scaling via the Scale subresource.
Appears in:
ServiceReplicaStatus
ServiceReplicaStatus contains replica information for a single service.
Appears in:
SharedMemorySpec
Appears in:
VolumeMount
VolumeMount references a PVC defined at the top level for volumes to be mounted by the component
Appears in:
Operator Default Values Injection
The Dynamo operator automatically applies default values to various fields when they are not explicitly specified in your deployments. These defaults include:
- Health Probes: Startup, liveness, and readiness probes are configured differently for frontend, worker, and planner components. For example, worker components receive a startup probe with a 2-hour timeout (720 failures × 10 seconds) to accommodate long model loading times.
- Security Context: All components receive `fsGroup: 1000` by default to ensure proper file permissions for mounted volumes. This can be overridden via the `extraPodSpec.securityContext` field.
- Shared Memory: All components receive an 8Gi shared memory volume mounted at `/dev/shm` by default (can be disabled or resized via the `sharedMemory` field).
- Environment Variables: Components automatically receive environment variables like `DYN_NAMESPACE`, `DYN_PARENT_DGD_K8S_NAME`, `DYNAMO_PORT`, and backend-specific variables.
- Pod Configuration: Default `terminationGracePeriodSeconds` of 60 seconds and `restartPolicy: Always`.
- Autoscaling: When enabled without explicit metrics, defaults to CPU-based autoscaling with 80% target utilization.
- Backend-Specific Behavior: For multinode deployments, probes are automatically modified or removed for worker nodes depending on the backend framework (VLLM, SGLang, or TensorRT-LLM).
Pod Specification Defaults
All components receive the following pod-level defaults unless overridden:
- `terminationGracePeriodSeconds`: 60 seconds
- `restartPolicy`: `Always`
Security Context
The operator automatically applies default security context settings to all components to ensure proper file permissions, particularly for mounted volumes:
- `fsGroup: 1000` - Sets the group ownership of mounted volumes and any files created in those volumes
This default ensures that non-root containers can write to mounted volumes (like model caches or persistent storage) without permission issues. The fsGroup setting is particularly important for:
- Model downloads and caching
- Compilation cache directories
- Persistent volume claims (PVCs)
- SSH key generation in multinode deployments
Overriding Security Context
To override the default security context, specify your own securityContext in the extraPodSpec of your component:
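The following is a hedged sketch rather than a canonical example; only `extraPodSpec.securityContext` comes from this document, and the surrounding `services` layout is an assumption:

```yaml
# Hedged sketch: overriding the default securityContext for one component.
# Only extraPodSpec.securityContext is documented here; the services layout
# is illustrative.
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-deployment
spec:
  services:
    Frontend:
      extraPodSpec:
        securityContext:
          runAsNonRoot: true
          runAsUser: 1000
          runAsGroup: 1000
          fsGroup: 2000        # replaces the operator default of fsGroup: 1000
```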
Important: When you provide any securityContext object in extraPodSpec, the operator will not inject any defaults. This gives you complete control over the security context, including the ability to run as root (by omitting runAsNonRoot or setting it to false).
OpenShift and Security Context Constraints
In OpenShift environments with Security Context Constraints (SCCs), you may need to omit explicit UID/GID values to allow OpenShift’s admission controllers to assign them dynamically:
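A hedged sketch of that approach (the `services` layout is illustrative; the key point is supplying a `securityContext` without explicit UID/GID values so no operator defaults are injected):

```yaml
# Hedged sketch for OpenShift: provide a securityContext with no UID/GID/fsGroup
# so the SCC admission controller assigns them from the project's allowed range.
# The services layout is illustrative.
spec:
  services:
    Frontend:
      extraPodSpec:
        securityContext:
          runAsNonRoot: true   # no runAsUser/runAsGroup/fsGroup specified
```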
Alternatively, if you want to keep the default fsGroup: 1000 behavior and are certain your cluster allows it, you don’t need to specify anything - the operator defaults will work.
Shared Memory Configuration
Shared memory is enabled by default for all components:
- Enabled: `true` (unless explicitly disabled via `sharedMemory.disabled`)
- Size: `8Gi`
- Mount Path: `/dev/shm`
- Volume Type: `emptyDir` with `memory` medium
To disable shared memory or customize the size, use the sharedMemory field in your component specification.
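A hedged sketch, assuming a `size` sub-field alongside the documented `sharedMemory.disabled` flag (the `services` layout is likewise illustrative):

```yaml
# Hedged sketch: customizing shared memory per component.
# sharedMemory.disabled is referenced in this document; the size field name
# and the services layout are illustrative assumptions.
spec:
  services:
    VllmWorker:
      sharedMemory:
        size: 16Gi        # assumed field: enlarge /dev/shm beyond the 8Gi default
    Frontend:
      sharedMemory:
        disabled: true    # turn off the /dev/shm emptyDir volume entirely
```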
Health Probes by Component Type
The operator applies different default health probes based on the component type.
Frontend Components
Frontend components receive the following probe configurations:
Liveness Probe:
- Type: HTTP GET
- Path: `/health`
- Port: `http` (8000)
- Initial Delay: 60 seconds
- Period: 60 seconds
- Timeout: 30 seconds
- Failure Threshold: 10
Readiness Probe:
- Type: Exec command
- Command: `curl -s http://localhost:${DYNAMO_PORT}/health | jq -e ".status == \"healthy\""`
- Initial Delay: 60 seconds
- Period: 60 seconds
- Timeout: 30 seconds
- Failure Threshold: 10
Worker Components
Worker components receive the following probe configurations:
Liveness Probe:
- Type: HTTP GET
- Path: `/live`
- Port: `system` (9090)
- Period: 5 seconds
- Timeout: 30 seconds
- Failure Threshold: 1
Readiness Probe:
- Type: HTTP GET
- Path: `/health`
- Port: `system` (9090)
- Period: 10 seconds
- Timeout: 30 seconds
- Failure Threshold: 60
Startup Probe:
- Type: HTTP GET
- Path: `/live`
- Port: `system` (9090)
- Period: 10 seconds
- Timeout: 5 seconds
- Failure Threshold: 720 (allows up to 2 hours for startup: 10s × 720 = 7200s)
:::{note}
For larger models (typically >70B parameters) or slower storage systems, you may need to increase the failureThreshold to allow more time for model loading. Calculate the required threshold based on your expected startup time: failureThreshold = (expected_startup_seconds / period). Override the startup probe in your component specification if the default 2-hour window is insufficient.
:::
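For example, to allow roughly three hours for startup, the threshold works out to 10800 / 10 = 1080. A hedged sketch using the service-level `startupProbe` override mentioned in the Notes section (the service name and `services` layout are illustrative):

```yaml
# Hedged sketch: extending the startup window to ~3 hours.
# failureThreshold = expected_startup_seconds / period = 10800 / 10 = 1080
spec:
  services:
    VllmWorker:                 # hypothetical worker service name
      startupProbe:
        httpGet:
          path: /live
          port: system
        periodSeconds: 10
        timeoutSeconds: 5
        failureThreshold: 1080  # 10s x 1080 = 10800s (3 hours)
```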
Multinode Deployment Probe Modifications
For multinode deployments, the operator modifies probes based on the backend framework and node role:
VLLM Backend
The operator automatically selects between two deployment modes based on parallelism configuration:
Tensor/Pipeline Parallel Mode (when world_size > GPUs_per_node):
- Uses Ray for distributed execution (`--distributed-executor-backend ray`)
- Leader nodes: Starts Ray head and runs vLLM; all probes remain active
- Worker nodes: Run Ray agents only; all probes (liveness, readiness, startup) are removed
Data Parallel Mode (when world_size × data_parallel_size > GPUs_per_node):
- Worker nodes: All probes (liveness, readiness, startup) are removed
- Leader nodes: All probes remain active
SGLang Backend
- Worker nodes: All probes (liveness, readiness, startup) are removed
TensorRT-LLM Backend
- Leader nodes: All probes remain unchanged
- Worker nodes:
- Liveness and startup probes are removed
- Readiness probe is replaced with a TCP socket check on SSH port (2222):
- Initial Delay: 20 seconds
- Period: 20 seconds
- Timeout: 5 seconds
- Failure Threshold: 10
Environment Variables
The operator automatically injects environment variables based on component type and configuration:
All Components
- `DYN_NAMESPACE`: The Dynamo namespace for the component
- `DYN_PARENT_DGD_K8S_NAME`: The parent DynamoGraphDeployment Kubernetes resource name
- `DYN_PARENT_DGD_K8S_NAMESPACE`: The parent DynamoGraphDeployment Kubernetes namespace
Frontend Components
- `DYNAMO_PORT`: `8000`
- `DYN_HTTP_PORT`: `8000`
Worker Components
- `DYN_SYSTEM_PORT`: `9090` (automatically enables the system metrics server)
- `DYN_SYSTEM_USE_ENDPOINT_HEALTH_STATUS`: `["generate"]`
- `DYN_SYSTEM_ENABLED`: `true` (needed for runtime images 0.6.1 and older)
Planner Components
- `PLANNER_PROMETHEUS_PORT`: `9085`
VLLM Backend (with compilation cache)
When a volume mount is configured with `useAsCompilationCache: true`:
- `VLLM_CACHE_ROOT`: Set to the mount point of the cache volume
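A hedged sketch of such a volume mount; only `useAsCompilationCache: true` is taken from this document, while the other field names and the mount path are illustrative assumptions:

```yaml
# Hedged sketch: marking a mounted PVC as the vLLM compilation cache so the
# operator sets VLLM_CACHE_ROOT to its mount point. Field names other than
# useAsCompilationCache are illustrative assumptions.
spec:
  services:
    VllmWorker:
      volumeMounts:
        - name: compile-cache            # references a PVC defined at the top level
          mountPoint: /root/.cache/vllm  # assumed field name and path
          useAsCompilationCache: true    # operator exports VLLM_CACHE_ROOT here
```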
Service Account
Planner components automatically receive the following service account:
- `serviceAccountName`: `planner-serviceaccount`
Image Pull Secrets
The operator automatically discovers and injects image pull secrets for container images. When a component specifies a container image, the operator:
- Scans all Kubernetes secrets of type `kubernetes.io/dockerconfigjson` in the component’s namespace
- Extracts the docker registry server URLs from each secret’s authentication configuration
- Matches the container image’s registry host against the discovered registry URLs
- Automatically injects matching secrets as `imagePullSecrets` in the pod specification
This eliminates the need to manually specify image pull secrets for each component. The operator maintains an internal index of docker secrets and their associated registries, refreshing this index periodically.
To disable automatic image pull secret discovery for a specific component, add the following annotation:
Autoscaling Defaults
When autoscaling is enabled but no metrics are specified, the operator applies:
- Default Metric: CPU utilization
- Target Average Utilization: 80%
Port Configurations
Default container ports are configured based on component type:
Frontend Components
- Port: 8000
- Protocol: TCP
- Name: `http`
Worker Components
- Port: 9090
- Protocol: TCP
- Name: `system`
Planner Components
- Port: 9085
- Protocol: TCP
- Name: `metrics`
Backend-Specific Configurations
VLLM
- Ray Head Port: 6379 (for Ray cluster coordination in multinode TP/PP deployments)
- Data Parallel RPC Port: 13445 (for data parallel multinode deployments)
SGLang
- Distribution Init Port: 29500 (for multinode deployments)
TensorRT-LLM
- SSH Port: 2222 (for multinode MPI communication)
- OpenMPI Environment: `OMPI_MCA_orte_keep_fqdn_hostnames=1`
Implementation Reference
For users who want to understand the implementation details or contribute to the operator, the default values described in this document are set in the following source files:
- Health Probes, Security Context & Pod Specifications: `internal/dynamo/graph.go` - Contains the main logic for applying default probes, security context, environment variables, shared memory, and pod configurations
- Component-Specific Defaults:
- Image Pull Secrets: `internal/secrets/docker.go` - Implements the docker secret indexer and automatic discovery
- Backend-Specific Behavior:
- Constants & Annotations: `internal/consts/consts.go` - Defines annotation keys and other constants
Notes
- All these defaults can be overridden by explicitly specifying values in your DynamoComponentDeployment or DynamoGraphDeployment resources
- User-specified probes (via `livenessProbe`, `readinessProbe`, or `startupProbe` fields) take precedence over operator defaults
- For security context, if you provide any `securityContext` in `extraPodSpec`, no defaults will be injected, giving you full control
- For multinode deployments, some defaults are modified or removed as described above to accommodate distributed execution patterns
- The `extraPodSpec.mainContainer` field can be used to override probe configurations set by the operator (see the sketch below)
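A hedged sketch of such an override; only `extraPodSpec.mainContainer` and the standard Kubernetes probe fields come from this document, and the `services` layout is illustrative:

```yaml
# Hedged sketch: replacing the operator-default readiness probe via
# extraPodSpec.mainContainer. The services layout is illustrative.
spec:
  services:
    Frontend:
      extraPodSpec:
        mainContainer:
          readinessProbe:
            httpGet:
              path: /health
              port: http
            periodSeconds: 30
            failureThreshold: 5
```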