Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
10 changes: 10 additions & 0 deletions recipes/components/nvidia-dra-driver-gpu/values.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -66,6 +66,16 @@ resources:
# annotation can never drift from the actual chart pin. Recipes that
# disable either gpu-operator or nvidia-dra-driver-gpu leave the
# rendered values untouched.
#
# priorityClassName is explicitly neutralized (""), overriding the
# upstream chart's `system-node-critical` default. This lets the DRA
# driver install in clusters whose PriorityClass admission restricts
# `system-node-critical` to kube-system (PSA-restricted, ResourceQuota,
# or PriorityClassPolicy gates), which AICR cannot assume. The trade-off
# is that DRA pods can be evicted under node pressure; cluster operators
# who need them to survive eviction should re-pin via their own overlay.
# TODO(#1086): revisit whether a higher default (e.g., system-cluster-critical)
# is safe across all supported services.
Comment thread
coderabbitai[bot] marked this conversation as resolved.
controller:
priorityClassName: ""
kubeletPlugin:
Expand Down
12 changes: 11 additions & 1 deletion recipes/overlays/bcm-training.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -25,4 +25,14 @@ spec:
service: bcm
intent: training

componentRefs: []
# Enable GPUDirect Storage (GDS) for BCM training workloads. BCM-provisioned
# nodes typically ship NVIDIA-validated NVMe + ConnectX hardware where GDS
# delivers a meaningful training I/O perf win (most pronounced on H200 NVL
# given its 141GB HBM3e per device). On nodes without compatible hardware
# the nvidia-fs DaemonSet is benign — it logs a warning and stays inert.
componentRefs:
- name: gpu-operator
type: Helm
overrides:
gds:
enabled: true
23 changes: 23 additions & 0 deletions recipes/overlays/bcm.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -58,6 +58,21 @@ spec:

# Tolerate the BCM `master` label so control-plane workloads that
# already tolerate `control-plane` schedule on BCM masters as well.
# Mirrored onto kubeletPlugin so DRA's kubelet plugin DaemonSet still
# schedules on small BCM deployments that combine control-plane and
# worker roles on the same node.
#
# Note: AICR's bundler defaults to appending a blanket {operator: Exists}
# to both controller.tolerations and kubeletPlugin.tolerations via the
# nodeScheduling paths in recipes/registry.yaml (defaults sourced from
# pkg/snapshotter/agent.go DefaultTolerations). The registry maps
# controller.tolerations to the `system` scheduling scope and
# kubeletPlugin.tolerations to the `accelerated` scope. The specific
# entries below are belt-and-suspenders in default mode — the blanket
# subsumes them — and only have effect when a user drops the blanket via
# --system-node-toleration (controller) or --accelerated-node-toleration
# (kubeletPlugin), in which case BCM clusters still need `master`
# tolerated explicitly.
- name: nvidia-dra-driver-gpu
type: Helm
overrides:
Expand All @@ -69,6 +84,14 @@ spec:
- effect: NoSchedule
key: node-role.kubernetes.io/control-plane
operator: Exists
kubeletPlugin:
tolerations:
- effect: NoSchedule
key: node-role.kubernetes.io/master
operator: Exists
- effect: NoSchedule
key: node-role.kubernetes.io/control-plane
operator: Exists

validation:
conformance:
Expand Down
Loading