Openshift Container Platform (OCP) service by kaponco · Pull Request #1380 · NVIDIA/aicr

kaponco · 2026-06-16T10:42:28Z

Summary

Add OpenShift Container Platform (OCP) as a first-class service type, deploying operators via OLM using the
existing Helm bundler pipeline.

Motivation / Context

AICR currently supports EKS, GKE and more cluster providers. OCP customers need GPU-accelerated recipes but
OCP deploys operators through OLM (Operator Lifecycle Manager) rather than direct Helm installs. This PR
models each OCP operator as two in-tree Helm chart phases — an OLM installer (Subscription + OperatorGroup)
and a CR configuration phase — with Chainsaw readiness gates ensuring operators are available before CRs are
applied.

Fixes: N/A
Related: N/A

Type of Change

Bug fix (non-breaking change that fixes an issue)
New feature (non-breaking change that adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)
Documentation update
Refactoring (no functional changes)
Build/CI/tooling

Component(s) Affected

CLI (cmd/aicr, pkg/cli)
API server (cmd/aicrd, pkg/server)
Recipe engine / data (pkg/recipe)
Bundlers (pkg/bundler, pkg/component/*)
Collectors / snapshotter (pkg/collector, pkg/snapshotter)
Validator (pkg/validator)
Core libraries (pkg/errors, pkg/k8s)
Docs/examples (docs/, examples/)
Other: ____________

Implementation Notes

Only basic components are currently maintained, as agreed to secure the design. More will be added in next PRs
Cannot overlay ocp on top of other components as they use remote helm repos at the registry level
Template rendering for in-tree manifests is taken as is. The rendering engine resolves the template at bundle time. Therefor the $v. variables in the manifests.
Trying to have similar value files for OCP as the higher overlays where possible - so it will be easier to compare

Testing

# Commands run (prefer `make qualify` for non-trivial changes)
make qualify

Unit tests: All passed with race detector, coverage across packages ranging 64–100%
Lint: golangci-lint 0 issues, yamllint clean, license headers OK, AGENTS.md in sync, doc conventions pass, chart-version pins verified
E2E (chainsaw): 22/22 passed, 0 failed, 0 skipped
Vulnerability scan: 2 known low/medium findings in k8s.io/kubernetes v1.36.1 (pre-existing, not related to this change)

Everything is green — ready to commit.

Risk Assessment

Low — Isolated change, well-tested, easy to revert
Medium — Touches multiple components or has broader impact
High — Breaking change, affects critical paths, or complex rollout

Rollout notes:

Checklist

Tests pass locally (make test with -race)
Linter passes (make lint)
I did not skip/disable tests to make CI green
I added/updated tests for new functionality
I updated docs if user-facing behavior changed
Changes follow existing patterns in the codebase
Commits are cryptographically signed (git commit -S) — GPG signing info

…ree Helm charts (NVIDIA#566) Add OCP as a new service type with OLM-based operator deployment using the existing KindLocalHelm bundler pipeline. Each OCP operator is modeled as two in-tree Helm chart components: an OLM installer phase and a CR configuration phase, with readiness gates between them. - Add CriteriaServiceOCP constant with "openshift" alias to criteria parser - Register six OCP components (OLM + CR pairs for NFD and GPU Operator) - Create OCP overlays (base, training, inference) that disable base components not applicable to the platform - Add Helm-templated manifests for OLM resources (OperatorGroup, Subscription) and operator CRs (NodeFeatureDiscovery, ClusterPolicy) - Add Chainsaw readiness gates that assert CSV phase Succeeded before CR deployment proceeds - Update OpenAPI spec, CLI docs, API docs, and issue templates with ocp service value - Add exhaustive switch coverage for CriteriaServiceOCP in validators - Update ADR-010 design document with implementation decisions and overlay merge constraints Signed-off-by: Shai Kapon <skapon@redhat.com>

…k Operator to base overlay (NVIDIA#566) Improve OCP recipe reliability and coverage by strengthening readiness gates, consolidating networking into the base overlay, and adding NIM Operator support for inference workloads. - Add Deployment availability checks to all OLM readiness gates (GPU, Network, NFD) to prevent premature CR creation before operator pods are ready - Register k8s-nim-operator-ocp-olm component and add ocp-inference-nim overlay for NIM-based inference on OpenShift - Expand Network Operator NicClusterPolicy values with upgrade policy, health probes, RDMA shared device plugin config, and NV-IPAM settings - Correct GPU Operator namespace from nvidia-gpu-operator to gpu-operator across values, readiness gates, and registry - Update design doc and added openshift integration specification Signed-off-by: Shai Kapon <skapon@redhat.com>

Signed-off-by: Shai Kapon <skapon@redhat.com>

…IA#566) Restructure OCP Helm component values to use flat top-level keys instead of nesting everything under a wrapper. This aligns the values files with their upstream Helm chart counterparts for easier side-by-side comparison - Flatten gpu-operator-ocp, network-operator-ocp, and nfd-ocp values from nested structure to top-level keys matching upstream charts - Expand ClusterPolicy, NicClusterPolicy, and NodeFeatureDiscovery CR templates to explicitly render each field instead of - Add template function to pkg/manifest/render.go for Helm compatibility in CR templates - Add DCGM exporter ConfigMap manifest and comprehensive metrics config to gpu-operator-ocp (dcgm-exporter-configmap.yaml) - Remove values-training.yaml — gdrcopy and migManager are now enabled in the base values, eliminating the training-specific override file - Update ocp-training overlay to drop the componentRef that pointed at the removed values-training.yaml - Register dcgm-exporter-configmap.yaml in the ocp overlay manifestFiles Signed-off-by: Shai Kapon <skapon@redhat.com>

Signed-off-by: Shai Kapon <skapon@redhat.com>

copy-pr-bot · 2026-06-16T10:42:32Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2026-06-16T10:50:35Z

Caution

Review failed

An error occurred during the review process. Please try again later.

📝 Walkthrough

Walkthrough

This PR adds OpenShift Container Platform (OCP) as a supported service type across the codebase. A new CriteriaServiceOCP constant is introduced in pkg/recipe/criteria.go with ocp/openshift parsing aliases, propagated through the OpenAPI spec, CLI docs, and the NCCL validator scheduler. Six new Helm chart components are added under recipes/components for three operators (NFD, GPU Operator, Network Operator), each split into an OLM-phase chart (*-ocp-olm) and a CR-phase chart (*-ocp). Chainsaw readiness gate tests validate operator readiness in both recipes/components and recipes/checks. Three new recipe overlays (ocp, ocp-inference, ocp-training) and six registry entries wire the components together with explicit dependency ordering. A quote template function is added to helmFuncMap for Helm value rendering. Comprehensive design (ADR-010) and integrator documentation are included, along with minor license header updates.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

NVIDIA/aicr#1233: Modifies platformWorkerScheduling in the same NCCL constraint file to add new CriteriaService* handling, directly adjacent to the OCP scheduling case added in this PR.
NVIDIA/aicr#1262: Restructures platformWorkerScheduling return signature and service-switch behavior in the same file, including nil, nil return paths that this PR extends to cover CriteriaServiceOCP.

Suggested labels

area/recipes, area/api, area/docs, size/XL, theme/recipes

Suggested reviewers

xdu31
mchmarny

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name	Status	Explanation	Resolution
Title check	❓ Inconclusive	The title is vague and incomplete. It states 'Openshift Container Platform (OCP) service' but lacks a verb or clear action, making it unclear what change was made.	Revise the title to include a clear action verb and complete thought, such as 'Add OpenShift Container Platform (OCP) as a service type' or 'Support OpenShift Container Platform (OCP) operators via OLM'.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description check	✅ Passed	The description comprehensively explains the motivation, implementation approach, testing results, and design decisions. It directly addresses the changeset by describing OCP service support via OLM-based operator deployment.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/design/010-ocp-helm.md`:
- Line 71: Add language identifiers to all unlabeled fenced code blocks to
comply with markdownlint MD040. In docs/design/010-ocp-helm.md at lines 71-71,
add appropriate language tags (such as yaml, go, or text) after the opening
triple backticks for each fenced code block. Similarly, in
docs/integrator/openshift.md at lines 25-25, add language identifiers to all
unlabeled fenced code blocks. Each opening fence should specify the code
language (e.g., ```yaml instead of just ```).
- Line 5: The ADR status in the header line currently states "Proposed —
2026-06-07 (design-only; not implemented)" but this PR is shipping the actual
implementation. Update this status line to change "Proposed" to "Implemented"
(or similar status indicating the feature is complete), remove the
"(design-only; not implemented)" qualifier, and update the date to reflect when
the implementation was completed. This ensures the architecture documentation
accurately reflects the current state of the OCP support feature.

In `@pkg/recipe/doc.go`:
- Line 71: The doc block in pkg/recipe/doc.go has been updated with
CriteriaServiceOCP documentation at one location, but other service enumeration
lists within the same doc block (such as the initial service list and other
enumeration descriptions) still lack the corresponding ocp entry and
description. Update all service enumeration lists and descriptions throughout
the doc block to consistently include CriteriaServiceOCP with its "Red Hat
OpenShift Container Platform" description, ensuring the package documentation
remains internally consistent across all service listings.

In `@recipes/components/gpu-operator-ocp-olm/readiness.yaml`:
- Around line 34-50: The assertions in the readiness checks lack explicit
resource names in their metadata, causing them to match any
ClusterServiceVersion or Deployment in the gpu-operator namespace with the
matching status conditions. This can cause false positives if unrelated
resources satisfy the conditions first. Add metadata.name constraints to both
assertions: add a name field to the ClusterServiceVersion resource metadata
(around line 37) and add a name field to the Deployment resource metadata in the
operator-deployment-available test (around line 47) to ensure the assertions
target only the specific GPU operator resources, not any resource that happens
to match the status conditions.

In `@recipes/components/gpu-operator-ocp/values.yaml`:
- Around line 57-58: The dcgmExporter.config.data section in values.yaml
contains duplicate DCGM metric IDs that will cause initialization failure:
DCGM_FI_PROF_PCIE_TX_BYTES appears at both line 57 and line 96, and
DCGM_FI_PROF_PCIE_RX_BYTES appears at both line 58 and line 97. Remove the
duplicate definitions of these two metric IDs and keep only one canonical
definition per metric. Review the descriptions at both locations to determine
which version is correct (based on whether they are NVML-sourced or DCP metrics)
and retain only that single definition while removing the redundant entries.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: b6b26084-e126-4918-8535-29d8ac6987cb

📥 Commits

Reviewing files that changed from the base of the PR and between 12b86df and d774791.

📒 Files selected for processing (40)

.claude/skills/creating-slide-decks/skeleton.html
.github/ISSUE_TEMPLATE/bug_report.yml
api/aicr/v1/server.yaml
cmd/gate/chainsaw-config.yaml
docs/contributor/recipe.md
docs/design/010-ocp-helm.md
docs/integrator/openshift.md
docs/user/api-reference.md
docs/user/cli-reference.md
pkg/manifest/render.go
pkg/recipe/criteria.go
pkg/recipe/criteria_test.go
pkg/recipe/doc.go
recipes/checks/gpu-operator-ocp-olm/readiness.yaml
recipes/checks/network-operator-ocp-olm/readiness.yaml
recipes/checks/nfd-ocp-olm/readiness.yaml
recipes/components/gpu-operator-ocp-olm/manifests/operatorgroup.yaml
recipes/components/gpu-operator-ocp-olm/manifests/subscription.yaml
recipes/components/gpu-operator-ocp-olm/readiness.yaml
recipes/components/gpu-operator-ocp-olm/values.yaml
recipes/components/gpu-operator-ocp/manifests/clusterpolicy.yaml
recipes/components/gpu-operator-ocp/manifests/dcgm-exporter-configmap.yaml
recipes/components/gpu-operator-ocp/values.yaml
recipes/components/network-operator-ocp-olm/manifests/operatorgroup.yaml
recipes/components/network-operator-ocp-olm/manifests/subscription.yaml
recipes/components/network-operator-ocp-olm/readiness.yaml
recipes/components/network-operator-ocp-olm/values.yaml
recipes/components/network-operator-ocp/manifests/nicclusterpolicy.yaml
recipes/components/network-operator-ocp/values.yaml
recipes/components/nfd-ocp-olm/manifests/operatorgroup.yaml
recipes/components/nfd-ocp-olm/manifests/subscription.yaml
recipes/components/nfd-ocp-olm/readiness.yaml
recipes/components/nfd-ocp-olm/values.yaml
recipes/components/nfd-ocp/manifests/nodefeaturediscovery.yaml
recipes/components/nfd-ocp/values.yaml
recipes/overlays/ocp-inference.yaml
recipes/overlays/ocp-training.yaml
recipes/overlays/ocp.yaml
recipes/registry.yaml
validators/performance/nccl_all_reduce_bw_constraint.go

github-actions · 2026-06-16T10:53:19Z

@kaponco this PR now has merge conflicts with main. Please rebase to resolve them.

unsupported DCGM PCIe metrics Signed-off-by: Shai Kapon <skapon@redhat.com>

…duleType (NVIDIA#566) - Add OPERATOR_NAMESPACE env var to OLM Subscription config so the GPU Operator resolves its install namespace at runtime instead of falling back to a hard-coded value - Add driver.kernelModuleType field to ClusterPolicy template (defaultsto auto) - Add clarifying comment on operatorGroup targetNamespace choice Signed-off-by: Shai Kapon <skapon@redhat.com>

kaponco added 7 commits June 8, 2026 15:35

Adding ocp design file based on helm

e278339

update ocp design

981922a

feat(recipe): remove OCP NIM operator overlay and component (NVIDIA#566)

2aa3ee6

Signed-off-by: Shai Kapon <skapon@redhat.com>

Fix lint yaml issue (NVIDIA#566)

d774791

Signed-off-by: Shai Kapon <skapon@redhat.com>

kaponco requested review from a team as code owners June 16, 2026 10:42

github-actions Bot added area/recipes area/ci area/docs area/api size/XL labels Jun 16, 2026

coderabbitai Bot reviewed Jun 16, 2026

View reviewed changes

Comment thread docs/design/010-ocp-helm.md Outdated

Comment thread docs/design/010-ocp-helm.md Outdated

Comment thread pkg/recipe/doc.go

Comment thread recipes/components/gpu-operator-ocp-olm/readiness.yaml

Comment thread recipes/components/gpu-operator-ocp/values.yaml Outdated

github-actions Bot added the needs-rebase label Jun 16, 2026

mchmarny assigned kaponco Jun 16, 2026

kaponco added 2 commits June 16, 2026 15:30

feat(ocp): polish OCP docs, add ocp to service enums, and drop

53676ff

unsupported DCGM PCIe metrics Signed-off-by: Shai Kapon <skapon@redhat.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Openshift Container Platform (OCP) service #1380

Openshift Container Platform (OCP) service #1380
kaponco wants to merge 9 commits into
NVIDIA:mainfrom
kaponco:feat/ocp-helm

kaponco commented Jun 16, 2026

Uh oh!

copy-pr-bot Bot commented Jun 16, 2026

Uh oh!

coderabbitai Bot commented Jun 16, 2026 •

edited

Loading

Review failed

Walkthrough

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

❌ Failed checks (1 inconclusive)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented Jun 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

kaponco commented Jun 16, 2026

Summary

Motivation / Context

Type of Change

Component(s) Affected

Implementation Notes

Testing

Risk Assessment

Checklist

Uh oh!

copy-pr-bot Bot commented Jun 16, 2026

Uh oh!

coderabbitai Bot commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review failed

Walkthrough

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

❌ Failed checks (1 inconclusive)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented Jun 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

coderabbitai Bot commented Jun 16, 2026 •

edited

Loading