From 6dff4c4599306d91053d44891a7ceb188c543726 Mon Sep 17 00:00:00 2001
From: Yuan Chen
Date: Thu, 28 May 2026 10:31:16 -0700
Subject: [PATCH] feat(recipe): register h200 as first-class accelerator type

Adds CriteriaAcceleratorH200 to the criteria registry so users can pass
`--accelerator h200` and have the recipe metadata reflect the hardware.
H200 is the same Hopper generation as H100 (same R570/R580 driver line,
same gpu-operator support floor), so its deployment-phase floor mirrors
H100's.

Updates every surface that enumerates accelerator types per the
"Adding a new enum value" audit rule in .claude/CLAUDE.md:

- pkg/recipe/criteria.go: new const, parse case,
  AllCriteriaAcceleratorTypes
- pkg/recipe/criteria_test.go: parse + Get tests cover h200
- pkg/recipe/criteria_registry_{,parse_}test.go: swap h200 (now
  built-in) for mi300x as the extensibility-test example value
- api/aicr/v1/server.yaml: all 5 enum blocks
- .github/ISSUE_TEMPLATE/bug_report.yml: GPU type dropdown entries
- docs/{README.md,user/cli-reference.md,user/api-reference.md,
  contributor/api-server.md,contributor/cli.md,contributor/data.md,
  contributor/validations.md,contributor/api-server-extending.md}
- pkg/{api,cli/recipe.go,fingerprint/{doc.go,types.go},recipe/doc.go,
  server/doc.go}: godoc enumerations

Wires up the two consumer paths the enum alone left incomplete:

- pkg/fingerprint/gpu_sku.go: add the H200 ProductName pattern so the
  snapshot -> fingerprint -> recipe path resolves real H200 hardware
  ("NVIDIA H200 NVL", "...141GB HBM3e") to h200 instead of unknown-sku
- recipes/overlays/h200-any.yaml: new criteria-wildcard overlay
  mirroring h100-any.yaml (4 standard deployment checks + gpu-operator
  >= v24.6.0) so an `--accelerator h200` recipe inherits the same
  deployment-phase floor as H100 rather than landing on bare base
- .claude/skills/analyzing-snapshots/SKILL.md: add h200 to the
  model->accelerator mapping and valid-values tables

Validated against a real H200 NVL cluster: GFD / DRA
correctly identify the device as "NVIDIA H200 NVL", 141GB HBM3e, Hopper, compute 9.0. End-to-end: `aicr recipe --accelerator h200 --service bcm --os ubuntu --intent training` resolves the identical deployment floor as the h100 equivalent and produces a recipe with criteria.accelerator: h200. Addresses checkbox 4 of #1086 (the H200 registration item carved out from PR #1089). --- .claude/skills/analyzing-snapshots/SKILL.md | 4 +- .github/ISSUE_TEMPLATE/bug_report.yml | 4 +- api/aicr/v1/server.yaml | 10 ++-- docs/README.md | 2 +- docs/contributor/api-server-extending.md | 2 +- docs/contributor/api-server.md | 2 +- docs/contributor/cli.md | 2 +- docs/contributor/data.md | 4 +- docs/contributor/validations.md | 2 +- docs/user/api-reference.md | 4 +- docs/user/cli-reference.md | 2 +- pkg/api/doc.go | 2 +- pkg/cli/recipe.go | 2 +- pkg/fingerprint/doc.go | 2 +- pkg/fingerprint/gpu_sku.go | 7 +++ pkg/fingerprint/gpu_sku_test.go | 4 ++ pkg/fingerprint/types.go | 2 +- pkg/recipe/criteria.go | 7 ++- pkg/recipe/criteria_registry_parse_test.go | 18 +++---- pkg/recipe/criteria_registry_test.go | 4 +- pkg/recipe/criteria_test.go | 4 +- pkg/recipe/doc.go | 5 +- pkg/server/doc.go | 2 +- recipes/overlays/h200-any.yaml | 57 +++++++++++++++++++++ 24 files changed, 115 insertions(+), 39 deletions(-) create mode 100644 recipes/overlays/h200-any.yaml diff --git a/.claude/skills/analyzing-snapshots/SKILL.md b/.claude/skills/analyzing-snapshots/SKILL.md index 333f2f089..39e8ed905 100644 --- a/.claude/skills/analyzing-snapshots/SKILL.md +++ b/.claude/skills/analyzing-snapshots/SKILL.md @@ -96,6 +96,8 @@ Key fields from `GPU.smi`: | `gb300` | gb200 class (Blackwell NVL family) | | `b200` | b200 | | `h100` | h100 | +| `gh200` | unresolved — Grace Hopper Superchip, not the discrete H200 GPU (check before h200) | +| `h200` | h200 (discrete H200 GPU) | | `a100` | a100 | | `l40` | l40 | | `rtx pro 6000` | rtx-pro-6000 | @@ -272,7 +274,7 @@ aicr recipe \ | Criteria | Extracted From | Valid Values 
| |----------|---------------|--------------| | service | K8s.node.provider / K8s.server.version | eks, gke, aks, oke, kind, lke | -| accelerator | GPU.smi.gpu.model | h100, gb200, b200, a100, l40, rtx-pro-6000 | +| accelerator | GPU.smi.gpu.model | h100, h200, gb200, b200, a100, l40, rtx-pro-6000 | | os | OS.release.ID | ubuntu, rhel, cos, amazonlinux | | intent | User-specified | training, inference | | platform | User-specified | dynamo, kubeflow, nim, runai, slurm | diff --git a/.github/ISSUE_TEMPLATE/bug_report.yml b/.github/ISSUE_TEMPLATE/bug_report.yml index 49f864bab..296627864 100644 --- a/.github/ISSUE_TEMPLATE/bug_report.yml +++ b/.github/ISSUE_TEMPLATE/bug_report.yml @@ -117,7 +117,7 @@ body: - Kubernetes version: - OS (ubuntu/cos/other) + version: - Kernel version: - - GPU type (h100/gb200/b200/a100/l40/rtx-pro-6000/other): + - GPU type (h100/h200/gb200/b200/a100/l40/rtx-pro-6000/other): - Workload intent (training/inference): value: | - AICR version (CLI `aicr version`, API image tag, or commit SHA): @@ -126,7 +126,7 @@ body: - Kubernetes version: - OS (ubuntu/cos/other) + version: - Kernel version: - - GPU type (h100/gb200/b200/a100/l40/rtx-pro-6000/other): + - GPU type (h100/h200/gb200/b200/a100/l40/rtx-pro-6000/other): - Workload intent (training/inference): validations: required: true diff --git a/api/aicr/v1/server.yaml b/api/aicr/v1/server.yaml index 717878911..dce546c82 100644 --- a/api/aicr/v1/server.yaml +++ b/api/aicr/v1/server.yaml @@ -120,7 +120,7 @@ paths: description: GPU/accelerator type. If omitted, treated as "any" (wildcard). schema: type: string - enum: [h100, gb200, b200, a100, l40, rtx-pro-6000, any] + enum: [h100, h200, gb200, b200, a100, l40, rtx-pro-6000, any] default: any - name: gpu in: query @@ -128,7 +128,7 @@ paths: description: Alias for accelerator parameter (backwards compatibility). 
schema: type: string - enum: [h100, gb200, b200, a100, l40, rtx-pro-6000, any] + enum: [h100, h200, gb200, b200, a100, l40, rtx-pro-6000, any] default: any - name: intent in: query @@ -487,7 +487,7 @@ paths: description: GPU/accelerator type. If omitted, treated as "any" (wildcard). schema: type: string - enum: [h100, gb200, b200, a100, l40, rtx-pro-6000, any] + enum: [h100, h200, gb200, b200, a100, l40, rtx-pro-6000, any] default: any - name: gpu in: query @@ -495,7 +495,7 @@ paths: description: Alias for accelerator parameter (backwards compatibility). schema: type: string - enum: [h100, gb200, b200, a100, l40, rtx-pro-6000, any] + enum: [h100, h200, gb200, b200, a100, l40, rtx-pro-6000, any] default: any - name: intent in: query @@ -1262,7 +1262,7 @@ components: accelerator: type: string description: GPU/accelerator type - enum: [h100, gb200, b200, a100, l40, rtx-pro-6000, any] + enum: [h100, h200, gb200, b200, a100, l40, rtx-pro-6000, any] example: h100 intent: type: string diff --git a/docs/README.md b/docs/README.md index 08e710722..ca8b22ecf 100644 --- a/docs/README.md +++ b/docs/README.md @@ -8,7 +8,7 @@ NVIDIA AI Cluster Runtime (AICR) is a suite of tooling designed to automate the |------|-------------| | **Snapshot** | A captured state of a system including OS, kernel, Kubernetes, GPU, and SystemD configuration. Created by `aicr snapshot` or the Kubernetes agent. | | **Recipe** | A generated configuration recommendation containing component references, constraints, and deployment order. Created by `aicr recipe` based on criteria or snapshot analysis. | -| **Criteria** | Query parameters that define the target environment: `service` (eks/gke/aks/oke/kind/lke/bcm), `accelerator` (h100/gb200/b200/a100/l40/rtx-pro-6000), `intent` (training/inference), `os` (ubuntu/rhel/cos/amazonlinux/talos), `platform` (dynamo, kubeflow, nim, runai, slurm), and `nodes`. 
| +| **Criteria** | Query parameters that define the target environment: `service` (eks/gke/aks/oke/kind/lke/bcm), `accelerator` (h100/h200/gb200/b200/a100/l40/rtx-pro-6000), `intent` (training/inference), `os` (ubuntu/rhel/cos/amazonlinux/talos), `platform` (dynamo, kubeflow, nim, runai, slurm), and `nodes`. | | **Overlay** | A recipe metadata file that extends the base recipe for specific environments. Overlays are matched against criteria using asymmetric matching. | | **Mixin** | A composable recipe fragment (`kind: RecipeMixin`) that carries only `constraints` and `componentRefs`. Mixins live in `recipes/mixins/`, are excluded from overlay discovery, and are referenced by leaf overlays via `spec.mixins` to share orthogonal content (e.g., OS constraints, platform components) without duplication. See [ADR-005](design/005-overlay-refactoring.md). | | **Bundle** | Deployment artifacts generated from a recipe: Helm values files, Kubernetes manifests, installation scripts, and checksums. 
| diff --git a/docs/contributor/api-server-extending.md b/docs/contributor/api-server-extending.md index d12fa9e2e..52f497646 100644 --- a/docs/contributor/api-server-extending.md +++ b/docs/contributor/api-server-extending.md @@ -1087,7 +1087,7 @@ var validate = validator.New() type RecipeRequest struct { OS string `validate:"required,oneof=ubuntu rhel cos"` OSVersion string `validate:"omitempty,semver"` - GPU string `validate:"required,oneof=h100 gb200 b200 a100 l40 rtx-pro-6000"` + GPU string `validate:"required,oneof=h100 h200 gb200 b200 a100 l40 rtx-pro-6000"` Service string `validate:"omitempty,oneof=eks gke aks oke kind lke bcm"` } diff --git a/docs/contributor/api-server.md b/docs/contributor/api-server.md index b7dce3ee1..0e38be4e8 100644 --- a/docs/contributor/api-server.md +++ b/docs/contributor/api-server.md @@ -263,7 +263,7 @@ Supported content types: | Parameter | Type | Validation | Example | |-----------|------|------------|--------| | `service` | ServiceType | Enum: eks, gke, aks, oke, kind, lke, bcm, any | `service=eks` | -| `accelerator` | AcceleratorType | Enum: h100, gb200, b200, a100, l40, rtx-pro-6000, any | `accelerator=h100` | +| `accelerator` | AcceleratorType | Enum: h100, h200, gb200, b200, a100, l40, rtx-pro-6000, any | `accelerator=h100` | | `gpu` | AcceleratorType | Alias for accelerator | `gpu=h100` | | `intent` | IntentType | Enum: training, inference, any | `intent=training` | | `os` | OSType | Enum: ubuntu, rhel, cos, amazonlinux, talos, any | `os=ubuntu` | diff --git a/docs/contributor/cli.md b/docs/contributor/cli.md index dcd8c0dab..180f251fa 100644 --- a/docs/contributor/cli.md +++ b/docs/contributor/cli.md @@ -1700,7 +1700,7 @@ OUTPUT_DIR="recipes" mkdir -p "$OUTPUT_DIR" # GPU types from NVIDIA product line -GPU_TYPES=("h100" "gb200" "b200" "a100" "l40" "rtx-pro-6000") +GPU_TYPES=("h100" "h200" "gb200" "b200" "a100" "l40" "rtx-pro-6000") # Kubernetes services K8S_SERVICES=("eks" "gke" "aks" "oke" "kind" "lke" "bcm") diff 
--git a/docs/contributor/data.md b/docs/contributor/data.md index ec9d9b1c2..3849d6f1b 100644 --- a/docs/contributor/data.md +++ b/docs/contributor/data.md @@ -124,7 +124,7 @@ Criteria define when a recipe matches a user query: | Field | Type | Description | Example Values | |-------|------|-------------|----------------| | `service` | String | Kubernetes platform | `eks`, `gke`, `aks`, `oke`, `kind`, `lke`, `bcm` | -| `accelerator` | String | GPU hardware type | `h100`, `gb200`, `b200`, `a100`, `l40`, `rtx-pro-6000` | +| `accelerator` | String | GPU hardware type | `h100`, `h200`, `gb200`, `b200`, `a100`, `l40`, `rtx-pro-6000` | | `os` | String | Operating system | `ubuntu`, `rhel`, `cos`, `amazonlinux` | | `intent` | String | Workload purpose | `training`, `inference` | | `platform` | String | Platform/framework type | `dynamo`, `kubeflow`, `nim`, `runai`, `slurm` | @@ -801,7 +801,7 @@ fingerprint: value: eks # eks | gke | aks | oke | kind | lke | bcm source: k8s.node.provider accelerator: - value: h100 # h100 | gb200 | b200 | a100 | l40 | rtx-pro-6000 + value: h100 # h100 | h200 | gb200 | b200 | a100 | l40 | rtx-pro-6000 source: gpu.smi.gpu.model os: value: ubuntu # ubuntu | rhel | cos | amazonlinux | talos diff --git a/docs/contributor/validations.md b/docs/contributor/validations.md index 1d841ab2f..af0bb7918 100644 --- a/docs/contributor/validations.md +++ b/docs/contributor/validations.md @@ -94,7 +94,7 @@ conditions: **Supported Condition Keys:** - `intent`: Workload intent (training, inference) - `service`: Kubernetes service (eks, gke, aks, oke, kind, lke, bcm) -- `accelerator`: GPU type (h100, gb200, b200, a100, l40, rtx-pro-6000) +- `accelerator`: GPU type (h100, h200, gb200, b200, a100, l40, rtx-pro-6000) - `os`: Operating system (ubuntu, rhel, cos, amazonlinux, talos) - `platform`: Platform/framework (dynamo, kubeflow, nim, runai, slurm) diff --git a/docs/user/api-reference.md b/docs/user/api-reference.md index 60732ddeb..ea4d03fe4 100644 --- 
a/docs/user/api-reference.md +++ b/docs/user/api-reference.md @@ -117,7 +117,7 @@ Generate an optimized configuration recipe based on environment parameters. | Parameter | Type | Default | Description | |-----------|------|---------|-------------| | `service` | string | any | K8s service: `eks`, `gke`, `aks`, `oke`, `kind`, `lke`, `bcm`, `any` | -| `accelerator` | string | any | GPU type: `h100`, `gb200`, `b200`, `a100`, `l40`, `rtx-pro-6000`, `any` | +| `accelerator` | string | any | GPU type: `h100`, `h200`, `gb200`, `b200`, `a100`, `l40`, `rtx-pro-6000`, `any` | | `gpu` | string | any | Alias for `accelerator` | | `intent` | string | any | Workload: `training`, `inference`, `any` | | `os` | string | any | Node OS: `ubuntu`, `rhel`, `cos`, `amazonlinux`, `talos`, `any` | @@ -821,7 +821,7 @@ openapi-generator-cli generate -i openapi.yaml -g typescript-fetch -o ./ts-clien **"Invalid accelerator type" error:** ```shell -# Use valid values: h100, gb200, b200, a100, l40, rtx-pro-6000, any +# Use valid values: h100, h200, gb200, b200, a100, l40, rtx-pro-6000, any curl "http://localhost:8080/v1/recipe?accelerator=h100" ``` diff --git a/docs/user/cli-reference.md b/docs/user/cli-reference.md index a681f8de7..0639cf535 100644 --- a/docs/user/cli-reference.md +++ b/docs/user/cli-reference.md @@ -379,7 +379,7 @@ Generate recipes using direct system parameters: | Flag | Short | Type | Description | |------|-------|------|-------------| | `--service` | | string | K8s service: eks, gke, aks, oke, kind, lke, bcm | -| `--accelerator` | `--gpu` | string | Accelerator/GPU type: h100, gb200, b200, a100, l40, rtx-pro-6000 | +| `--accelerator` | `--gpu` | string | Accelerator/GPU type: h100, h200, gb200, b200, a100, l40, rtx-pro-6000 | | `--intent` | | string | Workload intent: training, inference | | `--os` | | string | OS family: ubuntu, rhel, cos, amazonlinux, talos | | `--platform` | | string | Platform/framework type: dynamo, kubeflow, nim, runai, slurm | diff --git 
a/pkg/api/doc.go b/pkg/api/doc.go index fd6afe14e..30e3c7847 100644 --- a/pkg/api/doc.go +++ b/pkg/api/doc.go @@ -65,7 +65,7 @@ // // The /v1/recipe endpoint accepts these query parameters for GET requests: // - service: Kubernetes service (eks, gke, aks, oke, kind, lke, bcm, any) -// - accelerator: GPU type (h100, gb200, b200, a100, l40, rtx-pro-6000, any) +// - accelerator: GPU type (h100, h200, gb200, b200, a100, l40, rtx-pro-6000, any) // - gpu: Alias for accelerator (back-compat) // - intent: Workload intent (training, inference, any) // - os: Operating system (ubuntu, rhel, cos, amazonlinux, talos, any) diff --git a/pkg/cli/recipe.go b/pkg/cli/recipe.go index e0bba6651..bc2aae5f6 100644 --- a/pkg/cli/recipe.go +++ b/pkg/cli/recipe.go @@ -100,7 +100,7 @@ func recipeCmd() *cli.Command { Usage: "Create optimized recipe for given intent and environment parameters.", Description: `Generate configuration recipe based on specified environment parameters including: - Kubernetes service type (e.g. eks, gke, aks, oke, kind, lke, bcm) - - Accelerator type (e.g. h100, gb200, b200, a100, l40, rtx-pro-6000) + - Accelerator type (e.g. h100, h200, gb200, b200, a100, l40, rtx-pro-6000) - Workload intent (e.g. training, inference) - GPU node operating system (e.g. ubuntu, rhel, cos, amazonlinux, talos) - Number of GPU nodes in the cluster diff --git a/pkg/fingerprint/doc.go b/pkg/fingerprint/doc.go index be7e4e5ed..41991d2eb 100644 --- a/pkg/fingerprint/doc.go +++ b/pkg/fingerprint/doc.go @@ -18,7 +18,7 @@ // // A Fingerprint records the cluster-identity dimensions used to bind // a snapshot to a recipe — service (eks/gke/aks/oke/kind/lke/bcm), -// accelerator (h100/gb200/b200/a100/l40/rtx-pro-6000), OS +// accelerator (h100/h200/gb200/b200/a100/l40/rtx-pro-6000), OS // (ubuntu/rhel/cos/amazonlinux/talos plus raw VERSION_ID), Kubernetes // server version, region, total node count, and GPU node count. 
Each // dimension records the resolved value plus an optional source string diff --git a/pkg/fingerprint/gpu_sku.go b/pkg/fingerprint/gpu_sku.go index 46ecdd74d..64cb68b67 100644 --- a/pkg/fingerprint/gpu_sku.go +++ b/pkg/fingerprint/gpu_sku.go @@ -28,6 +28,13 @@ var gpuSKURegistry = []struct { {"GB200", "gb200"}, {"B200", "b200"}, {"H100", "h100"}, + // GH200 (Grace Hopper Superchip) contains the substring "H200" but is a + // distinct Grace-Hopper SKU, not the discrete H200 GPU. It has no + // first-class accelerator enum, so match it explicitly before the "H200" + // rule and leave it unresolved ("") rather than mislabeling it as h200 — + // same reason "GB200" precedes "B200" above. + {"GH200", ""}, + {"H200", "h200"}, {"A100", "a100"}, {"RTX PRO 6000", "rtx-pro-6000"}, {"L40", "l40"}, diff --git a/pkg/fingerprint/gpu_sku_test.go b/pkg/fingerprint/gpu_sku_test.go index 07408f9d6..2dea8e5a6 100644 --- a/pkg/fingerprint/gpu_sku_test.go +++ b/pkg/fingerprint/gpu_sku_test.go @@ -24,6 +24,10 @@ func TestParseGPUSKU(t *testing.T) { }{ {"H100 80GB HBM3", "NVIDIA H100 80GB HBM3", "h100"}, {"H100 PCIe", "NVIDIA H100 PCIe", "h100"}, + {"H200 NVL", "NVIDIA H200 NVL", "h200"}, + {"H200 141GB HBM3e", "NVIDIA H200 141GB HBM3e", "h200"}, + {"GH200 not matched as h200", "NVIDIA GH200 480GB", ""}, + {"GH200 Grace Hopper", "NVIDIA GH200", ""}, {"GB200 NVL72", "NVIDIA GB200", "gb200"}, {"GB200 wins over B200 substring", "NVIDIA GB200 NVL72", "gb200"}, {"B200", "NVIDIA B200", "b200"}, diff --git a/pkg/fingerprint/types.go b/pkg/fingerprint/types.go index e0b10e651..52fdef389 100644 --- a/pkg/fingerprint/types.go +++ b/pkg/fingerprint/types.go @@ -70,7 +70,7 @@ type Fingerprint struct { // k8s.node.provider (parsed from spec.providerID). Service Dimension `json:"service" yaml:"service"` - // Accelerator is the GPU SKU (h100, gb200, b200, a100, l40, + // Accelerator is the GPU SKU (h100, h200, gb200, b200, a100, l40, // rtx-pro-6000). Parsed from gpu.smi.gpu.model. 
Accelerator Dimension `json:"accelerator" yaml:"accelerator"` diff --git a/pkg/recipe/criteria.go b/pkg/recipe/criteria.go index 247f2a210..1c4cdf38a 100644 --- a/pkg/recipe/criteria.go +++ b/pkg/recipe/criteria.go @@ -114,6 +114,7 @@ type CriteriaAcceleratorType string const ( CriteriaAcceleratorAny CriteriaAcceleratorType = "any" CriteriaAcceleratorH100 CriteriaAcceleratorType = "h100" + CriteriaAcceleratorH200 CriteriaAcceleratorType = "h200" CriteriaAcceleratorGB200 CriteriaAcceleratorType = "gb200" CriteriaAcceleratorB200 CriteriaAcceleratorType = "b200" CriteriaAcceleratorA100 CriteriaAcceleratorType = "a100" @@ -130,6 +131,8 @@ func ParseCriteriaAcceleratorType(s string) (CriteriaAcceleratorType, error) { return CriteriaAcceleratorAny, nil case "h100": return CriteriaAcceleratorH100, nil + case "h200": + return CriteriaAcceleratorH200, nil case "gb200": return CriteriaAcceleratorGB200, nil case "b200": @@ -152,7 +155,7 @@ func ParseCriteriaAcceleratorType(s string) (CriteriaAcceleratorType, error) { // types sorted alphabetically. For the union of static + registry, use // AllCriteriaAcceleratorTypes. func GetCriteriaAcceleratorTypes() []string { - return []string{"a100", "b200", "gb200", "h100", "l40", "rtx-pro-6000"} + return []string{"a100", "b200", "gb200", "h100", "h200", "l40", "rtx-pro-6000"} } // AllCriteriaAcceleratorTypes returns the union of the static OSS list @@ -347,7 +350,7 @@ type Criteria struct { // Service is the Kubernetes service type (eks, gke, aks, oke, kind, lke, bcm). Service CriteriaServiceType `json:"service,omitempty" yaml:"service,omitempty"` - // Accelerator is the GPU/accelerator type (h100, gb200, b200, a100, l40, rtx-pro-6000). + // Accelerator is the GPU/accelerator type (h100, h200, gb200, b200, a100, l40, rtx-pro-6000). Accelerator CriteriaAcceleratorType `json:"accelerator,omitempty" yaml:"accelerator,omitempty"` // Intent is the workload intent (training, inference). 
diff --git a/pkg/recipe/criteria_registry_parse_test.go b/pkg/recipe/criteria_registry_parse_test.go index 85c3df18c..266e0c7a8 100644 --- a/pkg/recipe/criteria_registry_parse_test.go +++ b/pkg/recipe/criteria_registry_parse_test.go @@ -102,16 +102,16 @@ func TestParseCriteriaServiceType_StrictAcceptsEmbedded(t *testing.T) { func TestParseCriteriaAcceleratorType_RegistryFallback(t *testing.T) { withRegistry(t, func() { - if _, err := ParseCriteriaAcceleratorType("h200"); err == nil { + if _, err := ParseCriteriaAcceleratorType("mi300x"); err == nil { t.Fatal("expected error before registration") } - DefaultRegistry().Register(FieldAccelerator, "h200", OriginExternal) - got, err := ParseCriteriaAcceleratorType("h200") + DefaultRegistry().Register(FieldAccelerator, "mi300x", OriginExternal) + got, err := ParseCriteriaAcceleratorType("mi300x") if err != nil { t.Fatalf("error = %v", err) } - if got != CriteriaAcceleratorType("h200") { - t.Errorf("got = %q, want h200", got) + if got != CriteriaAcceleratorType("mi300x") { + t.Errorf("got = %q, want mi300x", got) } }) } @@ -181,7 +181,7 @@ func TestAllCriteriaServiceTypes_UnionWithRegistry(t *testing.T) { func TestAllCriteriaTypes_AllDimensions(t *testing.T) { withRegistry(t, func() { - DefaultRegistry().Register(FieldAccelerator, "h200", OriginExternal) + DefaultRegistry().Register(FieldAccelerator, "mi300x", OriginExternal) DefaultRegistry().Register(FieldIntent, "fine-tuning", OriginExternal) DefaultRegistry().Register(FieldOS, "bottlerocket", OriginExternal) DefaultRegistry().Register(FieldPlatform, "nvmesh", OriginExternal) @@ -192,7 +192,7 @@ func TestAllCriteriaTypes_AllDimensions(t *testing.T) { t.Errorf("AllCriteria%sTypes() missing %q; got %v", field, want, got) } } - assertContains("Accelerator", AllCriteriaAcceleratorTypes(), "h200") + assertContains("Accelerator", AllCriteriaAcceleratorTypes(), "mi300x") assertContains("Intent", AllCriteriaIntentTypes(), "fine-tuning") assertContains("OS", 
AllCriteriaOSTypes(), "bottlerocket") assertContains("Platform", AllCriteriaPlatformTypes(), "nvmesh") @@ -212,11 +212,11 @@ func TestMergeCriteriaTypes_NoMutationOfInput(t *testing.T) { func TestCriteriaValidate_AdmitsRegisteredValues(t *testing.T) { withRegistry(t, func() { DefaultRegistry().Register(FieldService, "ncp-customer-x", OriginExternal) - DefaultRegistry().Register(FieldAccelerator, "h200", OriginExternal) + DefaultRegistry().Register(FieldAccelerator, "mi300x", OriginExternal) c := &Criteria{ Service: "ncp-customer-x", - Accelerator: "h200", + Accelerator: "mi300x", Intent: CriteriaIntentTraining, OS: CriteriaOSUbuntu, } diff --git a/pkg/recipe/criteria_registry_test.go b/pkg/recipe/criteria_registry_test.go index fcf1346e7..604002361 100644 --- a/pkg/recipe/criteria_registry_test.go +++ b/pkg/recipe/criteria_registry_test.go @@ -314,7 +314,7 @@ func TestSeedCriteriaRegistry(t *testing.T) { c := &Criteria{ Service: "ncp-x", - Accelerator: "h200", + Accelerator: "mi300x", Intent: "fine-tuning", OS: "bottlerocket", Platform: "nvmesh", @@ -326,7 +326,7 @@ func TestSeedCriteriaRegistry(t *testing.T) { value string }{ {FieldService, "ncp-x"}, - {FieldAccelerator, "h200"}, + {FieldAccelerator, "mi300x"}, {FieldIntent, "fine-tuning"}, {FieldOS, "bottlerocket"}, {FieldPlatform, "nvmesh"}, diff --git a/pkg/recipe/criteria_test.go b/pkg/recipe/criteria_test.go index 21e7bfc35..12fb20630 100644 --- a/pkg/recipe/criteria_test.go +++ b/pkg/recipe/criteria_test.go @@ -72,6 +72,8 @@ func TestParseCriteriaAcceleratorType(t *testing.T) { {"any", "any", CriteriaAcceleratorAny, false}, {"h100", "h100", CriteriaAcceleratorH100, false}, {"H100 uppercase", "H100", CriteriaAcceleratorH100, false}, + {"h200", "h200", CriteriaAcceleratorH200, false}, + {"H200 uppercase", "H200", CriteriaAcceleratorH200, false}, {"gb200", "gb200", CriteriaAcceleratorGB200, false}, {"b200", "b200", CriteriaAcceleratorB200, false}, {"a100", "a100", CriteriaAcceleratorA100, false}, @@ -725,7 
+727,7 @@ func TestGetCriteriaAcceleratorTypes(t *testing.T) { types := GetCriteriaAcceleratorTypes() // Should return sorted list - expected := []string{"a100", "b200", "gb200", "h100", "l40", "rtx-pro-6000"} + expected := []string{"a100", "b200", "gb200", "h100", "h200", "l40", "rtx-pro-6000"} if len(types) != len(expected) { t.Errorf("GetCriteriaAcceleratorTypes() returned %d types, want %d", len(types), len(expected)) } diff --git a/pkg/recipe/doc.go b/pkg/recipe/doc.go index 3eb341d8d..c9cda80bb 100644 --- a/pkg/recipe/doc.go +++ b/pkg/recipe/doc.go @@ -26,7 +26,7 @@ // // type Criteria struct { // Service CriteriaServiceType // eks, gke, aks, oke, kind, lke, bcm, any -// Accelerator CriteriaAcceleratorType // h100, gb200, b200, a100, l40, rtx-pro-6000, any +// Accelerator CriteriaAcceleratorType // h100, h200, gb200, b200, a100, l40, rtx-pro-6000, any // Intent CriteriaIntentType // training, inference, any // OS CriteriaOSType // ubuntu, rhel, cos, amazonlinux, talos, any // Platform CriteriaPlatformType // dynamo, kubeflow, nim, runai, slurm, any @@ -72,6 +72,7 @@ // // Accelerator types for GPU selection: // - CriteriaAcceleratorH100: NVIDIA H100 +// - CriteriaAcceleratorH200: NVIDIA H200 // - CriteriaAcceleratorGB200: NVIDIA GB200 // - CriteriaAcceleratorB200: NVIDIA B200 // - CriteriaAcceleratorA100: NVIDIA A100 @@ -176,7 +177,7 @@ // // The HTTP handler accepts these query parameters for GET requests: // - service: eks, gke, aks, oke, kind, lke, bcm, any (default: any) -// - accelerator: h100, gb200, b200, a100, l40, rtx-pro-6000, any (default: any) +// - accelerator: h100, h200, gb200, b200, a100, l40, rtx-pro-6000, any (default: any) // - gpu: alias for accelerator (backwards compatibility) // - intent: training, inference, any (default: any) // - os: ubuntu, rhel, cos, amazonlinux, talos, any (default: any) diff --git a/pkg/server/doc.go b/pkg/server/doc.go index 6fb4411a3..a0ac2b777 100644 --- a/pkg/server/doc.go +++ b/pkg/server/doc.go @@ -71,7 
+71,7 @@ // - kernel: kernel version (e.g., 6.8, 5.15.0) // - service: eks, gke, aks, oke, kind, lke, bcm, any (default: any) // - k8s: Kubernetes version (e.g., 1.33, 1.32) -// - gpu: h100, gb200, b200, a100, l40, rtx-pro-6000, any (default: any) +// - gpu: h100, h200, gb200, b200, a100, l40, rtx-pro-6000, any (default: any) // - intent: training, inference, any (default: any) // - context: true/false - include context metadata (default: false) // diff --git a/recipes/overlays/h200-any.yaml b/recipes/overlays/h200-any.yaml new file mode 100644 index 000000000..8f609413e --- /dev/null +++ b/recipes/overlays/h200-any.yaml @@ -0,0 +1,57 @@ +# Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# Cross-cutting overlay applied via criteria-wildcard matching. +# +# This overlay is NOT referenced by any recipe via spec.base or spec.mixins. +# The resolver picks it up for every H200 query (any service, any intent) +# because its `service: any` criterion wildcard-matches any service. It +# contributes the H200-wide deployment-phase floor — the 4 standard checks +# plus the gpu-operator version pin — without duplicating it in every +# service+intent leaf. +# +# H200 is the same Hopper generation as H100 (same R570/R580 driver line, +# same gpu-operator support floor), so this mirrors h100-any.yaml. Splitting +# only becomes necessary if an H200-specific floor delta emerges. 
+# +# Per-phase replace semantics in pkg/recipe/metadata.go mean concrete leaves +# that declare their own `deployment:` block continue to win — their +# constraint values match the floor here so the override is a no-op. +# +# See docs/contributor/data.md#criteria-wildcard-overlays for details. + +kind: RecipeMetadata +apiVersion: aicr.nvidia.com/v1alpha1 +metadata: + name: h200-any + +spec: + base: base + + criteria: + service: any + accelerator: h200 + + validation: + deployment: + checks: + - operator-health + - expected-resources + - gpu-operator-version + - check-nvidia-smi + constraints: + # H200 (Hopper) shares H100's stable gpu-operator support from + # v24.6.0 forward. Set as the floor; concrete leaves can tighten. + - name: Deployment.gpu-operator.version + value: ">= v24.6.0"
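
Reviewer note on the pkg/fingerprint/gpu_sku.go hunk: the GH200 entry only works because the registry is scanned in order and the first matching substring wins. The sketch below illustrates that ordering rule in isolation. It is not the code in gpu_sku.go: `skuRule`, `rules`, and `resolveSKU` are hypothetical names, and the case-insensitive comparison is an assumption made for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// skuRule pairs a ProductName substring with the accelerator value it
// resolves to. An empty accelerator means "recognized, but deliberately
// left unresolved". (Hypothetical type, for illustration only.)
type skuRule struct {
	pattern     string
	accelerator string
}

// rules is order-sensitive: a longer pattern must precede any shorter
// pattern it contains ("GB200" before "B200", "GH200" before "H200").
var rules = []skuRule{
	{"GB200", "gb200"},
	{"B200", "b200"},
	{"H100", "h100"},
	{"GH200", ""}, // Grace Hopper Superchip: no first-class enum value
	{"H200", "h200"},
}

// resolveSKU returns the accelerator for the first rule whose pattern
// occurs in productName, or "" when nothing matches. Upper-casing the
// input is an assumption of this sketch.
func resolveSKU(productName string) string {
	upper := strings.ToUpper(productName)
	for _, r := range rules {
		if strings.Contains(upper, r.pattern) {
			return r.accelerator
		}
	}
	return ""
}

func main() {
	names := []string{"NVIDIA H200 NVL", "NVIDIA GH200 480GB", "NVIDIA GB200 NVL72"}
	for _, name := range names {
		fmt.Printf("%q -> %q\n", name, resolveSKU(name))
	}
}
```

If the two Hopper entries were swapped, "NVIDIA GH200 480GB" would hit the "H200" rule first and be labeled h200, which is the mislabeling the new test cases in gpu_sku_test.go guard against.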