From 52fb835eed6cac596f2263a5dfc8f017429c73b2 Mon Sep 17 00:00:00 2001
From: Yuan Chen
Date: Tue, 26 May 2026 15:10:21 -0700
Subject: [PATCH] fix(kwok): drop recipe suffix from argocd-helm-oci in-cluster repoURL
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The argocd-helm-oci wrapper script was passing the FULL bundle URL to
`helm install --set repoURL=…` (including the per-recipe chart name at
the end). That matched the pre-PR-#1032 contract where the parent
Application's `source.chart` was hardcoded to `aicr-bundle`.

PR #1032 (and #1035's reinforcement) changed the parent App template to
expect the parent-namespace-only repoURL and to append .Chart.Name
itself via the separate `source.chart` field. The wrapper script wasn't
updated to match.

Result on every PR with argocd-helm-oci Tier-1 KWOK coverage: the parent
App resolves to
`oci://registry.aicr-registry.svc.cluster.local:5000/aicr//:`, the OCI
artifact lookup 404s, gpu-operator-post's Application can never sync,
and the whole stack times out on `GitOps sync timeout strike 1/3`.

The failure was masked on `main` because the most-recent KWOK Cluster
Validation run on `main` (#26469449378 at 0d3e62df, success) ran
*before* PR #1035 merged. After #1035 / #1036 / #1038 all landed on
main, no fresh KWOK run has triggered on `main` yet, but the next one
will fail the same way every open PR's argocd-helm-oci Tier-1 jobs are
currently failing.

Fix is a one-line drop of the per-recipe suffix from OCI_IN_CLUSTER_REF
in the argocd-helm-oci branch of generate_bundle. The flux branch keeps
the per-recipe suffix because flux's OCIRepository CR consumes the FULL
artifact URL (recipe segment included). Updated the surrounding comment
to point at the post-#1032 contract so the next reader understands the
asymmetry.

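The double-append described above can be sketched in shell. This is an
illustration under assumptions, not Argo CD's actual resolution code:
`compose_chart_ref` is a hypothetical helper that models the resolved
reference as repoURL, then the chart name, then the tag, and the
`recipe` and `tag` values are placeholders.

```shell
#!/usr/bin/env bash
# Illustrative sketch only. compose_chart_ref is a hypothetical helper that
# models "repoURL + / + chart name + : + tag"; it is not taken from Argo CD.
set -euo pipefail

compose_chart_ref() {
    local repo_url="$1" chart="$2" tag="$3"
    printf '%s/%s:%s\n' "$repo_url" "$chart" "$tag"
}

registry="oci://registry.aicr-registry.svc.cluster.local:5000/aicr"
recipe="example-recipe"   # placeholder recipe (chart) name
tag="v0.0.0"              # placeholder tag

# Where the artifact is assumed to be pushed: parent namespace/recipe:tag.
pushed="${registry}/${recipe}:${tag}"

# Before the fix: repoURL already ends in the recipe, so the chart name
# is appended a second time.
before="$(compose_chart_ref "${registry}/${recipe}" "$recipe" "$tag")"

# After the fix: repoURL is the parent namespace only.
after="$(compose_chart_ref "$registry" "$recipe" "$tag")"

echo "pushed artifact: $pushed"
echo "before the fix:  $before"
echo "after the fix:   $after"
```

Under these assumptions only the post-fix reference matches the pushed
artifact; the pre-fix one carries the recipe segment twice.
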
End-to-end check (verified from PR #1030's debug artifact at b3f22964):
repo-server log shows
`registry.aicr-registry.svc.cluster.local:5000/aicr//:: not found`,
caused by the same double-append. With the recipe suffix dropped, Argo's
resolution `/:` aligns with the pushed artifact at `oci://…/aicr/:`.

Refs PR #1030 (where this surfaced), PR #1032 (contract change),
PR #1035 (parent App template enforcement).
---
 kwok/scripts/validate-scheduling.sh | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/kwok/scripts/validate-scheduling.sh b/kwok/scripts/validate-scheduling.sh
index b98d8b6ca..0967f58d9 100755
--- a/kwok/scripts/validate-scheduling.sh
+++ b/kwok/scripts/validate-scheduling.sh
@@ -755,7 +755,18 @@ generate_bundle() {
  # Script-global so deploy_bundle's argocd-helm-oci branch can
  # pass it through to `helm install --set repoURL=…` without
  # duplicating the runner→service-DNS rewrite rule.
- OCI_IN_CLUSTER_REF="oci://registry.aicr-registry.svc.cluster.local:5000/aicr/${recipe}"
+ #
+ # Per PR #1032's contract change (and #1035's enforcement on the
+ # parent App template), --set repoURL must carry the PARENT
+ # NAMESPACE ONLY — without the per-recipe chart name. The
+ # argocd-helm parent Application appends .Chart.Name via its
+ # separate `source.chart` field; path-based child Applications
+ # append /{{ .Chart.Name }} via their template, so both halves
+ # resolve to the same artifact regardless of which Argo source
+ # type the cluster picks. The pushed artifact lives at
+ # oci://…/aicr/:; the recipe segment is the chart
+ # name, which Argo appends itself.
+ OCI_IN_CLUSTER_REF="oci://registry.aicr-registry.svc.cluster.local:5000/aicr"
  local in_cluster_repo="$OCI_IN_CLUSTER_REF"

  # Map our deployer-matrix name to aicr's --deployer value.
@@ -763,7 +774,7 @@ generate_bundle() {
  [[ "$DEPLOYER" == "argocd-helm-oci" ]] && deployer_arg="argocd-helm"

  log_info "Bundling for ${deployer_arg}, pushing to ${OCI_REF}"
- log_info "Argo CD will pull from ${in_cluster_repo}:${tag}"
+ log_info "Argo CD will pull from ${in_cluster_repo}/${recipe}:${tag} (parent namespace + .Chart.Name appended by the parent App)"
  # When --output is an oci:// reference, `aicr bundle` writes the
  # local bundle to ./bundle (relative to CWD) — there's no way to
  # redirect it to an absolute path. cd into WORK_DIR so the local