From 1aa3bfeed1e66ab5cfcd1f349bb624f8c2afad35 Mon Sep 17 00:00:00 2001
From: rsd-darshan
Date: Sun, 21 Jun 2026 14:33:55 +0545
Subject: [PATCH 1/4] fix(test): argocd-sync gate: use all-semantics, handle health-gated apps

Replace exist-semantics assert with inverted-polarity error: to enforce
that ALL Applications reach terminal-pass state, not just any one. Fixes
issue #1288 argocd half (flux half was #1290).

Changes:
- Add staged assert-root-app-succeeded step that gates on
  operationState.phase==Succeeded before fleet sweep, closing apply-time
  race where kubectl apply creates root with empty status
- Switch assert-all-applications-pass from assert: to error: with
  inverted predicate, achieving "all Applications pass" vs "at least one
  passes"
- Remove operationState.phase==Succeeded from per-child predicate to
  handle health-gated applications (e.g., kube-prometheus-stack) that
  stay Progressing on KWOK (kubelet never comes up) but are genuinely
  synced (sync.status==Synced sufficient)
- Bump timeout from 5m to 8m to match KWOK_FLUX_SYNC_TIMEOUT and give
  the more honest gate adequate time without racing

See ADR-010 "Sync Gate: All-Resources Semantics" and issue #1061 for
operationState.phase rationale.
---
 .../kwok/argocd-sync/chainsaw-test.yaml | 92 +++++++++++++------
 1 file changed, 66 insertions(+), 26 deletions(-)

diff --git a/tests/chainsaw/kwok/argocd-sync/chainsaw-test.yaml b/tests/chainsaw/kwok/argocd-sync/chainsaw-test.yaml
index 29c58cca6..7126ddf05 100644
--- a/tests/chainsaw/kwok/argocd-sync/chainsaw-test.yaml
+++ b/tests/chainsaw/kwok/argocd-sync/chainsaw-test.yaml
@@ -64,8 +64,10 @@ spec:
     # KWOK_ARGOCD_SYNC_TIMEOUT default. The argocd-helm-oci wrapper takes
     # slightly longer to converge than argocd-oci because the parent
     # Application has to render the child Apps before they begin
-    # reconciling. 5 minutes is conservative for KWOK pod-readiness sim.
-    assert: 5m
+    # reconciling. 8 minutes matches KWOK_FLUX_SYNC_TIMEOUT to accommodate
+    # the additional latency of the more honest gate (all-semantics vs
+    # exists-semantics) without racing; see issue #1288.
+    assert: 8m
   # Skip namespace and resource deletion during cleanup. The test creates
   # no resources of its own (it only asserts on Application CRs in the
   # argocd namespace), so there's nothing to clean. Chainsaw's default
@@ -98,41 +100,79 @@ spec:
               echo "--- Applications in argocd namespace ---"
               kubectl get applications -n argocd 2>&1 || true

+    - name: assert-root-app-succeeded
+      description: |
+        The root Application must reach operationState.phase==Succeeded before
+        we assert on the entire fleet. This gates the sweep against apply-time
+        races where kubectl apply creates the root with empty status and the
+        name-less sweep evaluates before the root has begun reconciliation,
+        causing a vacuous pass. For an App-of-Apps, this ensures children are
+        materialized (Argo's sync operation is not Succeeded until the parent
+        app has recursively applied all children).
+      try:
+        - assert:
+            resource:
+              apiVersion: argoproj.io/v1alpha1
+              kind: Application
+              metadata:
+                name: ($rootApp)
+                namespace: argocd
+              status:
+                (operationState.phase == 'Succeeded'): true
+      catch:
+        - script:
+            content: |
+              echo "--- Root Application status ---"
+              kubectl get application -n argocd -o yaml ($rootApp) 2>&1 || true
+              echo "--- argocd-application-controller (tail=200) ---"
+              kubectl logs -n argocd statefulset/argocd-application-controller --tail=200 2>&1 \
+                || kubectl logs -n argocd deploy/argocd-application-controller --tail=200 2>&1 \
+                || true
+
     - name: assert-all-applications-pass
       description: |
         Every Application in the argocd namespace must satisfy the 4-arm
         terminal-pass predicate. Chainsaw polls until convergence or the
-        spec.timeouts.assert deadline fires.
+        spec.timeouts.assert deadline fires. Uses inverted-polarity error
+        semantics to assert "all Applications in terminal state" rather than
+        "at least one Application in terminal state" — the error: block fails
+        when ANY Application *fails* the predicate (achieves all-semantics),
+        whereas assert: passes when ANY Application *matches* (exists-semantics).
+        See ADR-010 "Sync Gate: All-Resources Semantics".
+
+        Field paths are relative to `status:` here (e.g., `sync.status` not
+        `status.sync.status`). The per-child predicate does NOT include
+        `operationState.phase == 'Succeeded'` because on KWOK, health-gated
+        Applications (e.g., kube-prometheus-stack waiting for the kubelet to
+        report Pod readiness) may have `sync.status==Synced` but their sync
+        operation stays Running forever (the kubelet never comes up on KWOK).
+        `sync.status==Synced` alone is sufficient — it implies the sync
+        completed (or was aborted), and we skip the health-awaiting apps that
+        would hang the gate. The root check (assert-root-app-succeeded above)
+        still gates on `operationState.phase==Succeeded` to close the
+        apply-time race.
+
+        Arms (all require only sync + health state):
+        # 1. Synced + Healthy — canonical pass
+        # 2. Synced + Progressing — KWOK pod-readiness sim gap (ADR-008)
+        # 3. OutOfSync + Healthy — operator mutation (ClusterPolicy, DeviceClass)
+        # 4. Synced + Degraded — health controller divergence after a successful op
       try:
-        - assert:
+        - error:
+            # Inverted polarity: error: fails when ANY Application does NOT
+            # match the predicate. This achieves "assert all" semantics.
+            # Zero Applications is a pass (the sweep matched nothing, so there
+            # are no failures).
             resource:
               apiVersion: argoproj.io/v1alpha1
               kind: Application
               metadata:
                 namespace: argocd
               status:
-                # The 4-arm OR is encoded as a single JMESPath expression
-                # because YAML doesn't accept multi-line mapping keys.
-                # Field paths are relative to `status:` here (e.g.,
-                # `sync.status` not `status.sync.status`). All four arms
-                # require `operationState.phase == 'Succeeded'` so that
-                # an Application can only count as passing AFTER Argo
-                # CD has completed at least one sync operation. For an
-                # App-of-Apps, that's the gate ensuring the child
-                # Applications have been materialized before the
-                # selector's "every Application in argocd ns" sweep
-                # accepts the root alone — otherwise the assertion
-                # races ahead and passes on the root's transient
-                # Synced state, returning before children even exist.
-                # See #1061 (and #1050 for the migration that
-                # introduced the race).
-                #
-                # Arms:
-                # 1. Synced + Healthy — canonical pass
-                # 2. Synced + Progressing — KWOK pod-readiness sim gap (ADR-008)
-                # 3. OutOfSync + Healthy — operator mutation (ClusterPolicy, DeviceClass)
-                # 4. Synced + Degraded — health controller divergence after a successful op
-                (operationState.phase == 'Succeeded' && ((sync.status == 'Synced' && health.status == 'Healthy') || (sync.status == 'Synced' && health.status == 'Progressing') || (sync.status == 'OutOfSync' && health.status == 'Healthy') || (sync.status == 'Synced' && health.status == 'Degraded'))): true
+                # NOT (any of the 4 pass states) == NOT pass
+                # Negated via DeMorgan: NOT (A OR B OR C OR D) == (NOT A AND NOT B AND NOT C AND NOT D)
+                # Encode as: all 4 conditions false => application fails predicate
+                ((sync.status == 'Synced' && health.status == 'Healthy') || (sync.status == 'Synced' && health.status == 'Progressing') || (sync.status == 'OutOfSync' && health.status == 'Healthy') || (sync.status == 'Synced' && health.status == 'Degraded')): false
       catch:
         - script:
             content: |

From 76134cd9f5dea37c8f84611fed6acb511ce463c7 Mon Sep 17 00:00:00 2001
From: rsd-darshan
Date: Sun, 21 Jun 2026 15:24:49 +0545
Subject: [PATCH 2/4] docs(test): clarify arms description for argocd-sync gate

Explicitly note that operationState.phase==Succeeded is omitted from
per-child checks (only validated on root) to avoid hanging on
health-gated applications like kube-prometheus-stack that stay
op=Running forever on KWOK (kubelet never comes up).
---
 tests/chainsaw/kwok/argocd-sync/chainsaw-test.yaml | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/tests/chainsaw/kwok/argocd-sync/chainsaw-test.yaml b/tests/chainsaw/kwok/argocd-sync/chainsaw-test.yaml
index 6bd558e91..e2a4abdfd 100644
--- a/tests/chainsaw/kwok/argocd-sync/chainsaw-test.yaml
+++ b/tests/chainsaw/kwok/argocd-sync/chainsaw-test.yaml
@@ -159,11 +159,14 @@ spec:
         still gates on `operationState.phase==Succeeded` to close the
         apply-time race.

-        Arms (all require only sync + health state):
+        Arms (all require only sync + health state, no operationState check):
         # 1. Synced + Healthy — canonical pass
         # 2. Synced + Progressing — KWOK pod-readiness sim gap (ADR-008)
         # 3. OutOfSync + Healthy — operator mutation (ClusterPolicy, DeviceClass)
         # 4. Synced + Degraded — health controller divergence after a successful op
+        # Note: operationState.phase==Succeeded is gated by assert-root-app-succeeded
+        # to close apply-time races; per-child checks omit it to avoid hanging on
+        # health-gated apps like kube-prometheus-stack that stay op=Running forever on KWOK.
       try:
         - error:
             # Inverted polarity: error: fails when ANY Application does NOT

From 698f0afbe60b2e449797f248cba10da85fad3cb8 Mon Sep 17 00:00:00 2001
From: rsd-darshan
Date: Sun, 21 Jun 2026 15:26:10 +0545
Subject: [PATCH 3/4] fix(test): apply argocd-sync gate all-semantics fix to argocd-git-sync

The argocd-git-sync test is a sibling of argocd-sync that MUST stay
byte-identical in assert-root-app-present and
assert-all-applications-pass steps (SYNC NOTE in both files). Apply the
same all-semantics fix:
- Add staged assert-root-app-succeeded step
- Switch from assert: to inverted-polarity error:
- Remove operationState.phase==Succeeded from per-child check
- Update timeout from 5m to 8m to match argocd-sync
- Update header comments to reflect new semantics

Ensures both lanes pass only when ALL Applications reach terminal state,
not just any one, and handles health-gated applications that hang on
KWOK.
---
 .../kwok/argocd-git-sync/chainsaw-test.yaml | 108 +++++++++++++-----
 .../kwok/argocd-sync/chainsaw-test.yaml | 14 ++-
 2 files changed, 85 insertions(+), 37 deletions(-)

diff --git a/tests/chainsaw/kwok/argocd-git-sync/chainsaw-test.yaml b/tests/chainsaw/kwok/argocd-git-sync/chainsaw-test.yaml
index 5d32b7963..c4fa0068a 100644
--- a/tests/chainsaw/kwok/argocd-git-sync/chainsaw-test.yaml
+++ b/tests/chainsaw/kwok/argocd-git-sync/chainsaw-test.yaml
@@ -29,12 +29,16 @@
 # which is exactly what Argo CD's repo-server clones from.
 #
 # Pass predicate — every Application in the `argocd` namespace must match
-# one of these 4 terminal-pass states (mirroring argocd-sync):
+# one of these 4 terminal-pass states (mirroring argocd-sync). Note:
+# operationState.phase==Succeeded is validated only on the root Application
+# (assert-root-app-succeeded step) to close apply-time races; per-child
+# checks omit it to avoid hanging on health-gated apps like
+# kube-prometheus-stack that stay op=Running forever.
 #
-# 1. Synced + Healthy — canonical pass
-# 2. Synced + Progressing — KWOK pod-readiness gap (ADR-008)
-# 3. OutOfSync + Healthy + operationState=Succeeded — operator mutation (ClusterPolicy, DeviceClass, etc.)
-# 4. Synced + Degraded + operationState=Succeeded — health controller divergence after successful op
+# 1. Synced + Healthy — canonical pass
+# 2. Synced + Progressing — KWOK pod-readiness gap (ADR-008)
+# 3. OutOfSync + Healthy — operator mutation (ClusterPolicy, DeviceClass, etc.)
+# 4. Synced + Degraded — health controller divergence after successful op
 apiVersion: chainsaw.kyverno.io/v1alpha1
 kind: Test
 metadata:
@@ -58,8 +62,9 @@ spec:
     - name: repoURL
       value: ($values.repoURL || '')
   timeouts:
-    # KWOK_ARGOCD_SYNC_TIMEOUT default. Matches argocd-sync.
-    assert: 5m
+    # KWOK_ARGOCD_SYNC_TIMEOUT default. Matches argocd-sync (8 minutes for
+    # all-semantics gate; see issue #1288).
+    assert: 8m
   # See argocd-sync for the skipDelete rationale (KWOK has no
   # kube-controller-manager to finalize the auto-created test namespace).
   skipDelete: true
@@ -115,41 +120,82 @@ spec:
               echo "--- Applications in argocd namespace ---"
               kubectl get applications -n argocd 2>&1 || true

+    - name: assert-root-app-succeeded
+      description: |
+        The root Application must reach operationState.phase==Succeeded before
+        we assert on the entire fleet. This gates the sweep against apply-time
+        races where kubectl apply creates the root with empty status and the
+        name-less sweep evaluates before the root has begun reconciliation,
+        causing a vacuous pass. For an App-of-Apps, this ensures children are
+        materialized (Argo's sync operation is not Succeeded until the parent
+        app has recursively applied all children).
+      try:
+        - assert:
+            resource:
+              apiVersion: argoproj.io/v1alpha1
+              kind: Application
+              metadata:
+                name: ($rootApp)
+                namespace: argocd
+              status:
+                (operationState.phase == 'Succeeded'): true
+      catch:
+        - script:
+            content: |
+              echo "--- Root Application status ---"
+              kubectl get application -n argocd -o yaml ($rootApp) 2>&1 || true
+              echo "--- argocd-application-controller (tail=200) ---"
+              kubectl logs -n argocd statefulset/argocd-application-controller --tail=200 2>&1 \
+                || kubectl logs -n argocd deploy/argocd-application-controller --tail=200 2>&1 \
+                || true
+
     - name: assert-all-applications-pass
       description: |
         Every Application in the argocd namespace must satisfy the 4-arm
         terminal-pass predicate. Chainsaw polls until convergence or the
-        spec.timeouts.assert deadline fires.
+        spec.timeouts.assert deadline fires. Uses inverted-polarity error
+        semantics to assert "all Applications in terminal state" rather than
+        "at least one Application in terminal state" — the error: block fails
+        when ANY Application *fails* the predicate (achieves all-semantics),
+        whereas assert: passes when ANY Application *matches* (exists-semantics).
+        See ADR-010 "Sync Gate: All-Resources Semantics".
+
+        Field paths are relative to `status:` here (e.g., `sync.status` not
+        `status.sync.status`). The per-child predicate does NOT include
+        `operationState.phase == 'Succeeded'` because on KWOK, health-gated
+        Applications (e.g., kube-prometheus-stack waiting for the kubelet to
+        report Pod readiness) may have `sync.status==Synced` but their sync
+        operation stays Running forever (the kubelet never comes up on KWOK).
+        `sync.status==Synced` alone is sufficient — it implies the sync
+        completed (or was aborted), and we skip the health-awaiting apps that
+        would hang the gate. The root check (assert-root-app-succeeded above)
+        still gates on `operationState.phase==Succeeded` to close the
+        apply-time race.
+
+        Arms (all require only sync + health state, no operationState check):
+        # 1. Synced + Healthy — canonical pass
+        # 2. Synced + Progressing — KWOK pod-readiness sim gap (ADR-008)
+        # 3. OutOfSync + Healthy — operator mutation (ClusterPolicy, DeviceClass)
+        # 4. Synced + Degraded — health controller divergence after a successful op
+        # Note: operationState.phase==Succeeded is gated by assert-root-app-succeeded
+        # to close apply-time races; per-child checks omit it to avoid hanging on
+        # health-gated apps like kube-prometheus-stack that stay op=Running forever on KWOK.
       try:
-        - assert:
+        - error:
+            # Inverted polarity: error: fails when ANY Application does NOT
+            # match the predicate. This achieves "assert all" semantics.
+            # Zero Applications is a pass (the sweep matched nothing, so there
+            # are no failures).
             resource:
               apiVersion: argoproj.io/v1alpha1
               kind: Application
               metadata:
                 namespace: argocd
               status:
-                # The 4-arm OR is encoded as a single JMESPath expression
-                # because YAML doesn't accept multi-line mapping keys.
-                # Field paths are relative to `status:` here (e.g.,
-                # `sync.status` not `status.sync.status`). All four arms
-                # require `operationState.phase == 'Succeeded'` so that
-                # an Application can only count as passing AFTER Argo
-                # CD has completed at least one sync operation. For an
-                # App-of-Apps, that's the gate ensuring the child
-                # Applications have been materialized before the
-                # selector's "every Application in argocd ns" sweep
-                # accepts the root alone — otherwise the assertion
-                # races ahead and passes on the root's transient
-                # Synced state, returning before children even exist.
-                # See #1061 (and #1050 for the migration that
-                # introduced the race).
-                #
-                # Arms:
-                # 1. Synced + Healthy — canonical pass
-                # 2. Synced + Progressing — KWOK pod-readiness sim gap (ADR-008)
-                # 3. OutOfSync + Healthy — operator mutation (ClusterPolicy, DeviceClass)
-                # 4. Synced + Degraded — health controller divergence after a successful op
-                (operationState.phase == 'Succeeded' && ((sync.status == 'Synced' && health.status == 'Healthy') || (sync.status == 'Synced' && health.status == 'Progressing') || (sync.status == 'OutOfSync' && health.status == 'Healthy') || (sync.status == 'Synced' && health.status == 'Degraded'))): true
+                # NOT (any of the 4 pass states) == NOT pass
+                # Negated via DeMorgan: NOT (A OR B OR C OR D) == (NOT A AND NOT B AND NOT C AND NOT D)
+                # Encode as: all 4 conditions false => application fails predicate
+                ((sync.status == 'Synced' && health.status == 'Healthy') || (sync.status == 'Synced' && health.status == 'Progressing') || (sync.status == 'OutOfSync' && health.status == 'Healthy') || (sync.status == 'Synced' && health.status == 'Degraded')): false
       catch:
         - script:
             content: |
diff --git a/tests/chainsaw/kwok/argocd-sync/chainsaw-test.yaml b/tests/chainsaw/kwok/argocd-sync/chainsaw-test.yaml
index e2a4abdfd..84db51154 100644
--- a/tests/chainsaw/kwok/argocd-sync/chainsaw-test.yaml
+++ b/tests/chainsaw/kwok/argocd-sync/chainsaw-test.yaml
@@ -32,13 +32,15 @@
 # selects the right name from DEPLOYER.
 #
 # Pass predicate — every Application in the `argocd` namespace must match
-# one of these 4 terminal-pass states (mirroring the pre-migration jq
-# disjunction at validate-scheduling.sh:~1378):
+# one of these 4 terminal-pass states. Note: operationState.phase==Succeeded
+# is validated only on the root Application (assert-root-app-succeeded step)
+# to close apply-time races; per-child checks omit it to avoid hanging on
+# health-gated apps like kube-prometheus-stack that stay op=Running forever.
 #
-# 1. Synced + Healthy — canonical pass
-# 2. Synced + Progressing — KWOK pod-readiness gap (ADR-008)
-# 3. OutOfSync + Healthy + operationState=Succeeded — operator mutation (ClusterPolicy, DeviceClass, etc.)
-# 4. Synced + Degraded + operationState=Succeeded — health controller divergence after successful op
+# 1. Synced + Healthy — canonical pass
+# 2. Synced + Progressing — KWOK pod-readiness gap (ADR-008)
+# 3. OutOfSync + Healthy — operator mutation (ClusterPolicy, DeviceClass, etc.)
+# 4. Synced + Degraded — health controller divergence after successful op
 #
 # The 4-arm disjunction is encoded as a single JMESPath expression that
 # Chainsaw re-evaluates per Application matched by the resource selector.

From 87b7b417010e4a9f382f0b71e88df19386041b95 Mon Sep 17 00:00:00 2001
From: rsd-darshan
Date: Sun, 21 Jun 2026 17:03:37 +0545
Subject: [PATCH 4/4] fix(test): coordinate argocd sync timeout default with chainsaw spec

KWOK_ARGOCD_SYNC_TIMEOUT defaulted to 300s while the chainsaw test's
spec.timeouts.assert was bumped to 8m (480s). Since --assert-timeout
overrides the YAML value at invocation, the env var default was the one
actually in effect, silently undoing the 8m budget the all-semantics
gate needs. Bump the default to 480 and fix a stale comment that
referenced KWOK_FLUX_SYNC_TIMEOUT instead of KWOK_ARGOCD_SYNC_TIMEOUT.
---
 docs/contributor/tests.md | 6 ++++--
 kwok/scripts/validate-scheduling.sh | 5 +++--
 tests/chainsaw/kwok/argocd-sync/chainsaw-test.yaml | 8 ++++----
 3 files changed, 11 insertions(+), 8 deletions(-)

diff --git a/docs/contributor/tests.md b/docs/contributor/tests.md
index dd6e48475..46bd5dec9 100644
--- a/docs/contributor/tests.md
+++ b/docs/contributor/tests.md
@@ -371,7 +371,7 @@ independent.

 | Variable | Default | Purpose |
 |----------|---------|---------|
-| `KWOK_ARGOCD_SYNC_TIMEOUT` | `300` s | Deadline for all child Argo CD Applications to reach `Synced+Healthy` |
+| `KWOK_ARGOCD_SYNC_TIMEOUT` | `480` s | Deadline for all child Argo CD Applications to reach `Synced+Healthy` |
 | `KWOK_ARGOCD_ROOT_GRACE` | `30` s | Grace period for the root Application before deadline counting starts |
 | `KWOK_FLUX_SYNC_TIMEOUT` | `500` s | Deadline for source fetch (OCIRepository or GitRepository) + Kustomization apply + HelmReleases `Ready=True` + ArtifactGenerators Ready |
 | `KWOK_FLUX_ROOT_GRACE` | `30` s | Grace period for the outer Kustomization before deadline counting starts |
@@ -385,7 +385,9 @@ credential for the ephemeral in-cluster Gitea, not a secret.
 `argocd-git` reuses the `KWOK_ARGOCD_SYNC_TIMEOUT` budget.

 On a clean local Kind cluster `Synced+Healthy` lands in ~30 s; the
-300-second default exists to absorb CI variance. If a local run trips
+480-second default exists to absorb CI variance (the all-semantics gate
+waits for every Application, not just the first one, so it needs more
+budget than the old exists-semantics check did). If a local run trips
 code 50 but the cluster is otherwise healthy, raise the relevant timeout
 before assuming the recipe is broken — cold-cluster image pulls are the
 most common cause.
diff --git a/kwok/scripts/validate-scheduling.sh b/kwok/scripts/validate-scheduling.sh
index 97846b97d..7272184f1 100755
--- a/kwok/scripts/validate-scheduling.sh
+++ b/kwok/scripts/validate-scheduling.sh
@@ -63,7 +63,8 @@
 #                           free as well).
 # KWOK_ARGOCD_SYNC_TIMEOUT  Seconds to wait for all Argo CD Applications to
 #                           reach Synced + Healthy (or Progressing) before
-#                           failing. Default: 300.
+#                           failing. Default: 480 (matches the chainsaw test's
+#                           spec.timeouts.assert: 8m).
 # KWOK_ARGOCD_ROOT_GRACE    Seconds to wait for the root Argo CD Application
 #                           (nvidia-stack for argocd-oci, aicr-stack for
 #                           argocd-helm-oci) to appear in the argocd
@@ -1552,7 +1553,7 @@ wait_for_argocd_sync() {
       test_dir="${REPO_ROOT}/tests/chainsaw/kwok/argocd-sync"
       ;;
   esac
-  local sync_timeout="${KWOK_ARGOCD_SYNC_TIMEOUT:-300}s"
+  local sync_timeout="${KWOK_ARGOCD_SYNC_TIMEOUT:-480}s"

   log_info "Argo CD sync gate (chainsaw): rootApp=${ARGOCD_ROOT_APP} timeout=${sync_timeout}"

diff --git a/tests/chainsaw/kwok/argocd-sync/chainsaw-test.yaml b/tests/chainsaw/kwok/argocd-sync/chainsaw-test.yaml
index 84db51154..0ea19d90c 100644
--- a/tests/chainsaw/kwok/argocd-sync/chainsaw-test.yaml
+++ b/tests/chainsaw/kwok/argocd-sync/chainsaw-test.yaml
@@ -70,10 +70,10 @@ spec:
     - name: rootApp
       value: ($values.rootApp || 'aicr-stack')
   timeouts:
-    # KWOK_ARGOCD_SYNC_TIMEOUT default. The argocd-helm-oci wrapper takes
-    # slightly longer to converge than argocd-oci because the parent
-    # Application has to render the child Apps before they begin
-    # reconciling. 8 minutes matches KWOK_FLUX_SYNC_TIMEOUT to accommodate
+    # KWOK_ARGOCD_SYNC_TIMEOUT default (480s, matching the bash driver's
+    # default below). The argocd-helm-oci wrapper takes slightly longer to
+    # converge than argocd-oci because the parent Application has to render
+    # the child Apps before they begin reconciling. 8 minutes accommodates
     # the additional latency of the more honest gate (all-semantics vs
     # exists-semantics) without racing; see issue #1288.
     assert: 8m
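
Reviewer note on PATCH 4/4: the timeout precedence described in the commit message can be sketched in a few lines of bash. This is only an illustration under stated assumptions: `resolve_sync_timeout` and `build_chainsaw_cmd` are hypothetical helpers that do not exist in `validate-scheduling.sh`, and the exact shape of the `chainsaw test` invocation is assumed rather than copied from the script. Only the `${KWOK_ARGOCD_SYNC_TIMEOUT:-480}s` expansion and the `--assert-timeout` flag come from the patch itself.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, not part of the patch series.

# Resolve the assert timeout as wait_for_argocd_sync() does after PATCH 4/4:
# an explicit KWOK_ARGOCD_SYNC_TIMEOUT wins, otherwise the 480 s default
# (equal to spec.timeouts.assert: 8m) applies.
resolve_sync_timeout() {
  printf '%ss\n' "${KWOK_ARGOCD_SYNC_TIMEOUT:-480}"
}

# Dry run: print the command instead of executing it. The flag value is what
# takes effect at run time, because --assert-timeout overrides the YAML's
# spec.timeouts.assert. The argument layout here is an assumption.
build_chainsaw_cmd() {
  local test_dir="$1"
  printf 'chainsaw test %s --assert-timeout %s\n' \
    "${test_dir}" "$(resolve_sync_timeout)"
}
```

With the variable unset, `build_chainsaw_cmd tests/chainsaw/kwok/argocd-sync` prints a command carrying `--assert-timeout 480s`. Under the old `:-300` default the same call would have carried `300s`, which is how the 8m budget in the YAML was being silently shortened.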