diff --git a/docs/contributor/tests.md b/docs/contributor/tests.md index dd6e48475..46bd5dec9 100644 --- a/docs/contributor/tests.md +++ b/docs/contributor/tests.md @@ -371,7 +371,7 @@ independent. | Variable | Default | Purpose | |----------|---------|---------| -| `KWOK_ARGOCD_SYNC_TIMEOUT` | `300` s | Deadline for all child Argo CD Applications to reach `Synced+Healthy` | +| `KWOK_ARGOCD_SYNC_TIMEOUT` | `480` s | Deadline for all child Argo CD Applications to reach `Synced+Healthy` | | `KWOK_ARGOCD_ROOT_GRACE` | `30` s | Grace period for the root Application before deadline counting starts | | `KWOK_FLUX_SYNC_TIMEOUT` | `500` s | Deadline for source fetch (OCIRepository or GitRepository) + Kustomization apply + HelmReleases `Ready=True` + ArtifactGenerators Ready | | `KWOK_FLUX_ROOT_GRACE` | `30` s | Grace period for the outer Kustomization before deadline counting starts | @@ -385,7 +385,9 @@ credential for the ephemeral in-cluster Gitea, not a secret. `argocd-git` reuses the `KWOK_ARGOCD_SYNC_TIMEOUT` budget. On a clean local Kind cluster `Synced+Healthy` lands in ~30 s; the -300-second default exists to absorb CI variance. If a local run trips +480-second default exists to absorb CI variance (the all-semantics gate +waits for every Application, not just the first one, so it needs more +budget than the old exists-semantics check did). If a local run trips code 50 but the cluster is otherwise healthy, raise the relevant timeout before assuming the recipe is broken — cold-cluster image pulls are the most common cause. diff --git a/kwok/scripts/validate-scheduling.sh b/kwok/scripts/validate-scheduling.sh index 97846b97d..7272184f1 100755 --- a/kwok/scripts/validate-scheduling.sh +++ b/kwok/scripts/validate-scheduling.sh @@ -63,7 +63,8 @@ # free as well). # KWOK_ARGOCD_SYNC_TIMEOUT Seconds to wait for all Argo CD Applications to # reach Synced + Healthy (or Progressing) before -# failing. Default: 300. +# failing. Default: 480 (matches the chainsaw test's +# spec.timeouts.assert: 8m). # KWOK_ARGOCD_ROOT_GRACE Seconds to wait for the root Argo CD Application # (nvidia-stack for argocd-oci, aicr-stack for # argocd-helm-oci) to appear in the argocd @@ -1552,7 +1553,7 @@ wait_for_argocd_sync() { test_dir="${REPO_ROOT}/tests/chainsaw/kwok/argocd-sync" ;; esac - local sync_timeout="${KWOK_ARGOCD_SYNC_TIMEOUT:-300}s" + local sync_timeout="${KWOK_ARGOCD_SYNC_TIMEOUT:-480}s" log_info "Argo CD sync gate (chainsaw): rootApp=${ARGOCD_ROOT_APP} timeout=${sync_timeout}" diff --git a/tests/chainsaw/kwok/argocd-git-sync/chainsaw-test.yaml b/tests/chainsaw/kwok/argocd-git-sync/chainsaw-test.yaml index 5d32b7963..c4fa0068a 100644 --- a/tests/chainsaw/kwok/argocd-git-sync/chainsaw-test.yaml +++ b/tests/chainsaw/kwok/argocd-git-sync/chainsaw-test.yaml @@ -29,12 +29,16 @@ # which is exactly what Argo CD's repo-server clones from. # # Pass predicate — every Application in the `argocd` namespace must match -# one of these 4 terminal-pass states (mirroring argocd-sync): +# one of these 4 terminal-pass states (mirroring argocd-sync). Note: +# operationState.phase==Succeeded is validated only on the root Application +# (assert-root-app-succeeded step) to close apply-time races; per-child +# checks omit it to avoid hanging on health-gated apps like +# kube-prometheus-stack that stay op=Running forever. # -# 1. Synced + Healthy — canonical pass -# 2. Synced + Progressing — KWOK pod-readiness gap (ADR-008) -# 3. OutOfSync + Healthy + operationState=Succeeded — operator mutation (ClusterPolicy, DeviceClass, etc.) -# 4. Synced + Degraded + operationState=Succeeded — health controller divergence after successful op +# 1. Synced + Healthy — canonical pass +# 2. Synced + Progressing — KWOK pod-readiness gap (ADR-008) +# 3. OutOfSync + Healthy — operator mutation (ClusterPolicy, DeviceClass, etc.) +# 4. Synced + Degraded — health controller divergence after successful op apiVersion: chainsaw.kyverno.io/v1alpha1 kind: Test metadata: @@ -58,8 +62,9 @@ spec: - name: repoURL value: ($values.repoURL || '') timeouts: - # KWOK_ARGOCD_SYNC_TIMEOUT default. Matches argocd-sync. - assert: 5m + # KWOK_ARGOCD_SYNC_TIMEOUT default. Matches argocd-sync (8 minutes for + # all-semantics gate; see issue #1288). + assert: 8m # See argocd-sync for the skipDelete rationale (KWOK has no # kube-controller-manager to finalize the auto-created test namespace). skipDelete: true @@ -115,41 +120,82 @@ spec: echo "--- Applications in argocd namespace ---" kubectl get applications -n argocd 2>&1 || true + - name: assert-root-app-succeeded + description: | + The root Application must reach operationState.phase==Succeeded before + we assert on the entire fleet. This gates the sweep against apply-time + races where kubectl apply creates the root with empty status and the + name-less sweep evaluates before the root has begun reconciliation, + causing a vacuous pass. For an App-of-Apps, this ensures children are + materialized (Argo's sync operation is not Succeeded until the parent + app has recursively applied all children). + try: + - assert: + resource: + apiVersion: argoproj.io/v1alpha1 + kind: Application + metadata: + name: ($rootApp) + namespace: argocd + status: + (operationState.phase == 'Succeeded'): true + catch: + - script: + content: | + echo "--- Root Application status ---" + kubectl get application -n argocd -o yaml ($rootApp) 2>&1 || true + echo "--- argocd-application-controller (tail=200) ---" + kubectl logs -n argocd statefulset/argocd-application-controller --tail=200 2>&1 \ + || kubectl logs -n argocd deploy/argocd-application-controller --tail=200 2>&1 \ + || true + - name: assert-all-applications-pass description: | Every Application in the argocd namespace must satisfy the 4-arm terminal-pass predicate. Chainsaw polls until convergence or the - spec.timeouts.assert deadline fires. + spec.timeouts.assert deadline fires. Uses inverted-polarity error + semantics to assert "all Applications in terminal state" rather than + "at least one Application in terminal state" — the error: block fails + when ANY Application *fails* the predicate (achieves all-semantics), + whereas assert: passes when ANY Application *matches* (exists-semantics). + See ADR-010 "Sync Gate: All-Resources Semantics". + + Field paths are relative to `status:` here (e.g., `sync.status` not + `status.sync.status`). The per-child predicate does NOT include + `operationState.phase == 'Succeeded'` because on KWOK, health-gated + Applications (e.g., kube-prometheus-stack waiting for the kubelet to + report Pod readiness) may have `sync.status==Synced` but their sync + operation stays Running forever (the kubelet never comes up on KWOK). + `sync.status==Synced` alone is sufficient — it implies the sync + completed (or was aborted), and we skip the health-awaiting apps that + would hang the gate. The root check (assert-root-app-succeeded above) + still gates on `operationState.phase==Succeeded` to close the + apply-time race. + + Arms (all require only sync + health state, no operationState check): + # 1. Synced + Healthy — canonical pass + # 2. Synced + Progressing — KWOK pod-readiness sim gap (ADR-008) + # 3. OutOfSync + Healthy — operator mutation (ClusterPolicy, DeviceClass) + # 4. Synced + Degraded — health controller divergence after a successful op + # Note: operationState.phase==Succeeded is gated by assert-root-app-succeeded + # to close apply-time races; per-child checks omit it to avoid hanging on + # health-gated apps like kube-prometheus-stack that stay op=Running forever on KWOK. try: - - assert: + - error: + # Inverted polarity: error: fails when ANY Application does NOT + # match the predicate. This achieves "assert all" semantics. + # Zero Applications is a pass (the sweep matched nothing, so there + # are no failures). resource: apiVersion: argoproj.io/v1alpha1 kind: Application metadata: namespace: argocd status: - # The 4-arm OR is encoded as a single JMESPath expression - # because YAML doesn't accept multi-line mapping keys. - # Field paths are relative to `status:` here (e.g., - # `sync.status` not `status.sync.status`). All four arms - # require `operationState.phase == 'Succeeded'` so that - # an Application can only count as passing AFTER Argo - # CD has completed at least one sync operation. For an - # App-of-Apps, that's the gate ensuring the child - # Applications have been materialized before the - # selector's "every Application in argocd ns" sweep - # accepts the root alone — otherwise the assertion - # races ahead and passes on the root's transient - # Synced state, returning before children even exist. - # See #1061 (and #1050 for the migration that - # introduced the race). - # - # Arms: - # 1. Synced + Healthy — canonical pass - # 2. Synced + Progressing — KWOK pod-readiness sim gap (ADR-008) - # 3. OutOfSync + Healthy — operator mutation (ClusterPolicy, DeviceClass) - # 4. Synced + Degraded — health controller divergence after a successful op - (operationState.phase == 'Succeeded' && ((sync.status == 'Synced' && health.status == 'Healthy') || (sync.status == 'Synced' && health.status == 'Progressing') || (sync.status == 'OutOfSync' && health.status == 'Healthy') || (sync.status == 'Synced' && health.status == 'Degraded'))): true + # NOT (any of the 4 pass states) == NOT pass + # Negated via DeMorgan: NOT (A OR B OR C OR D) == (NOT A AND NOT B AND NOT C AND NOT D) + # Encode as: all 4 conditions false => application fails predicate + ((sync.status == 'Synced' && health.status == 'Healthy') || (sync.status == 'Synced' && health.status == 'Progressing') || (sync.status == 'OutOfSync' && health.status == 'Healthy') || (sync.status == 'Synced' && health.status == 'Degraded')): false catch: - script: content: | diff --git a/tests/chainsaw/kwok/argocd-sync/chainsaw-test.yaml b/tests/chainsaw/kwok/argocd-sync/chainsaw-test.yaml index 419dab707..0ea19d90c 100644 --- a/tests/chainsaw/kwok/argocd-sync/chainsaw-test.yaml +++ b/tests/chainsaw/kwok/argocd-sync/chainsaw-test.yaml @@ -32,13 +32,15 @@ # selects the right name from DEPLOYER. # # Pass predicate — every Application in the `argocd` namespace must match -# one of these 4 terminal-pass states (mirroring the pre-migration jq -# disjunction at validate-scheduling.sh:~1378): +# one of these 4 terminal-pass states. Note: operationState.phase==Succeeded +# is validated only on the root Application (assert-root-app-succeeded step) +# to close apply-time races; per-child checks omit it to avoid hanging on +# health-gated apps like kube-prometheus-stack that stay op=Running forever. # -# 1. Synced + Healthy — canonical pass -# 2. Synced + Progressing — KWOK pod-readiness gap (ADR-008) -# 3. OutOfSync + Healthy + operationState=Succeeded — operator mutation (ClusterPolicy, DeviceClass, etc.) -# 4. Synced + Degraded + operationState=Succeeded — health controller divergence after successful op +# 1. Synced + Healthy — canonical pass +# 2. Synced + Progressing — KWOK pod-readiness gap (ADR-008) +# 3. OutOfSync + Healthy — operator mutation (ClusterPolicy, DeviceClass, etc.) +# 4. Synced + Degraded — health controller divergence after successful op # # The 4-arm disjunction is encoded as a single JMESPath expression that # Chainsaw re-evaluates per Application matched by the resource selector. @@ -68,11 +70,13 @@ spec: - name: rootApp value: ($values.rootApp || 'aicr-stack') timeouts: - # KWOK_ARGOCD_SYNC_TIMEOUT default. The argocd-helm-oci wrapper takes - # slightly longer to converge than argocd-oci because the parent - # Application has to render the child Apps before they begin - # reconciling. 5 minutes is conservative for KWOK pod-readiness sim. - assert: 5m + # KWOK_ARGOCD_SYNC_TIMEOUT default (480s, matching the bash driver's + # default below). The argocd-helm-oci wrapper takes slightly longer to + # converge than argocd-oci because the parent Application has to render + # the child Apps before they begin reconciling. 8 minutes accommodates + # the additional latency of the more honest gate (all-semantics vs + # exists-semantics) without racing; see issue #1288. + assert: 8m # Skip namespace and resource deletion during cleanup. The test creates # no resources of its own (it only asserts on Application CRs in the # argocd namespace), so there's nothing to clean. Chainsaw's default @@ -105,41 +109,82 @@ spec: echo "--- Applications in argocd namespace ---" kubectl get applications -n argocd 2>&1 || true + - name: assert-root-app-succeeded + description: | + The root Application must reach operationState.phase==Succeeded before + we assert on the entire fleet. This gates the sweep against apply-time + races where kubectl apply creates the root with empty status and the + name-less sweep evaluates before the root has begun reconciliation, + causing a vacuous pass. For an App-of-Apps, this ensures children are + materialized (Argo's sync operation is not Succeeded until the parent + app has recursively applied all children). + try: + - assert: + resource: + apiVersion: argoproj.io/v1alpha1 + kind: Application + metadata: + name: ($rootApp) + namespace: argocd + status: + (operationState.phase == 'Succeeded'): true + catch: + - script: + content: | + echo "--- Root Application status ---" + kubectl get application -n argocd -o yaml ($rootApp) 2>&1 || true + echo "--- argocd-application-controller (tail=200) ---" + kubectl logs -n argocd statefulset/argocd-application-controller --tail=200 2>&1 \ + || kubectl logs -n argocd deploy/argocd-application-controller --tail=200 2>&1 \ + || true + - name: assert-all-applications-pass description: | Every Application in the argocd namespace must satisfy the 4-arm terminal-pass predicate. Chainsaw polls until convergence or the - spec.timeouts.assert deadline fires. + spec.timeouts.assert deadline fires. Uses inverted-polarity error + semantics to assert "all Applications in terminal state" rather than + "at least one Application in terminal state" — the error: block fails + when ANY Application *fails* the predicate (achieves all-semantics), + whereas assert: passes when ANY Application *matches* (exists-semantics). + See ADR-010 "Sync Gate: All-Resources Semantics". + + Field paths are relative to `status:` here (e.g., `sync.status` not + `status.sync.status`). The per-child predicate does NOT include + `operationState.phase == 'Succeeded'` because on KWOK, health-gated + Applications (e.g., kube-prometheus-stack waiting for the kubelet to + report Pod readiness) may have `sync.status==Synced` but their sync + operation stays Running forever (the kubelet never comes up on KWOK). + `sync.status==Synced` alone is sufficient — it implies the sync + completed (or was aborted), and we skip the health-awaiting apps that + would hang the gate. The root check (assert-root-app-succeeded above) + still gates on `operationState.phase==Succeeded` to close the + apply-time race. + + Arms (all require only sync + health state, no operationState check): + # 1. Synced + Healthy — canonical pass + # 2. Synced + Progressing — KWOK pod-readiness sim gap (ADR-008) + # 3. OutOfSync + Healthy — operator mutation (ClusterPolicy, DeviceClass) + # 4. Synced + Degraded — health controller divergence after a successful op + # Note: operationState.phase==Succeeded is gated by assert-root-app-succeeded + # to close apply-time races; per-child checks omit it to avoid hanging on + # health-gated apps like kube-prometheus-stack that stay op=Running forever on KWOK. try: - - assert: + - error: + # Inverted polarity: error: fails when ANY Application does NOT + # match the predicate. This achieves "assert all" semantics. + # Zero Applications is a pass (the sweep matched nothing, so there + # are no failures). resource: apiVersion: argoproj.io/v1alpha1 kind: Application metadata: namespace: argocd status: - # The 4-arm OR is encoded as a single JMESPath expression - # because YAML doesn't accept multi-line mapping keys. - # Field paths are relative to `status:` here (e.g., - # `sync.status` not `status.sync.status`). All four arms - # require `operationState.phase == 'Succeeded'` so that - # an Application can only count as passing AFTER Argo - # CD has completed at least one sync operation. For an - # App-of-Apps, that's the gate ensuring the child - # Applications have been materialized before the - # selector's "every Application in argocd ns" sweep - # accepts the root alone — otherwise the assertion - # races ahead and passes on the root's transient - # Synced state, returning before children even exist. - # See #1061 (and #1050 for the migration that - # introduced the race). - # - # Arms: - # 1. Synced + Healthy — canonical pass - # 2. Synced + Progressing — KWOK pod-readiness sim gap (ADR-008) - # 3. OutOfSync + Healthy — operator mutation (ClusterPolicy, DeviceClass) - # 4. Synced + Degraded — health controller divergence after a successful op - (operationState.phase == 'Succeeded' && ((sync.status == 'Synced' && health.status == 'Healthy') || (sync.status == 'Synced' && health.status == 'Progressing') || (sync.status == 'OutOfSync' && health.status == 'Healthy') || (sync.status == 'Synced' && health.status == 'Degraded'))): true + # NOT (any of the 4 pass states) == NOT pass + # Negated via DeMorgan: NOT (A OR B OR C OR D) == (NOT A AND NOT B AND NOT C AND NOT D) + # Encode as: all 4 conditions false => application fails predicate + ((sync.status == 'Synced' && health.status == 'Healthy') || (sync.status == 'Synced' && health.status == 'Progressing') || (sync.status == 'OutOfSync' && health.status == 'Healthy') || (sync.status == 'Synced' && health.status == 'Degraded')): false catch: - script: content: |