6 changes: 4 additions & 2 deletions docs/contributor/tests.md
@@ -371,7 +371,7 @@ independent.

| Variable | Default | Purpose |
|----------|---------|---------|
-| `KWOK_ARGOCD_SYNC_TIMEOUT` | `300` s | Deadline for all child Argo CD Applications to reach `Synced+Healthy` |
+| `KWOK_ARGOCD_SYNC_TIMEOUT` | `480` s | Deadline for all child Argo CD Applications to reach `Synced+Healthy` |
| `KWOK_ARGOCD_ROOT_GRACE` | `30` s | Grace period for the root Application before deadline counting starts |
| `KWOK_FLUX_SYNC_TIMEOUT` | `500` s | Deadline for source fetch (OCIRepository or GitRepository) + Kustomization apply + HelmReleases `Ready=True` + ArtifactGenerators Ready |
| `KWOK_FLUX_ROOT_GRACE` | `30` s | Grace period for the outer Kustomization before deadline counting starts |
@@ -385,7 +385,9 @@ credential for the ephemeral in-cluster Gitea, not a secret.
`argocd-git` reuses the `KWOK_ARGOCD_SYNC_TIMEOUT` budget.

On a clean local Kind cluster `Synced+Healthy` lands in ~30 s; the
-300-second default exists to absorb CI variance. If a local run trips
+480-second default exists to absorb CI variance (the all-semantics gate
+waits for every Application, not just the first one, so it needs more
+budget than the old exists-semantics check did). If a local run trips
code 50 but the cluster is otherwise healthy, raise the relevant
timeout before assuming the recipe is broken — cold-cluster image
pulls are the most common cause.
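
The defaults in the table can be overridden per run by setting the variable in the environment. As a minimal sketch, assuming each variable is resolved with the shell's `:-` default expansion (the form visible for `KWOK_ARGOCD_SYNC_TIMEOUT` in `validate-scheduling.sh`), the helper below shows the fallback behavior. `effective_timeout` is a hypothetical name used only for illustration and is not part of the repository.

```shell
# Hypothetical helper (not in the repository): print the effective number of
# seconds for a timeout variable, falling back to the documented default when
# the variable is unset or empty.
effective_timeout() (
  var_name=$1
  default=$2
  # Indirect lookup of the variable named in $var_name, with a default.
  eval "value=\${${var_name}:-\$default}"
  # Reject anything that is not a plain non-negative integer.
  case $value in
    ''|*[!0-9]*) echo "invalid ${var_name}: ${value}" >&2; exit 1 ;;
  esac
  printf '%s\n' "$value"
)
```

For a slow local cluster, setting something like `KWOK_ARGOCD_SYNC_TIMEOUT=900` in the environment before invoking `kwok/scripts/validate-scheduling.sh` (with whatever arguments the run normally takes) would raise the budget without editing the script.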
5 changes: 3 additions & 2 deletions kwok/scripts/validate-scheduling.sh
@@ -63,7 +63,8 @@
# free as well).
# KWOK_ARGOCD_SYNC_TIMEOUT Seconds to wait for all Argo CD Applications to
# reach Synced + Healthy (or Progressing) before
-# failing. Default: 300.
+# failing. Default: 480 (matches the chainsaw test's
+# spec.timeouts.assert: 8m).
# KWOK_ARGOCD_ROOT_GRACE Seconds to wait for the root Argo CD Application
# (nvidia-stack for argocd-oci, aicr-stack for
# argocd-helm-oci) to appear in the argocd
@@ -1552,7 +1553,7 @@ wait_for_argocd_sync() {
test_dir="${REPO_ROOT}/tests/chainsaw/kwok/argocd-sync"
;;
esac
-local sync_timeout="${KWOK_ARGOCD_SYNC_TIMEOUT:-300}s"
+local sync_timeout="${KWOK_ARGOCD_SYNC_TIMEOUT:-480}s"

log_info "Argo CD sync gate (chainsaw): rootApp=${ARGOCD_ROOT_APP} timeout=${sync_timeout}"
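
The script default (480 seconds) and the Chainsaw tests' `spec.timeouts.assert: 8m` are meant to describe the same budget. The sketch below is one way to compare the two notations in plain shell. `duration_to_seconds` is a hypothetical helper, and it assumes durations are a single integer followed by `s`, `m`, or `h`, which covers the values in this change but not compound durations.

```shell
# Hypothetical helper (not in the repository): convert a duration such as
# "480s" or "8m" to seconds. Only <integer><s|m|h> is handled.
duration_to_seconds() (
  d=$1
  n=${d%[smh]}      # numeric part: strip one trailing unit letter
  unit=${d#"$n"}    # whatever remains after the numeric part
  case $n in
    ''|*[!0-9]*) echo "unsupported duration: $d" >&2; exit 1 ;;
  esac
  case $unit in
    s) echo "$n" ;;
    m) echo $((n * 60)) ;;
    h) echo $((n * 3600)) ;;
    *) echo "unsupported duration: $d" >&2; exit 1 ;;
  esac
)
```

With this, `duration_to_seconds 8m` and `duration_to_seconds 480s` print the same number, so a mismatch between the script default and the YAML timeout is easy to spot.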

108 changes: 77 additions & 31 deletions tests/chainsaw/kwok/argocd-git-sync/chainsaw-test.yaml
@@ -29,12 +29,16 @@
# which is exactly what Argo CD's repo-server clones from.
#
# Pass predicate — every Application in the `argocd` namespace must match
-# one of these 4 terminal-pass states (mirroring argocd-sync):
+# one of these 4 terminal-pass states (mirroring argocd-sync). Note:
+# operationState.phase==Succeeded is validated only on the root Application
+# (assert-root-app-succeeded step) to close apply-time races; per-child
+# checks omit it to avoid hanging on health-gated apps like
+# kube-prometheus-stack that stay op=Running forever.
#
-# 1. Synced + Healthy — canonical pass
-# 2. Synced + Progressing — KWOK pod-readiness gap (ADR-008)
-# 3. OutOfSync + Healthy + operationState=Succeeded — operator mutation (ClusterPolicy, DeviceClass, etc.)
-# 4. Synced + Degraded + operationState=Succeeded — health controller divergence after successful op
+# 1. Synced + Healthy — canonical pass
+# 2. Synced + Progressing — KWOK pod-readiness gap (ADR-008)
+# 3. OutOfSync + Healthy — operator mutation (ClusterPolicy, DeviceClass, etc.)
+# 4. Synced + Degraded — health controller divergence after successful op
apiVersion: chainsaw.kyverno.io/v1alpha1
kind: Test
metadata:
@@ -58,8 +62,9 @@ spec:
- name: repoURL
value: ($values.repoURL || '')
timeouts:
-# KWOK_ARGOCD_SYNC_TIMEOUT default. Matches argocd-sync.
-assert: 5m
+# KWOK_ARGOCD_SYNC_TIMEOUT default. Matches argocd-sync (8 minutes for
+# all-semantics gate; see issue #1288).
+assert: 8m
# See argocd-sync for the skipDelete rationale (KWOK has no
# kube-controller-manager to finalize the auto-created test namespace).
skipDelete: true
@@ -115,41 +120,82 @@ spec:
echo "--- Applications in argocd namespace ---"
kubectl get applications -n argocd 2>&1 || true

+- name: assert-root-app-succeeded
+description: |
+The root Application must reach operationState.phase==Succeeded before
+we assert on the entire fleet. This gates the sweep against apply-time
+races where kubectl apply creates the root with empty status and the
+name-less sweep evaluates before the root has begun reconciliation,
+causing a vacuous pass. For an App-of-Apps, this ensures children are
+materialized (Argo's sync operation is not Succeeded until the parent
+app has recursively applied all children).
+try:
+- assert:
+resource:
+apiVersion: argoproj.io/v1alpha1
+kind: Application
+metadata:
+name: ($rootApp)
+namespace: argocd
+status:
+(operationState.phase == 'Succeeded'): true
+catch:
+- script:
+content: |
+echo "--- Root Application status ---"
+kubectl get application -n argocd -o yaml ($rootApp) 2>&1 || true
+echo "--- argocd-application-controller (tail=200) ---"
+kubectl logs -n argocd statefulset/argocd-application-controller --tail=200 2>&1 \
+|| kubectl logs -n argocd deploy/argocd-application-controller --tail=200 2>&1 \
+|| true
+
- name: assert-all-applications-pass
description: |
Every Application in the argocd namespace must satisfy the 4-arm
terminal-pass predicate. Chainsaw polls until convergence or the
-spec.timeouts.assert deadline fires.
+spec.timeouts.assert deadline fires. Uses inverted-polarity error
+semantics to assert "all Applications in terminal state" rather than
+"at least one Application in terminal state" — the error: block fails
+when ANY Application *fails* the predicate (achieves all-semantics),
+whereas assert: passes when ANY Application *matches* (exists-semantics).
+See ADR-010 "Sync Gate: All-Resources Semantics".
+
+Field paths are relative to `status:` here (e.g., `sync.status` not
+`status.sync.status`). The per-child predicate does NOT include
+`operationState.phase == 'Succeeded'` because on KWOK, health-gated
+Applications (e.g., kube-prometheus-stack waiting for the kubelet to
+report Pod readiness) may have `sync.status==Synced` but their sync
+operation stays Running forever (the kubelet never comes up on KWOK).
+`sync.status==Synced` alone is sufficient — it implies the sync
+completed (or was aborted), and we skip the health-awaiting apps that
+would hang the gate. The root check (assert-root-app-succeeded above)
+still gates on `operationState.phase==Succeeded` to close the
+apply-time race.
+
+Arms (all require only sync + health state, no operationState check):
+# 1. Synced + Healthy — canonical pass
+# 2. Synced + Progressing — KWOK pod-readiness sim gap (ADR-008)
+# 3. OutOfSync + Healthy — operator mutation (ClusterPolicy, DeviceClass)
+# 4. Synced + Degraded — health controller divergence after a successful op
+# Note: operationState.phase==Succeeded is gated by assert-root-app-succeeded
+# to close apply-time races; per-child checks omit it to avoid hanging on
+# health-gated apps like kube-prometheus-stack that stay op=Running forever on KWOK.
try:
-- assert:
+- error:
+# Inverted polarity: error: fails when ANY Application does NOT
+# match the predicate. This achieves "assert all" semantics.
+# Zero Applications is a pass (the sweep matched nothing, so there
+# are no failures).
resource:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
namespace: argocd
status:
-# The 4-arm OR is encoded as a single JMESPath expression
-# because YAML doesn't accept multi-line mapping keys.
-# Field paths are relative to `status:` here (e.g.,
-# `sync.status` not `status.sync.status`). All four arms
-# require `operationState.phase == 'Succeeded'` so that
-# an Application can only count as passing AFTER Argo
-# CD has completed at least one sync operation. For an
-# App-of-Apps, that's the gate ensuring the child
-# Applications have been materialized before the
-# selector's "every Application in argocd ns" sweep
-# accepts the root alone — otherwise the assertion
-# races ahead and passes on the root's transient
-# Synced state, returning before children even exist.
-# See #1061 (and #1050 for the migration that
-# introduced the race).
-#
-# Arms:
-# 1. Synced + Healthy — canonical pass
-# 2. Synced + Progressing — KWOK pod-readiness sim gap (ADR-008)
-# 3. OutOfSync + Healthy — operator mutation (ClusterPolicy, DeviceClass)
-# 4. Synced + Degraded — health controller divergence after a successful op
-(operationState.phase == 'Succeeded' && ((sync.status == 'Synced' && health.status == 'Healthy') || (sync.status == 'Synced' && health.status == 'Progressing') || (sync.status == 'OutOfSync' && health.status == 'Healthy') || (sync.status == 'Synced' && health.status == 'Degraded'))): true
+# NOT (any of the 4 pass states) == NOT pass
+# Negated via DeMorgan: NOT (A OR B OR C OR D) == (NOT A AND NOT B AND NOT C AND NOT D)
+# Encode as: all 4 conditions false => application fails predicate
+((sync.status == 'Synced' && health.status == 'Healthy') || (sync.status == 'Synced' && health.status == 'Progressing') || (sync.status == 'OutOfSync' && health.status == 'Healthy') || (sync.status == 'Synced' && health.status == 'Degraded')): false
catch:
- script:
content: |
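
The difference between all-semantics and exists-semantics, and the reason an empty sweep passes vacuously, can be reproduced outside Chainsaw. The sketch below is a plain-shell analogue of the gate, not the test's implementation: `is_terminal_pass` and `all_applications_pass` are hypothetical names, and the input format (one `name sync health` triple per line) is an assumption made for the example.

```shell
# Hypothetical helper: succeed when a (sync, health) pair is one of the four
# terminal-pass states listed in the test header.
is_terminal_pass() {
  case "$1/$2" in
    Synced/Healthy|Synced/Progressing|OutOfSync/Healthy|Synced/Degraded) return 0 ;;
    *) return 1 ;;
  esac
}

# Read "name sync health" triples on stdin and fail if ANY of them is outside
# the pass set (all-semantics). Empty input passes vacuously, which is why a
# separate root-Application gate has to run first.
all_applications_pass() (
  failed=0
  while read -r name sync health; do
    [ -n "$name" ] || continue
    if ! is_terminal_pass "$sync" "$health"; then
      echo "not passing: $name ($sync/$health)" >&2
      failed=1
    fi
  done
  [ "$failed" -eq 0 ]
)
```

An Application with an empty status shows up here as a name with no sync or health value and is reported as not passing, while a sweep that matches zero Applications returns success.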
115 changes: 80 additions & 35 deletions tests/chainsaw/kwok/argocd-sync/chainsaw-test.yaml
@@ -32,13 +32,15 @@
# selects the right name from DEPLOYER.
#
# Pass predicate — every Application in the `argocd` namespace must match
-# one of these 4 terminal-pass states (mirroring the pre-migration jq
-# disjunction at validate-scheduling.sh:~1378):
+# one of these 4 terminal-pass states. Note: operationState.phase==Succeeded
+# is validated only on the root Application (assert-root-app-succeeded step)
+# to close apply-time races; per-child checks omit it to avoid hanging on
+# health-gated apps like kube-prometheus-stack that stay op=Running forever.
#
-# 1. Synced + Healthy — canonical pass
-# 2. Synced + Progressing — KWOK pod-readiness gap (ADR-008)
-# 3. OutOfSync + Healthy + operationState=Succeeded — operator mutation (ClusterPolicy, DeviceClass, etc.)
-# 4. Synced + Degraded + operationState=Succeeded — health controller divergence after successful op
+# 1. Synced + Healthy — canonical pass
+# 2. Synced + Progressing — KWOK pod-readiness gap (ADR-008)
+# 3. OutOfSync + Healthy — operator mutation (ClusterPolicy, DeviceClass, etc.)
+# 4. Synced + Degraded — health controller divergence after successful op
#
# The 4-arm disjunction is encoded as a single JMESPath expression that
# Chainsaw re-evaluates per Application matched by the resource selector.
@@ -68,11 +70,13 @@ spec:
- name: rootApp
value: ($values.rootApp || 'aicr-stack')
timeouts:
-# KWOK_ARGOCD_SYNC_TIMEOUT default. The argocd-helm-oci wrapper takes
-# slightly longer to converge than argocd-oci because the parent
-# Application has to render the child Apps before they begin
-# reconciling. 5 minutes is conservative for KWOK pod-readiness sim.
-assert: 5m
+# KWOK_ARGOCD_SYNC_TIMEOUT default (480s, matching the bash driver's
+# default below). The argocd-helm-oci wrapper takes slightly longer to
+# converge than argocd-oci because the parent Application has to render
+# the child Apps before they begin reconciling. 8 minutes accommodates
+# the additional latency of the more honest gate (all-semantics vs
+# exists-semantics) without racing; see issue #1288.
+assert: 8m
# Skip namespace and resource deletion during cleanup. The test creates
# no resources of its own (it only asserts on Application CRs in the
# argocd namespace), so there's nothing to clean. Chainsaw's default
@@ -105,41 +109,82 @@ spec:
echo "--- Applications in argocd namespace ---"
kubectl get applications -n argocd 2>&1 || true

+- name: assert-root-app-succeeded
+description: |
+The root Application must reach operationState.phase==Succeeded before
+we assert on the entire fleet. This gates the sweep against apply-time
+races where kubectl apply creates the root with empty status and the
+name-less sweep evaluates before the root has begun reconciliation,
+causing a vacuous pass. For an App-of-Apps, this ensures children are
+materialized (Argo's sync operation is not Succeeded until the parent
+app has recursively applied all children).
+try:
+- assert:
+resource:
+apiVersion: argoproj.io/v1alpha1
+kind: Application
+metadata:
+name: ($rootApp)
+namespace: argocd
+status:
+(operationState.phase == 'Succeeded'): true
+catch:
+- script:
+content: |
+echo "--- Root Application status ---"
+kubectl get application -n argocd -o yaml ($rootApp) 2>&1 || true
+echo "--- argocd-application-controller (tail=200) ---"
+kubectl logs -n argocd statefulset/argocd-application-controller --tail=200 2>&1 \
+|| kubectl logs -n argocd deploy/argocd-application-controller --tail=200 2>&1 \
+|| true
+
- name: assert-all-applications-pass
description: |
Every Application in the argocd namespace must satisfy the 4-arm
terminal-pass predicate. Chainsaw polls until convergence or the
-spec.timeouts.assert deadline fires.
+spec.timeouts.assert deadline fires. Uses inverted-polarity error
+semantics to assert "all Applications in terminal state" rather than
+"at least one Application in terminal state" — the error: block fails
+when ANY Application *fails* the predicate (achieves all-semantics),
+whereas assert: passes when ANY Application *matches* (exists-semantics).
+See ADR-010 "Sync Gate: All-Resources Semantics".
+
+Field paths are relative to `status:` here (e.g., `sync.status` not
+`status.sync.status`). The per-child predicate does NOT include
+`operationState.phase == 'Succeeded'` because on KWOK, health-gated
+Applications (e.g., kube-prometheus-stack waiting for the kubelet to
+report Pod readiness) may have `sync.status==Synced` but their sync
+operation stays Running forever (the kubelet never comes up on KWOK).
+`sync.status==Synced` alone is sufficient — it implies the sync
+completed (or was aborted), and we skip the health-awaiting apps that
+would hang the gate. The root check (assert-root-app-succeeded above)
+still gates on `operationState.phase==Succeeded` to close the
+apply-time race.
+
+Arms (all require only sync + health state, no operationState check):
+# 1. Synced + Healthy — canonical pass
+# 2. Synced + Progressing — KWOK pod-readiness sim gap (ADR-008)
+# 3. OutOfSync + Healthy — operator mutation (ClusterPolicy, DeviceClass)
+# 4. Synced + Degraded — health controller divergence after a successful op
+# Note: operationState.phase==Succeeded is gated by assert-root-app-succeeded
+# to close apply-time races; per-child checks omit it to avoid hanging on
+# health-gated apps like kube-prometheus-stack that stay op=Running forever on KWOK.
try:
-- assert:
+- error:
+# Inverted polarity: error: fails when ANY Application does NOT
+# match the predicate. This achieves "assert all" semantics.
+# Zero Applications is a pass (the sweep matched nothing, so there
+# are no failures).
resource:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
namespace: argocd
status:
-# The 4-arm OR is encoded as a single JMESPath expression
-# because YAML doesn't accept multi-line mapping keys.
-# Field paths are relative to `status:` here (e.g.,
-# `sync.status` not `status.sync.status`). All four arms
-# require `operationState.phase == 'Succeeded'` so that
-# an Application can only count as passing AFTER Argo
-# CD has completed at least one sync operation. For an
-# App-of-Apps, that's the gate ensuring the child
-# Applications have been materialized before the
-# selector's "every Application in argocd ns" sweep
-# accepts the root alone — otherwise the assertion
-# races ahead and passes on the root's transient
-# Synced state, returning before children even exist.
-# See #1061 (and #1050 for the migration that
-# introduced the race).
-#
-# Arms:
-# 1. Synced + Healthy — canonical pass
-# 2. Synced + Progressing — KWOK pod-readiness sim gap (ADR-008)
-# 3. OutOfSync + Healthy — operator mutation (ClusterPolicy, DeviceClass)
-# 4. Synced + Degraded — health controller divergence after a successful op
-(operationState.phase == 'Succeeded' && ((sync.status == 'Synced' && health.status == 'Healthy') || (sync.status == 'Synced' && health.status == 'Progressing') || (sync.status == 'OutOfSync' && health.status == 'Healthy') || (sync.status == 'Synced' && health.status == 'Degraded'))): true
+# NOT (any of the 4 pass states) == NOT pass
+# Negated via DeMorgan: NOT (A OR B OR C OR D) == (NOT A AND NOT B AND NOT C AND NOT D)
+# Encode as: all 4 conditions false => application fails predicate
+((sync.status == 'Synced' && health.status == 'Healthy') || (sync.status == 'Synced' && health.status == 'Progressing') || (sync.status == 'OutOfSync' && health.status == 'Healthy') || (sync.status == 'Synced' && health.status == 'Degraded')): false
catch:
- script:
content: |
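
The two steps form an ordered gate: first wait for the root Application, then sweep every Application, each polled until it converges or a deadline passes. A rough shell analogue of that poll-until-deadline pattern is sketched below. `poll_until` is a hypothetical helper written for this example; it does not describe how Chainsaw implements its own polling.

```shell
# Hypothetical helper: re-run a check command until it succeeds or the
# deadline passes. Usage: poll_until <timeout-seconds> <interval-seconds> <cmd...>
poll_until() (
  timeout_s=$1
  interval_s=$2
  shift 2
  deadline=$(( $(date +%s) + timeout_s ))
  while :; do
    # The check always runs at least once, even with a zero timeout.
    if "$@"; then exit 0; fi
    [ "$(date +%s)" -lt "$deadline" ] || exit 1
    sleep "$interval_s"
  done
)
```

Chaining two calls with `&&`, one for a root check and one for the fleet sweep, gives the same ordering as the steps above: the sweep never starts until the root check has succeeded, so it cannot pass on an empty or half-materialized set of Applications.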