Skip to content

fix(test): argocd-sync gate: use all-semantics, handle health-gated apps#1397

Open
rsd-darshan wants to merge 5 commits into
NVIDIA:mainfrom
rsd-darshan:fix/argocd-sync-gate
Open

fix(test): argocd-sync gate: use all-semantics, handle health-gated apps#1397
rsd-darshan wants to merge 5 commits into
NVIDIA:mainfrom
rsd-darshan:fix/argocd-sync-gate

Conversation

@rsd-darshan

@rsd-darshan rsd-darshan commented Jun 21, 2026

Copy link
Copy Markdown
Contributor

Summary

Fixes issue #1388. Replaces exist-semantics assert with inverted-polarity error: to enforce that ALL Applications reach terminal-pass state, not just any one.

Changes

  1. Added staged assert-root-app-succeeded step — gates fleet sweep against apply-time races where kubectl apply creates root with empty status and the sweep evaluates before children materialize
  2. Switched assert-all-applications-pass to inverted-polarity error: blocks when ANY Application fails predicate (all-semantics vs exist-semantics)
  3. Removed operationState.phase==Succeeded from per-child check — fixes hang on health-gated apps (e.g., kube-prometheus-stack) that stay Progressing on KWOK but are genuinely synced
  4. Increased timeout from 5m to 8m — gives the all-semantics gate (which waits for every Application, not just the first) more budget than the old exists-semantics check needed
  5. Bumped KWOK_ARGOCD_SYNC_TIMEOUT default from 300 to 480 in validate-scheduling.sh — this env var feeds --assert-timeout, which overrides spec.timeouts.assert at invocation, so the YAML's 8m bump was a no-op until the script's default was raised too

Root Cause

The original code used a name-less assert: block that passed when ANY Application matched the predicate (exist-semantics). Two failure modes emerged:

  • Apply-time race: Root app created with empty status; sweep evaluates before children exist → vacuous pass
  • Health-gate hang: Apps with op=Running waiting for Pod readiness never reach op=Succeeded → timeout

Test Coverage

  • Modified tests/chainsaw/kwok/argocd-sync/chainsaw-test.yaml, tests/chainsaw/kwok/argocd-git-sync/chainsaw-test.yaml, kwok/scripts/validate-scheduling.sh, docs/contributor/tests.md
  • No Go code changes
  • Syntax validated

References

Replace exist-semantics assert with inverted-polarity error: to enforce
that ALL Applications reach terminal-pass state, not just any one. Fixes
issue NVIDIA#1288 argocd half (flux half was NVIDIA#1290).

Changes:
- Add staged assert-root-app-succeeded step that gates on
  operationState.phase==Succeeded before fleet sweep, closing apply-time
  race where kubectl apply creates root with empty status
- Switch assert-all-applications-pass from assert: to error: with inverted
  predicate, achieving "all Applications pass" vs "at least one passes"
- Remove operationState.phase==Succeeded from per-child predicate to
  handle health-gated applications (e.g., kube-prometheus-stack) that
  stay Progressing on KWOK (kubelet never comes up) but are genuinely
  synced (sync.status==Synced sufficient)
- Bump timeout from 5m to 8m to match KWOK_FLUX_SYNC_TIMEOUT and give
  the more honest gate adequate time without racing

See ADR-010 "Sync Gate: All-Resources Semantics" and issue NVIDIA#1061 for
operationState.phase rationale.
@rsd-darshan rsd-darshan requested a review from a team as a code owner June 21, 2026 08:49
@copy-pr-bot

copy-pr-bot Bot commented Jun 21, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions

Copy link
Copy Markdown
Contributor

Welcome to AICR, @rsd-darshan! Thanks for your first pull request.

Before review, please ensure:

  • All commits are signed off per the DCO
  • CI checks pass (tests, lint, security scan)
  • The PR description explains the why behind your changes

A maintainer will review this soon.

@coderabbitai

coderabbitai Bot commented Jun 21, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

The KWOK Argo CD sync Chainsaw test and its supporting documentation and scripts receive coordinated updates. The assertion timeout is raised from 5 minutes to 8 minutes in the test, with the corresponding KWOK_ARGOCD_SYNC_TIMEOUT default increased from 300 to 480 seconds in the validation script and documentation. A new step, assert-root-app-succeeded, is inserted before the fleet check; it blocks progression until the root Argo CD Application reports status.operationState.phase == Succeeded, with diagnostic dumps on failure. The existing assert-all-applications-pass step is rewritten to use an error: block with an inverted predicate (DeMorgan negation), so the test fails when any Application in the argocd namespace falls outside the four permitted sync/health terminal states; the per-application operationState.phase == Succeeded requirement is removed.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related issues

Possibly related PRs

  • NVIDIA/aicr#1390: Updates the shared tests/chainsaw/kwok/argocd-sync/chainsaw-test.yaml sync gate logic that is directly used by the Argo CD reconciliation gating alongside that PR's argocd-git-sync/argocd-git lane changes.

Suggested labels

area/ci, area/docs

Suggested reviewers

  • mchmarny
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically describes the main changes: fixing the argocd-sync gate test to use all-semantics and handle health-gated applications.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description check ✅ Passed The pull request description clearly relates to the changeset, explaining the fixes for issue #1388, detailing specific changes to test timeouts, assertion logic, and configuration defaults.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/chainsaw/kwok/argocd-sync/chainsaw-test.yaml`:
- Around line 132-175: The header comment for the assert-all-applications-pass
step describes the old predicate that includes operationState=Succeeded for arms
3 and 4, but the actual implementation at line 175 and the description at lines
155-159 have been updated to only require sync and health states. Update the
comment bullets describing arm 3 (OutOfSync + Healthy) and arm 4 (Synced +
Degraded) to remove the operationState=Succeeded references so they match the
current implementation.
- Around line 67-70: Update the inline comment at line 69 to reference the
correct environment variable name `KWOK_ARGOCD_SYNC_TIMEOUT` instead of
`KWOK_FLUX_SYNC_TIMEOUT`, as the test is for Argo CD not Flux. Additionally,
verify that the timeout values are coordinated between the YAML assert timeout
(currently 8m) and the `--assert-timeout` flag passed from the validation script
which defaults to 300 seconds (5 minutes). Either set `KWOK_ARGOCD_SYNC_TIMEOUT`
environment variable to 480 seconds to match the 8-minute intent, or update the
YAML timeout value to match the script's default of 5 minutes, and document
which timeout is actually being used in the test execution.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: c5694d79-86b4-46b6-9b2a-16cb4313e4aa

📥 Commits

Reviewing files that changed from the base of the PR and between eea2a81 and 1aa3bfe.

📒 Files selected for processing (1)
  • tests/chainsaw/kwok/argocd-sync/chainsaw-test.yaml

Comment thread tests/chainsaw/kwok/argocd-sync/chainsaw-test.yaml Outdated
Comment thread tests/chainsaw/kwok/argocd-sync/chainsaw-test.yaml
Explicitly note that operationState.phase==Succeeded is omitted from
per-child checks (only validated on root) to avoid hanging on health-gated
applications like kube-prometheus-stack that stay op=Running forever on KWOK
(kubelet never comes up).
The argocd-git-sync test is a sibling of argocd-sync that MUST stay
byte-identical in assert-root-app-present and assert-all-applications-pass
steps (SYNC NOTE in both files). Apply the same all-semantics fix:

- Add staged assert-root-app-succeeded step
- Switch from assert: to inverted-polarity error:
- Remove operationState.phase==Succeeded from per-child check
- Update timeout from 5m to 8m to match argocd-sync
- Update header comments to reflect new semantics

Ensures both lanes pass only when ALL Applications reach terminal state,
not just any one, and handles health-gated applications that hang on KWOK.
@github-actions github-actions Bot added size/L and removed size/M labels Jun 21, 2026
KWOK_ARGOCD_SYNC_TIMEOUT defaulted to 300s while the chainsaw test's
spec.timeouts.assert was bumped to 8m (480s). Since --assert-timeout
overrides the YAML value at invocation, the env var default was the
one actually in effect, silently undoing the 8m budget the all-semantics
gate needs. Bump the default to 480 and fix a stale comment that
referenced KWOK_FLUX_SYNC_TIMEOUT instead of KWOK_ARGOCD_SYNC_TIMEOUT.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@kwok/scripts/validate-scheduling.sh`:
- Around line 1890-1891: The usage() function's help text for the --deployer
option is missing the argocd-git deployer from the list of supported values,
causing the documentation to be inconsistent with the actual accepted values
validated in the script. Update the deployer description in the usage() function
to include argocd-git in the comma-separated list of supported deployers (helm,
argocd-oci, argocd-helm-oci, argocd-git, flux-oci, flux-git) so the help text
accurately reflects all available options.
- Around line 448-450: The required dependencies array at line 450 only adds git
when DEPLOYER is flux-git, but argocd-git also executes git commands in the
push_bundle_to_gitea function, which will fail with a unclear command-not-found
error if git is missing. Add an additional condition to also append git to the
required array when DEPLOYER equals argocd-git, ensuring both flux-git and
argocd-git deployments have git as a preflight requirement check.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: b916f20b-c35d-4509-973f-477d263b5ee3

📥 Commits

Reviewing files that changed from the base of the PR and between 1aa3bfe and 87b7b41.

📒 Files selected for processing (2)
  • docs/contributor/tests.md
  • kwok/scripts/validate-scheduling.sh

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Inline review comments failed to post. This is likely due to GitHub's internal server error or limits when posting large numbers of comments. If you are seeing this consistently it is likely a permissions issue. Please check "Moderation" -> "Code review limits" under your organization settings.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@kwok/scripts/validate-scheduling.sh`:
- Around line 1890-1891: The usage() function's help text for the --deployer
option is missing the argocd-git deployer from the list of supported values,
causing the documentation to be inconsistent with the actual accepted values
validated in the script. Update the deployer description in the usage() function
to include argocd-git in the comma-separated list of supported deployers (helm,
argocd-oci, argocd-helm-oci, argocd-git, flux-oci, flux-git) so the help text
accurately reflects all available options.
- Around line 448-450: The required dependencies array at line 450 only adds git
when DEPLOYER is flux-git, but argocd-git also executes git commands in the
push_bundle_to_gitea function, which will fail with a unclear command-not-found
error if git is missing. Add an additional condition to also append git to the
required array when DEPLOYER equals argocd-git, ensuring both flux-git and
argocd-git deployments have git as a preflight requirement check.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: b916f20b-c35d-4509-973f-477d263b5ee3

📥 Commits

Reviewing files that changed from the base of the PR and between 1aa3bfe and 87b7b41.

📒 Files selected for processing (2)
  • docs/contributor/tests.md
  • kwok/scripts/validate-scheduling.sh
🛑 Comments failed to post (2)
kwok/scripts/validate-scheduling.sh (2)

448-450: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add git to dependency checks for argocd-git too.

Line 450 only gates git for flux-git, but argocd-git also executes git in push_bundle_to_gitea() (Line 690+). This can fail late with command not found instead of a clear preflight error.

Suggested patch
-    [[ "$DEPLOYER" == "flux-git" ]] && required+=(git)
+    [[ "$DEPLOYER" == "flux-git" || "$DEPLOYER" == "argocd-git" ]] && required+=(git)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@kwok/scripts/validate-scheduling.sh` around lines 448 - 450, The required
dependencies array at line 450 only adds git when DEPLOYER is flux-git, but
argocd-git also executes git commands in the push_bundle_to_gitea function,
which will fail with a unclear command-not-found error if git is missing. Add an
additional condition to also append git to the required array when DEPLOYER
equals argocd-git, ensuring both flux-git and argocd-git deployments have git as
a preflight requirement check.

1890-1891: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

usage() omits argocd-git from supported deployers.

The help text is now out of sync with the accepted values in Line 1943, which can mislead users invoking the new lane.

Suggested patch
-  --deployer <name>       One of: helm (default), argocd-oci, argocd-helm-oci,
-                          flux-oci, flux-git.
+  --deployer <name>       One of: helm (default), argocd-oci, argocd-helm-oci,
+                          argocd-git, flux-oci, flux-git.
@@
   validate-scheduling.sh --deployer argocd-oci eks-training
   validate-scheduling.sh --deployer argocd-helm-oci eks-training
+  validate-scheduling.sh --deployer argocd-git eks-training
   validate-scheduling.sh --deployer flux-oci eks-training
   validate-scheduling.sh --deployer flux-git eks-training

Also applies to: 1893-1899

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@kwok/scripts/validate-scheduling.sh` around lines 1890 - 1891, The usage()
function's help text for the --deployer option is missing the argocd-git
deployer from the list of supported values, causing the documentation to be
inconsistent with the actual accepted values validated in the script. Update the
deployer description in the usage() function to include argocd-git in the
comma-separated list of supported deployers (helm, argocd-oci, argocd-helm-oci,
argocd-git, flux-oci, flux-git) so the help text accurately reflects all
available options.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant