fix(test): argocd-sync gate: use all-semantics, handle health-gated apps#1397
fix(test): argocd-sync gate: use all-semantics, handle health-gated apps#1397rsd-darshan wants to merge 5 commits into
Conversation
Replace exist-semantics assert with inverted-polarity error: to enforce that ALL Applications reach terminal-pass state, not just any one. Fixes issue NVIDIA#1288 argocd half (flux half was NVIDIA#1290). Changes: - Add staged assert-root-app-succeeded step that gates on operationState.phase==Succeeded before fleet sweep, closing apply-time race where kubectl apply creates root with empty status - Switch assert-all-applications-pass from assert: to error: with inverted predicate, achieving "all Applications pass" vs "at least one passes" - Remove operationState.phase==Succeeded from per-child predicate to handle health-gated applications (e.g., kube-prometheus-stack) that stay Progressing on KWOK (kubelet never comes up) but are genuinely synced (sync.status==Synced sufficient) - Bump timeout from 5m to 8m to match KWOK_FLUX_SYNC_TIMEOUT and give the more honest gate adequate time without racing See ADR-010 "Sync Gate: All-Resources Semantics" and issue NVIDIA#1061 for operationState.phase rationale.
|
Welcome to AICR, @rsd-darshan! Thanks for your first pull request. Before review, please ensure:
A maintainer will review this soon. |
📝 WalkthroughWalkthroughThe KWOK Argo CD sync Chainsaw test and its supporting documentation and scripts receive coordinated updates. The assertion timeout is raised from 5 minutes to 8 minutes in the test, with the corresponding KWOK_ARGOCD_SYNC_TIMEOUT default increased from 300 to 480 seconds in the validation script and documentation. A new step, Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related issues
Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/chainsaw/kwok/argocd-sync/chainsaw-test.yaml`:
- Around line 132-175: The header comment for the assert-all-applications-pass
step describes the old predicate that includes operationState=Succeeded for arms
3 and 4, but the actual implementation at line 175 and the description at lines
155-159 have been updated to only require sync and health states. Update the
comment bullets describing arm 3 (OutOfSync + Healthy) and arm 4 (Synced +
Degraded) to remove the operationState=Succeeded references so they match the
current implementation.
- Around line 67-70: Update the inline comment at line 69 to reference the
correct environment variable name `KWOK_ARGOCD_SYNC_TIMEOUT` instead of
`KWOK_FLUX_SYNC_TIMEOUT`, as the test is for Argo CD not Flux. Additionally,
verify that the timeout values are coordinated between the YAML assert timeout
(currently 8m) and the `--assert-timeout` flag passed from the validation script
which defaults to 300 seconds (5 minutes). Either set `KWOK_ARGOCD_SYNC_TIMEOUT`
environment variable to 480 seconds to match the 8-minute intent, or update the
YAML timeout value to match the script's default of 5 minutes, and document
which timeout is actually being used in the test execution.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: c5694d79-86b4-46b6-9b2a-16cb4313e4aa
📒 Files selected for processing (1)
tests/chainsaw/kwok/argocd-sync/chainsaw-test.yaml
Explicitly note that operationState.phase==Succeeded is omitted from per-child checks (only validated on root) to avoid hanging on health-gated applications like kube-prometheus-stack that stay op=Running forever on KWOK (kubelet never comes up).
The argocd-git-sync test is a sibling of argocd-sync that MUST stay byte-identical in assert-root-app-present and assert-all-applications-pass steps (SYNC NOTE in both files). Apply the same all-semantics fix: - Add staged assert-root-app-succeeded step - Switch from assert: to inverted-polarity error: - Remove operationState.phase==Succeeded from per-child check - Update timeout from 5m to 8m to match argocd-sync - Update header comments to reflect new semantics Ensures both lanes pass only when ALL Applications reach terminal state, not just any one, and handles health-gated applications that hang on KWOK.
KWOK_ARGOCD_SYNC_TIMEOUT defaulted to 300s while the chainsaw test's spec.timeouts.assert was bumped to 8m (480s). Since --assert-timeout overrides the YAML value at invocation, the env var default was the one actually in effect, silently undoing the 8m budget the all-semantics gate needs. Bump the default to 480 and fix a stale comment that referenced KWOK_FLUX_SYNC_TIMEOUT instead of KWOK_ARGOCD_SYNC_TIMEOUT.
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@kwok/scripts/validate-scheduling.sh`:
- Around line 1890-1891: The usage() function's help text for the --deployer
option is missing the argocd-git deployer from the list of supported values,
causing the documentation to be inconsistent with the actual accepted values
validated in the script. Update the deployer description in the usage() function
to include argocd-git in the comma-separated list of supported deployers (helm,
argocd-oci, argocd-helm-oci, argocd-git, flux-oci, flux-git) so the help text
accurately reflects all available options.
- Around line 448-450: The required dependencies array at line 450 only adds git
when DEPLOYER is flux-git, but argocd-git also executes git commands in the
push_bundle_to_gitea function, which will fail with a unclear command-not-found
error if git is missing. Add an additional condition to also append git to the
required array when DEPLOYER equals argocd-git, ensuring both flux-git and
argocd-git deployments have git as a preflight requirement check.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: b916f20b-c35d-4509-973f-477d263b5ee3
📒 Files selected for processing (2)
docs/contributor/tests.mdkwok/scripts/validate-scheduling.sh
There was a problem hiding this comment.
Caution
Inline review comments failed to post. This is likely due to GitHub's internal server error or limits when posting large numbers of comments. If you are seeing this consistently it is likely a permissions issue. Please check "Moderation" -> "Code review limits" under your organization settings.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@kwok/scripts/validate-scheduling.sh`:
- Around line 1890-1891: The usage() function's help text for the --deployer
option is missing the argocd-git deployer from the list of supported values,
causing the documentation to be inconsistent with the actual accepted values
validated in the script. Update the deployer description in the usage() function
to include argocd-git in the comma-separated list of supported deployers (helm,
argocd-oci, argocd-helm-oci, argocd-git, flux-oci, flux-git) so the help text
accurately reflects all available options.
- Around line 448-450: The required dependencies array at line 450 only adds git
when DEPLOYER is flux-git, but argocd-git also executes git commands in the
push_bundle_to_gitea function, which will fail with a unclear command-not-found
error if git is missing. Add an additional condition to also append git to the
required array when DEPLOYER equals argocd-git, ensuring both flux-git and
argocd-git deployments have git as a preflight requirement check.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: b916f20b-c35d-4509-973f-477d263b5ee3
📒 Files selected for processing (2)
docs/contributor/tests.mdkwok/scripts/validate-scheduling.sh
🛑 Comments failed to post (2)
kwok/scripts/validate-scheduling.sh (2)
448-450:
⚠️ Potential issue | 🟠 Major | ⚡ Quick winAdd
gitto dependency checks forargocd-gittoo.Line 450 only gates
gitforflux-git, butargocd-gitalso executesgitinpush_bundle_to_gitea()(Line 690+). This can fail late withcommand not foundinstead of a clear preflight error.Suggested patch
- [[ "$DEPLOYER" == "flux-git" ]] && required+=(git) + [[ "$DEPLOYER" == "flux-git" || "$DEPLOYER" == "argocd-git" ]] && required+=(git)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@kwok/scripts/validate-scheduling.sh` around lines 448 - 450, The required dependencies array at line 450 only adds git when DEPLOYER is flux-git, but argocd-git also executes git commands in the push_bundle_to_gitea function, which will fail with a unclear command-not-found error if git is missing. Add an additional condition to also append git to the required array when DEPLOYER equals argocd-git, ensuring both flux-git and argocd-git deployments have git as a preflight requirement check.
1890-1891:
⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
usage()omitsargocd-gitfrom supported deployers.The help text is now out of sync with the accepted values in Line 1943, which can mislead users invoking the new lane.
Suggested patch
- --deployer <name> One of: helm (default), argocd-oci, argocd-helm-oci, - flux-oci, flux-git. + --deployer <name> One of: helm (default), argocd-oci, argocd-helm-oci, + argocd-git, flux-oci, flux-git. @@ validate-scheduling.sh --deployer argocd-oci eks-training validate-scheduling.sh --deployer argocd-helm-oci eks-training + validate-scheduling.sh --deployer argocd-git eks-training validate-scheduling.sh --deployer flux-oci eks-training validate-scheduling.sh --deployer flux-git eks-trainingAlso applies to: 1893-1899
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@kwok/scripts/validate-scheduling.sh` around lines 1890 - 1891, The usage() function's help text for the --deployer option is missing the argocd-git deployer from the list of supported values, causing the documentation to be inconsistent with the actual accepted values validated in the script. Update the deployer description in the usage() function to include argocd-git in the comma-separated list of supported deployers (helm, argocd-oci, argocd-helm-oci, argocd-git, flux-oci, flux-git) so the help text accurately reflects all available options.
Summary
Fixes issue #1388. Replaces exist-semantics assert with inverted-polarity error: to enforce that ALL Applications reach terminal-pass state, not just any one.
Changes
KWOK_ARGOCD_SYNC_TIMEOUTdefault from 300 to 480 invalidate-scheduling.sh— this env var feeds--assert-timeout, which overridesspec.timeouts.assertat invocation, so the YAML's 8m bump was a no-op until the script's default was raised tooRoot Cause
The original code used a name-less
assert:block that passed when ANY Application matched the predicate (exist-semantics). Two failure modes emerged:Test Coverage
tests/chainsaw/kwok/argocd-sync/chainsaw-test.yaml,tests/chainsaw/kwok/argocd-git-sync/chainsaw-test.yaml,kwok/scripts/validate-scheduling.sh,docs/contributor/tests.mdReferences