fix: stop re-running restore and one-shot backup jobs after completion#32
Merged
Conversation
A finished Job has status.active=0, which is_job_running() read as "no job running", so every 5-minute requeue created a fresh restore Job — and the completion check ran only after the creation step, so the duplicate was already spawned by the time the status flipped to RestoreComplete. Terminally failed Jobs were re-created forever, and the one-shot KafkaBackup path had the identical defect. - Add jobs::job_state::classify_jobs(): aggregate the full set of Jobs for a CR into Succeeded/InProgress/Failed/NoJobs using Job conditions (Complete/Failed), so pod retries within backoffLimit stay InProgress and a pending Job with no status yet blocks duplicate creation. - Gate Job creation on that classification: restores never create a second Job; one-shot backups only re-run via the manual trigger annotation, which still never stacks onto an active Job. - Watch owned Jobs (.owns) in both controllers so status reflects completion/failure within seconds instead of the next requeue. - Report a terminal RestoreFailed condition for failed restore Jobs instead of silently re-creating them. - Make status patches idempotent (Job-derived timestamps, skip when the condition/lastBackup already reflects the outcome) so the new watch cannot churn lastTransitionTime in a reconcile loop. Fixes #29 Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #29.
Root cause
Three interacting defects:
is_job_running()only counted Jobs withstatus.active > 0. A completed Job hasactive=0, so the reconciler read it as "no job running" and created a fresh timestamped Job on every requeue.RestoreComplete.RestoreRunninglong after the Job completed.Successful restores were bounded at one duplicate (the
RestoreCompleteguard then kicks in), but a terminally failed restore Job was re-created every 5 minutes forever, and the one-shotKafkaBackuppath had the identical defect. For a real (non-dryRun) restore the duplicate re-applies the restore to the target cluster.Fix
jobs::job_state::classify_jobs()aggregates the full set of Jobs for a CR intoSucceeded/InProgress/Failed/NoJobs, using Job conditions (Complete/Failed) so pod retries withinbackoffLimitstayInProgress, and a just-created Job with no status yet blocks duplicate creation.kafkabackup.com/trigger=nowannotation (which still never stacks onto an active Job)..owns()their Jobs, so completion/failure updates the CR status within seconds instead of after the next periodic requeue. RBAC already grantswatchonbatch/jobs; no CRD changes.RestoreFailedcondition instead of being silently re-created.lastTransitionTimein a reconcile loop.Verification
Reproduced the issue first on a kind cluster with the v0.2.8 operator and a stub job image: Job completed at +0:29, status stuck at
RestoreRunning, duplicate Job created at exactly +5:00 — matching the report byte-for-byte.job_statetests were written against the old decision logic first — 6 of 11 failed (succeeded/failed/pending Jobs all invisible) — then the classification was fixed; all 31 tests now pass, plus clippy/fmt/helm template clean.RestoreCompleted~0.5s after Job completion; no duplicate at +5/+9 minRestoreRunning; terminalReady=False/RestoreFailed+Error=True/RestoreFailedafter backoffLimit; no re-creationBackupCompletedwithin seconds, no re-run🤖 Generated with Claude Code