fix: stop re-running restore and one-shot backup jobs after completion by sionsmith · Pull Request #32 · osodevops/strimzi-backup-operator

sionsmith · 2026-06-12T05:56:47Z

Fixes #29.

Root cause

Three interacting defects:

is_job_running() only counted Jobs with status.active > 0. A completed Job has active=0, so the reconciler read it as "no job running" and created a fresh timestamped Job on every requeue.
Job creation ran before the completion check, so on the first reconcile after the Job finished, the duplicate was created before the status ever flipped to RestoreComplete.
The controllers did not watch Jobs and requeued every 300s — which is why the reporter's duplicate Jobs are exactly 5 minutes apart, and why the status sat at RestoreRunning long after the Job completed.

Successful restores were bounded at one duplicate (the RestoreComplete guard then kicks in), but a terminally failed restore Job was re-created every 5 minutes forever, and the one-shot KafkaBackup path had the identical defect. For a real (non-dryRun) restore the duplicate re-applies the restore to the target cluster.

Fix

New jobs::job_state::classify_jobs() aggregates the full set of Jobs for a CR into Succeeded / InProgress / Failed / NoJobs, using Job conditions (Complete/Failed) so pod retries within backoffLimit stay InProgress, and a just-created Job with no status yet blocks duplicate creation.
Job creation is gated on that classification: restores are strictly one-shot; one-shot backups only re-run via the kafkabackup.com/trigger=now annotation (which still never stacks onto an active Job).
Both controllers now .owns() their Jobs, so completion/failure updates the CR status within seconds instead of after the next periodic requeue. RBAC already grants watch on batch/jobs; no CRD changes.
Terminally failed restore Jobs now report a RestoreFailed condition instead of being silently re-created.
Status patches are idempotent (Job-derived timestamps, skip when the outcome is already recorded) so the new watch cannot churn lastTransitionTime in a reconcile loop.

Verification

Reproduced the issue first on a kind cluster with the v0.2.8 operator and a stub job image: Job completed at +0:29, status stuck at RestoreRunning, duplicate Job created at exactly +5:00 — matching the report byte-for-byte.

TDD: the new job_state tests were written against the old decision logic first — 6 of 11 failed (succeeded/failed/pending Jobs all invisible) — then the classification was fixed; all 31 tests now pass, plus clippy/fmt/helm template clean.
E2E with the fixed operator against the same cluster, watched for 8.5 minutes (past two requeue cycles):
- success case: exactly 1 Job; status RestoreCompleted ~0.5s after Job completion; no duplicate at +5/+9 min
- failure case: exactly 1 Job; pod retries stayed RestoreRunning; terminal Ready=False/RestoreFailed + Error=True/RestoreFailed after backoffLimit; no re-creation
- one-shot backup: exactly 1 Job, BackupCompleted within seconds, no re-run

🤖 Generated with Claude Code

A finished Job has status.active=0, which is_job_running() read as "no job running", so every 5-minute requeue created a fresh restore Job — and the completion check ran only after the creation step, so the duplicate was already spawned by the time the status flipped to RestoreComplete. Terminally failed Jobs were re-created forever, and the one-shot KafkaBackup path had the identical defect. - Add jobs::job_state::classify_jobs(): aggregate the full set of Jobs for a CR into Succeeded/InProgress/Failed/NoJobs using Job conditions (Complete/Failed), so pod retries within backoffLimit stay InProgress and a pending Job with no status yet blocks duplicate creation. - Gate Job creation on that classification: restores never create a second Job; one-shot backups only re-run via the manual trigger annotation, which still never stacks onto an active Job. - Watch owned Jobs (.owns) in both controllers so status reflects completion/failure within seconds instead of the next requeue. - Report a terminal RestoreFailed condition for failed restore Jobs instead of silently re-creating them. - Make status patches idempotent (Job-derived timestamps, skip when the condition/lastBackup already reflects the outcome) so the new watch cannot churn lastTransitionTime in a reconcile loop. Fixes #29 Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

sionsmith mentioned this pull request Jun 12, 2026

fix: Background-propagation cleanup (#30) + configurable backoffLimit (#31) #33

Merged

sionsmith merged commit 017d23b into main Jun 12, 2026
7 checks passed

sionsmith mentioned this pull request Jun 12, 2026

KafkaRestore re-running #29

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: stop re-running restore and one-shot backup jobs after completion#32

fix: stop re-running restore and one-shot backup jobs after completion#32
sionsmith merged 1 commit into
mainfrom
fix/issue-29-restore-rerun

sionsmith commented Jun 12, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

sionsmith commented Jun 12, 2026

Root cause

Fix

Verification

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant