Skip to content

fix: stop re-running restore and one-shot backup jobs after completion#32

Merged
sionsmith merged 1 commit into
mainfrom
fix/issue-29-restore-rerun
Jun 12, 2026
Merged

fix: stop re-running restore and one-shot backup jobs after completion#32
sionsmith merged 1 commit into
mainfrom
fix/issue-29-restore-rerun

Conversation

@sionsmith

Copy link
Copy Markdown
Contributor

Fixes #29.

Root cause

Three interacting defects:

  1. is_job_running() only counted Jobs with status.active > 0. A completed Job has active=0, so the reconciler read it as "no job running" and created a fresh timestamped Job on every requeue.
  2. Job creation ran before the completion check, so on the first reconcile after the Job finished, the duplicate was created before the status ever flipped to RestoreComplete.
  3. The controllers did not watch Jobs and requeued every 300s — which is why the reporter's duplicate Jobs are exactly 5 minutes apart, and why the status sat at RestoreRunning long after the Job completed.

Successful restores were bounded at one duplicate (the RestoreComplete guard then kicks in), but a terminally failed restore Job was re-created every 5 minutes forever, and the one-shot KafkaBackup path had the identical defect. For a real (non-dryRun) restore the duplicate re-applies the restore to the target cluster.

Fix

  • New jobs::job_state::classify_jobs() aggregates the full set of Jobs for a CR into Succeeded / InProgress / Failed / NoJobs, using Job conditions (Complete/Failed) so pod retries within backoffLimit stay InProgress, and a just-created Job with no status yet blocks duplicate creation.
  • Job creation is gated on that classification: restores are strictly one-shot; one-shot backups only re-run via the kafkabackup.com/trigger=now annotation (which still never stacks onto an active Job).
  • Both controllers now .owns() their Jobs, so completion/failure updates the CR status within seconds instead of after the next periodic requeue. RBAC already grants watch on batch/jobs; no CRD changes.
  • Terminally failed restore Jobs now report a RestoreFailed condition instead of being silently re-created.
  • Status patches are idempotent (Job-derived timestamps, skip when the outcome is already recorded) so the new watch cannot churn lastTransitionTime in a reconcile loop.

Verification

Reproduced the issue first on a kind cluster with the v0.2.8 operator and a stub job image: Job completed at +0:29, status stuck at RestoreRunning, duplicate Job created at exactly +5:00 — matching the report byte-for-byte.

  • TDD: the new job_state tests were written against the old decision logic first — 6 of 11 failed (succeeded/failed/pending Jobs all invisible) — then the classification was fixed; all 31 tests now pass, plus clippy/fmt/helm template clean.
  • E2E with the fixed operator against the same cluster, watched for 8.5 minutes (past two requeue cycles):
    • success case: exactly 1 Job; status RestoreCompleted ~0.5s after Job completion; no duplicate at +5/+9 min
    • failure case: exactly 1 Job; pod retries stayed RestoreRunning; terminal Ready=False/RestoreFailed + Error=True/RestoreFailed after backoffLimit; no re-creation
    • one-shot backup: exactly 1 Job, BackupCompleted within seconds, no re-run

🤖 Generated with Claude Code

A finished Job has status.active=0, which is_job_running() read as "no
job running", so every 5-minute requeue created a fresh restore Job —
and the completion check ran only after the creation step, so the
duplicate was already spawned by the time the status flipped to
RestoreComplete. Terminally failed Jobs were re-created forever, and
the one-shot KafkaBackup path had the identical defect.

- Add jobs::job_state::classify_jobs(): aggregate the full set of Jobs
  for a CR into Succeeded/InProgress/Failed/NoJobs using Job conditions
  (Complete/Failed), so pod retries within backoffLimit stay InProgress
  and a pending Job with no status yet blocks duplicate creation.
- Gate Job creation on that classification: restores never create a
  second Job; one-shot backups only re-run via the manual trigger
  annotation, which still never stacks onto an active Job.
- Watch owned Jobs (.owns) in both controllers so status reflects
  completion/failure within seconds instead of the next requeue.
- Report a terminal RestoreFailed condition for failed restore Jobs
  instead of silently re-creating them.
- Make status patches idempotent (Job-derived timestamps, skip when the
  condition/lastBackup already reflects the outcome) so the new watch
  cannot churn lastTransitionTime in a reconcile loop.

Fixes #29

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@sionsmith sionsmith merged commit 017d23b into main Jun 12, 2026
7 checks passed
@sionsmith sionsmith mentioned this pull request Jun 12, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

KafkaRestore re-running

1 participant