
fix: bound retry metadata growth in queue job serializer (#3947) #1

Open
neuralmint wants to merge 1 commit into main from fix/bound-retry-metadata-growth

@neuralmint
Owner

Summary

Fixes #3947: bounds retry metadata growth when repeated failures occur while an agent run, task, or handler is changing lifecycle state.

Problem

TaskScheduler.fail() incremented task["retries"] with no upper bound. When the repeated-failures path was exercised, retry metadata could grow without limit, and the component could then accept stale, duplicate, or policy-violating lifecycle transitions. Impact: jobs could be duplicated, starved, lost, or delivered under the wrong conditions.

Fix

  1. Hard cap on retry metadata (MAX_RETRY_METADATA = 100): prevents unbounded growth. fail() caps the increment; if already at the cap, it remains there (idempotent).
  2. Dead letter store: tasks that exceed max_retries or the hard cap are moved to _dead_letter instead of being silently dropped. Logs explain the decision.
  3. enqueue invariant enforcement: enqueue() now rejects tasks whose retry count already exceeds the hard cap, returning None.
  4. Idempotent retry path: added preserve_retries parameter to enqueue() so that the fail-re-enqueue cycle preserves retry metadata instead of resetting it.
  5. Scheduled task promotion (dequeue) also respects the hard cap.
  6. Logging: all retry/rejection decisions are logged at appropriate levels (info for retries, warning for permanent failures).
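
For readers without the diff open, the following is a minimal sketch of how items 1 to 4 and 6 could fit together. It is not the code in this PR: only TaskScheduler, fail(), enqueue(), preserve_retries, max_retries, MAX_RETRY_METADATA and _dead_letter come from the description above. The queue layout, the return values, the dequeue() helper, the exact comparison operators and the choice to key the dead letter store by task id are assumptions made for illustration. Scheduled task promotion (item 5) is left out.

```python
import logging
from collections import deque

logger = logging.getLogger("scheduler_sketch")

MAX_RETRY_METADATA = 100  # hard cap named in the PR description


class TaskScheduler:
    """Illustrative stand-in; the real class in the repository may differ."""

    def __init__(self, max_retries=3):
        self.max_retries = max_retries
        self._queue = deque()   # assumed FIFO of task dicts
        self._dead_letter = {}  # assumed: keyed by task id so repeats are idempotent

    def enqueue(self, task, preserve_retries=False):
        retries = task.get("retries", 0)
        # Reject tasks whose retry count is already past the hard cap.
        if retries > MAX_RETRY_METADATA:
            logger.warning("rejected task %s: retries=%d past hard cap",
                           task["id"], retries)
            return None
        # Keep the counter on the retry path, reset it for fresh submissions.
        task["retries"] = retries if preserve_retries else 0
        self._queue.append(task)
        return task["id"]

    def dequeue(self):
        # Hypothetical helper: returns the next task, or None when empty.
        return self._queue.popleft() if self._queue else None

    def fail(self, task):
        # Capped increment: a task already at the cap stays at the cap.
        retries = min(task.get("retries", 0) + 1, MAX_RETRY_METADATA)
        task["retries"] = retries
        if retries > self.max_retries or retries >= MAX_RETRY_METADATA:
            # Permanent failure: keep the task in dead letter, do not drop it.
            self._dead_letter[task["id"]] = task
            logger.warning("task %s dead-lettered after %d retries",
                           task["id"], retries)
            return False
        logger.info("task %s re-enqueued, retries=%d", task["id"], retries)
        self.enqueue(task, preserve_retries=True)
        return True
```

In this sketch, with the default max_retries of 3, a task that fails every time it is dequeued is re-enqueued three times and moved to the dead letter store on the fourth failure.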

Acceptance Criteria

  • Deterministic regression test covers the repeated-failures trigger
  • Queue job serializer rejects invalid transitions and preserves expected lifecycle state
  • Logs explain decisions without exposing private runtime data
  • Existing test suite passes unchanged
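
One way the logging criterion could be met is to build log lines from an allow-list of task fields, so payload data never reaches the log. The sketch below only illustrates that idea. SAFE_FIELDS, log_decision, ListHandler and the "payload" field are hypothetical names, not taken from this PR.

```python
import logging

logger = logging.getLogger("scheduler_log_sketch")
logger.setLevel(logging.INFO)

# Hypothetical allow-list: only these task fields may appear in log lines.
SAFE_FIELDS = ("id", "retries")


def log_decision(task, decision, level=logging.INFO):
    """Log a retry decision using allow-listed task fields only."""
    safe = {key: task[key] for key in SAFE_FIELDS if key in task}
    logger.log(level, "decision=%s task=%s", decision, safe)


class ListHandler(logging.Handler):
    """Collects formatted messages in memory so a test can inspect them."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


handler = ListHandler()
logger.addHandler(handler)

# The payload is present on the task but is never copied into the log line.
log_decision(
    {"id": "t1", "retries": 4, "payload": {"token": "secret"}},
    "dead_letter",
    level=logging.WARNING,
)
```

A regression test can then assert both that the decision is explained (the decision name and retry count appear) and that no payload value leaks into the captured messages.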

Proof

$ pytest tests/test_scheduler.py -v
========================== 10 passed in 0.13s ==========================

Closes #3947

Bounty #3947 — Bound retry metadata growth on repeated failures.

Changes:
- Added MAX_RETRY_METADATA hard cap (100) to prevent unbounded retry
  counter growth.
- Added dead_letter store for permanently failed tasks.
- fail() now enforces the repeated-failures invariant before re-enqueueing:
  tasks that exceed max_retries or the hard cap go to dead letter instead.
- enqueue() rejects tasks past the hard cap and returns None for the
  caller to handle.
- Added preserve_retries parameter to enqueue() so retry metadata is
  preserved during re-enqueue (idempotent retry path).
- Scheduled task promotion (dequeue) also respects the hard cap.
- Added detailed logging for all retry/rejection decisions.
- Backward compatible: default max_retries remains 3; existing callers
  unaffected.
- Regression tests cover: repeated-failures trigger, metadata bound,
  idempotent fail, dead-letter isolation, exhausted enqueue rejection.

Development

Successfully merging this pull request may close these issues.

[ Bounty $2k ] [ Queue ] Bound retry metadata growth — repeated failures
