fix: bound retry metadata growth in queue job serializer (#3947) #1
Open
neuralmint wants to merge 1 commit into
Bounty #3947: Bound retry metadata growth on repeated failures.

Changes:
- Added MAX_RETRY_METADATA hard cap (100) to prevent unbounded retry counter growth.
- Added dead_letter store for permanently failed tasks.
- fail() now enforces the repeated-failures invariant before re-enqueueing: tasks that exceed max_retries or the hard cap go to dead letter instead.
- enqueue() rejects tasks past the hard cap and returns None for the caller to handle.
- Added preserve_retries parameter to enqueue() so retry metadata is preserved during re-enqueue (idempotent retry path).
- Scheduled task promotion (dequeue) also respects the hard cap.
- Added detailed logging for all retry/rejection decisions.
- Backward compatible: default max_retries remains 3; existing callers unaffected.
- Regression tests cover: repeated-failures trigger, metadata bound, idempotent fail, dead-letter isolation, exhausted enqueue rejection.
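
For readers without the diff at hand, the behavior described in the change list can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the class body, the `deque`-backed queue, the boolean return value of `fail()`, and the exact comparisons (`>` against `max_retries`, `>=` against the hard cap) are all assumptions. Only the names `MAX_RETRY_METADATA`, `max_retries`, `preserve_retries`, `fail()`, `enqueue()`, and the dead-letter store come from the description.

```python
import logging
from collections import deque

logger = logging.getLogger("scheduler_sketch")

MAX_RETRY_METADATA = 100  # hard cap named in the PR description


class TaskScheduler:
    """Illustrative stand-in; the real class in the PR may differ."""

    def __init__(self, max_retries=3):
        self.max_retries = max_retries  # default of 3 per the description
        self._queue = deque()           # assumed storage for pending tasks
        self._dead_letter = []          # permanently failed tasks

    def enqueue(self, task, preserve_retries=False):
        retries = task.get("retries", 0)
        # Assumption: the incoming count is checked before any reset.
        if retries > MAX_RETRY_METADATA:
            logger.warning("rejecting task %s: retries=%d is past the hard cap",
                           task.get("id"), retries)
            return None
        # Keep the count on the retry path, reset it for fresh submissions.
        task["retries"] = retries if preserve_retries else 0
        self._queue.append(task)
        return task["id"]

    def dequeue(self):
        return self._queue.popleft() if self._queue else None

    def fail(self, task):
        """Return True if the task was re-enqueued, False if dead-lettered."""
        # Capped increment: a count already at the cap stays there.
        task["retries"] = min(task.get("retries", 0) + 1, MAX_RETRY_METADATA)
        exhausted = task["retries"] > self.max_retries
        at_cap = task["retries"] >= MAX_RETRY_METADATA
        if exhausted or at_cap:
            logger.warning("dead-lettering task %s after %d retries",
                           task.get("id"), task["retries"])
            self._dead_letter.append(task)
            return False
        return self.enqueue(task, preserve_retries=True) is not None


# Usage: a task that fails every time is retried three times, then parked.
scheduler = TaskScheduler()
scheduler.enqueue({"id": "job-1"})
outcomes = []
for _ in range(4):
    outcomes.append(scheduler.fail(scheduler.dequeue()))
```

Under these assumptions the fourth failure moves the task to the dead-letter list instead of the queue, so the retry counter can never grow past the point where the task stops circulating.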
Summary
Fixes #3947 — Bound retry metadata growth when repeated failures are triggered while an agent run, task, or handler is changing lifecycle state.
Problem
The `TaskScheduler.fail()` method incremented `task["retries"]` without any upper bound. When the repeated-failures path was exercised, retry metadata could grow without bound, causing the component to accept stale, duplicate, or policy-violating lifecycle transitions. Impact: jobs could be duplicated, starved, lost, or delivered under wrong conditions.
Fix
- Hard cap (`MAX_RETRY_METADATA = 100`): prevents unbounded growth. `fail()` caps the increment; if already at the cap, it remains there (idempotent).
- Tasks that exceed `max_retries` or the hard cap are moved to `_dead_letter` instead of being silently dropped. Logs explain the decision.
- `enqueue()` now rejects tasks whose retry count already exceeds the hard cap, returning `None`.
- Added a `preserve_retries` parameter to `enqueue()` so that the fail-re-enqueue cycle preserves retry metadata instead of resetting it.
Acceptance Criteria
Proof
Closes #3947