Skip to content

feat(deriver): add stale batch timeout to prevent indefinite queue stall#703

Open
maxxqf-ai wants to merge 1 commit into
plastic-labs:mainfrom
maxxqf-ai:feat/stale-batch-timeout
Open

feat(deriver): add stale batch timeout to prevent indefinite queue stall#703
maxxqf-ai wants to merge 1 commit into
plastic-labs:mainfrom
maxxqf-ai:feat/stale-batch-timeout

Conversation

@maxxqf-ai
Copy link
Copy Markdown

@maxxqf-ai maxxqf-ai commented May 20, 2026

Problem

When REPRESENTATION_BATCH_MAX_TOKENS is configured and FLUSH_ENABLED is false, work units that never accumulate enough tokens remain stuck in the queue indefinitely. This is especially problematic for short conversational sessions where individual session message tokens never reach the batch threshold.

In our deployment (Hermes agent with FLUSH at ~2K tokens but batch threshold at 4K), short sessions would produce ~2K tokens per flush — never reaching the 4K threshold. This resulted in 245 queue items accumulated over 10 days with zero processing.

Why not just use FLUSH_ENABLED=true?

FLUSH mode processes every message immediately, which produces fragmented observations lacking context. The extracted information is broken into tiny pieces with high redundancy and poor quality. The token-batch threshold exists to group related messages together for richer, more coherent observations.

The trade-off: batch mode gives better quality, but work units that never reach the threshold are never processed at all.

Solution

Add a STALE_BATCH_HOURS configuration option (default 0 = disabled) to DeriverSettings. When set to a positive value, work units whose oldest queue item exceeds this age are force-processed regardless of token count.

This provides a safety net that:

  • Preserves batch quality for active sessions (messages are still grouped up to the token threshold)
  • Ensures no work unit is abandoned forever (stale work units get processed after the timeout)
  • Is fully opt-in with a default of 0 (no behavior change for existing deployments)

Changes

  • src/config.py: Add STALE_BATCH_HOURS field to DeriverSettings with validation (0-168 hours)
  • src/deriver/queue_manager.py: Extend get_and_claim_work_units() to include stale work units via a subquery on MIN(QueueItem.created_at) when STALE_BATCH_HOURS > 0

Configuration Example

[deriver]
REPRESENTATION_BATCH_MAX_TOKENS = 2048
FLUSH_ENABLED = false
STALE_BATCH_HOURS = 5  # Force-process work units older than 5 hours

Related

Possibly related to #494 (messages ingested but derived memory does not appear automatically).

Summary by CodeRabbit

  • Configuration

    • Added STALE_BATCH_HOURS setting (0-168 hours) to control work unit processing thresholds.
  • Improvements

    • Enhanced queue management to process items based on age when batching conditions are applied.

Review Change Stack

When REPRESENTATION_BATCH_MAX_TOKENS is set and FLUSH_ENABLED is false,
work units that never accumulate enough tokens remain stuck in the queue
indefinitely. This is especially problematic for short conversational
sessions where the FLUSH size is smaller than the batch threshold.

The problem: FLUSH mode (immediate processing) produces fragmented
observations lacking context — information is broken into tiny pieces
with high redundancy. The token-batch threshold solves this by grouping
messages for richer context. However, work units that never reach the
threshold are never processed at all.

Solution: Add STALE_BATCH_HOURS config (default 0 = disabled). When set
to a positive value, work units whose oldest queue item exceeds this age
are force-processed regardless of token count. This provides a safety net
that preserves batch quality for active sessions while ensuring no work
unit is abandoned forever.

Changes:
- src/config.py: Add STALE_BATCH_HOURS field to DeriverSettings
- src/deriver/queue_manager.py: Extend get_and_claim_work_units() to
  include stale work units when STALE_BATCH_HOURS > 0
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 20, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: d1bebcb0-842d-4104-8c59-8f549de5df23

📥 Commits

Reviewing files that changed from the base of the PR and between b0f0295 and 725a982.

📒 Files selected for processing (2)
  • src/config.py
  • src/deriver/queue_manager.py

Walkthrough

This PR adds a stale batch timeout mechanism to the deriver queue manager. A new STALE_BATCH_HOURS configuration field (default 0, bounds 0–168) controls when representation work units are force-processed based on queue age, and the queue manager's batching logic is extended to query the oldest unprocessed item per work unit and force claims when that age exceeds the threshold.

Changes

Stale Batch Timeout

Layer / File(s) Summary
Stale batch hours configuration
src/config.py
DeriverSettings.STALE_BATCH_HOURS field with default 0 and bounds 0..168, representing the age threshold in hours above which representation work units are force-processed.
Queue batching with stale threshold
src/deriver/queue_manager.py
QueueManager.get_and_claim_work_units batching logic is extended with an oldest_item_subq outer join and conditional clause to force representation work unit claims when the oldest unprocessed item exceeds STALE_BATCH_HOURS, with fallback to token-count-only behavior when the threshold is non-positive.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

A rabbit hops through queue so long,
And checks: "Are items here too strong?"
When age exceeds the threshold set,
The oldest items? Time to get!
Hop — process them before they fret. 🐰⏰

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and concisely describes the main change: adding a stale batch timeout mechanism to prevent indefinite queue stalling.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@menhguin
Copy link
Copy Markdown

+1, would love to see this merged. We're hitting the exact symptom on the managed instance (api.honcho.dev).

Our context:

  • Hosted workspace, OpenClaw integration ingesting messages from chat sessions across ~6 conversational surfaces (Slack, Telegram, Signal, etc.)
  • Many sessions are short — single-digit message counts is common for one-off threads or thread-replies that route into their own session
  • Deriver processes sporadically. Over the past 24h: long stretches of zero derivations punctuated by short bursts. Some sessions <10 messages have been queued for days without a representation update.

What we tried before finding this PR:

  1. Lowered MESSAGE_THRESHOLD to 2 (suspecting per-session message count was gating). Briefly spiked throughput to ~70 derivations/hour, then stalled again — consistent with this PR's root cause: lowering the per-session threshold gets more work units onto the queue, but they still never reach REPRESENTATION_BATCH_MAX_TOKENS, so the queue manager never claims them.
  2. Considered toggling FLUSH_ENABLED=true but that gives up the token-batching efficiency — not what we want as a permanent answer.
  3. Verified the deriver worker process is running and not crashed.

Why this PR is the right shape:

  • Gated behind STALE_BATCH_HOURS > 0 (default 0 = current behavior) — strictly additive, no regression risk for existing deployments.
  • Composes with FLUSH_ENABLED cleanly: token-threshold-first, stale-timeout as a backstop.
  • The 0–168 upper bound is a sensible safeguard.

Requests, if helpful:

  1. Once merged, please consider exposing STALE_BATCH_HOURS as a per-workspace or per-instance setting on the managed api.honcho.dev deployment — currently we have no knob for this from the hosted side.
  2. Sensible default for hosted: something like 2–6 hours would catch our case without much queue churn.
  3. Slightly orthogonal but related: it would help operators if honcho.queue_status(...) surfaced the oldest unprocessed created_at per work unit, so stalled work units are diagnosable without DB access.

Happy to test against a self-hosted build with this patch applied if it helps validate before merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants