feat(deriver): add stale batch timeout to prevent indefinite queue stall by maxxqf-ai · Pull Request #703 · plastic-labs/honcho

maxxqf-ai · 2026-05-20T11:34:11Z

Problem

When REPRESENTATION_BATCH_MAX_TOKENS is configured and FLUSH_ENABLED is false, work units that never accumulate enough tokens remain stuck in the queue indefinitely. This is especially problematic for short conversational sessions where individual session message tokens never reach the batch threshold.

In our deployment (Hermes agent with FLUSH at ~2K tokens but batch threshold at 4K), short sessions would produce ~2K tokens per flush — never reaching the 4K threshold. This resulted in 245 queue items accumulated over 10 days with zero processing.

Why not just use FLUSH_ENABLED=true?

FLUSH mode processes every message immediately, which produces fragmented observations lacking context. The extracted information is broken into tiny pieces with high redundancy and poor quality. The token-batch threshold exists to group related messages together for richer, more coherent observations.

The trade-off: batch mode gives better quality, but work units that never reach the threshold are never processed at all.

Solution

Add a STALE_BATCH_HOURS configuration option (default 0 = disabled) to DeriverSettings. When set to a positive value, work units whose oldest queue item exceeds this age are force-processed regardless of token count.

This provides a safety net that:

Preserves batch quality for active sessions (messages are still grouped up to the token threshold)
Ensures no work unit is abandoned forever (stale work units get processed after the timeout)
Is fully opt-in with a default of 0 (no behavior change for existing deployments)

Changes

src/config.py: Add STALE_BATCH_HOURS field to DeriverSettings with validation (0-168 hours)
src/deriver/queue_manager.py: Extend get_and_claim_work_units() to include stale work units via a subquery on MIN(QueueItem.created_at) when STALE_BATCH_HOURS > 0

Configuration Example

[deriver]
REPRESENTATION_BATCH_MAX_TOKENS = 2048
FLUSH_ENABLED = false
STALE_BATCH_HOURS = 5  # Force-process work units older than 5 hours

Summary by CodeRabbit

Configuration
- Added STALE_BATCH_HOURS setting (0-168 hours) to control work unit processing thresholds.
Improvements
- Enhanced queue management to process items based on age when batching conditions are applied.

When REPRESENTATION_BATCH_MAX_TOKENS is set and FLUSH_ENABLED is false, work units that never accumulate enough tokens remain stuck in the queue indefinitely. This is especially problematic for short conversational sessions where the FLUSH size is smaller than the batch threshold. The problem: FLUSH mode (immediate processing) produces fragmented observations lacking context — information is broken into tiny pieces with high redundancy. The token-batch threshold solves this by grouping messages for richer context. However, work units that never reach the threshold are never processed at all. Solution: Add STALE_BATCH_HOURS config (default 0 = disabled). When set to a positive value, work units whose oldest queue item exceeds this age are force-processed regardless of token count. This provides a safety net that preserves batch quality for active sessions while ensuring no work unit is abandoned forever. Changes: - src/config.py: Add STALE_BATCH_HOURS field to DeriverSettings - src/deriver/queue_manager.py: Extend get_and_claim_work_units() to include stale work units when STALE_BATCH_HOURS > 0

coderabbitai · 2026-05-20T11:34:28Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: d1bebcb0-842d-4104-8c59-8f549de5df23

📥 Commits

Reviewing files that changed from the base of the PR and between b0f0295 and 725a982.

📒 Files selected for processing (2)

src/config.py
src/deriver/queue_manager.py

Walkthrough

This PR adds a stale batch timeout mechanism to the deriver queue manager. A new STALE_BATCH_HOURS configuration field (default 0, bounds 0–168) controls when representation work units are force-processed based on queue age, and the queue manager's batching logic is extended to query the oldest unprocessed item per work unit and force claims when that age exceeds the threshold.

Changes

Stale Batch Timeout

Layer / File(s)	Summary
Stale batch hours configuration `src/config.py`	`DeriverSettings.STALE_BATCH_HOURS` field with default `0` and bounds `0..168`, representing the age threshold in hours above which representation work units are force-processed.
Queue batching with stale threshold `src/deriver/queue_manager.py`	`QueueManager.get_and_claim_work_units` batching logic is extended with an `oldest_item_subq` outer join and conditional clause to force representation work unit claims when the oldest unprocessed item exceeds `STALE_BATCH_HOURS`, with fallback to token-count-only behavior when the threshold is non-positive.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

A rabbit hops through queue so long,
And checks: "Are items here too strong?"
When age exceeds the threshold set,
The oldest items? Time to get!
Hop — process them before they fret. 🐰⏰

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and concisely describes the main change: adding a stale batch timeout mechanism to prevent indefinite queue stalling.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

menhguin · 2026-05-23T10:24:46Z

+1, would love to see this merged. We're hitting the exact symptom on the managed instance (api.honcho.dev).

Our context:

Hosted workspace, OpenClaw integration ingesting messages from chat sessions across ~6 conversational surfaces (Slack, Telegram, Signal, etc.)
Many sessions are short — single-digit message counts is common for one-off threads or thread-replies that route into their own session
Deriver processes sporadically. Over the past 24h: long stretches of zero derivations punctuated by short bursts. Some sessions <10 messages have been queued for days without a representation update.

What we tried before finding this PR:

Lowered MESSAGE_THRESHOLD to 2 (suspecting per-session message count was gating). Briefly spiked throughput to ~70 derivations/hour, then stalled again — consistent with this PR's root cause: lowering the per-session threshold gets more work units onto the queue, but they still never reach REPRESENTATION_BATCH_MAX_TOKENS, so the queue manager never claims them.
Considered toggling FLUSH_ENABLED=true but that gives up the token-batching efficiency — not what we want as a permanent answer.
Verified the deriver worker process is running and not crashed.

Why this PR is the right shape:

Gated behind STALE_BATCH_HOURS > 0 (default 0 = current behavior) — strictly additive, no regression risk for existing deployments.
Composes with FLUSH_ENABLED cleanly: token-threshold-first, stale-timeout as a backstop.
The 0–168 upper bound is a sensible safeguard.

Requests, if helpful:

Once merged, please consider exposing STALE_BATCH_HOURS as a per-workspace or per-instance setting on the managed api.honcho.dev deployment — currently we have no knob for this from the hosted side.
Sensible default for hosted: something like 2–6 hours would catch our case without much queue churn.
Slightly orthogonal but related: it would help operators if honcho.queue_status(...) surfaced the oldest unprocessed created_at per work unit, so stalled work units are diagnosable without DB access.

Happy to test against a self-hosted build with this patch applied if it helps validate before merge.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(deriver): add stale batch timeout to prevent indefinite queue stall#703

feat(deriver): add stale batch timeout to prevent indefinite queue stall#703
maxxqf-ai wants to merge 1 commit into
plastic-labs:mainfrom
maxxqf-ai:feat/stale-batch-timeout

maxxqf-ai commented May 20, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented May 20, 2026 •

edited

Loading

❌ Failed checks (1 warning)

Uh oh!

menhguin commented May 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

maxxqf-ai commented May 20, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Why not just use FLUSH_ENABLED=true?

Solution

Changes

Configuration Example

Related

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

menhguin commented May 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

maxxqf-ai commented May 20, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 20, 2026 •

edited

Loading