Skip to content

feat(oracle): pin follow-up tasks to the owning worker (session→worker affinity)#979

Merged
chrono-kw merged 3 commits into
mainfrom
feat/oracle-session-worker-affinity
Jun 19, 2026
Merged

feat(oracle): pin follow-up tasks to the owning worker (session→worker affinity)#979
chrono-kw merged 3 commits into
mainfrom
feat/oracle-session-worker-affinity

Conversation

@kaihuei0114

Copy link
Copy Markdown
Collaborator

Closes #978.

Problem

oracle_task_service::claim_task claimed the oldest queued task filtered only by pool + status, with no notion of which account/worker a conversation belongs to. On a multi-account shared pool, a follow-up for account A's conversation could be claimed by account B's worker, which cannot open A's /c/<id> → wrong/blank/failed answer. This made multi-turn unusable on any shared pool. A fixed conversation_id pins which conversation, not which account, so it can't be fixed client-side.

Solution

Pin follow-ups to the worker/account that created the session; keep fresh tasks competitive.

  • Modelowner_worker_label: Option<String> on OracleSession; required_worker_label: Option<String> on OracleTask. Both serde(default, skip_serializing_if = None), so legacy docs stay unpinned and the fields never surface in consumer/worker payloads.
  • Stamp owner on first answerworker_submit_result sets owner_worker_label on the first successful result via a { owner_worker_label: null } filter (idempotent → the first account to answer keeps ownership). Same stamp added to the attach/transcript-import path, since that account physically owns the scraped /c/<id>.
  • Pin at submitsubmit_task copies the session's owner onto follow-ups as required_worker_label.
  • Filter at claimclaim_task queued filter now ANDs $or: [ {required_worker_label: null}, {required_worker_label: <worker>} ]: fresh (null/absent) → any worker; pinned follow-up → owning worker only.
  • Index — added { pool_id, status, required_worker_label, created_at } on oracle_tasks.

Workers and the CDP/userscript clients need no changerequired_worker_label rides along in the existing task and the client already obeys conversation_url.

Backward compatibility

  • Legacy sessions with no owner_worker_label stay unpinned (today's behavior).
  • If the owning account's worker is offline, its follow-ups wait in queue (correct — you cannot continue account A's chat without account A).

Tests

Added followup_pins_to_owning_worker: owner stamping on first result, follow-up pinning, a non-owner worker idling rather than misrouting, and fresh tasks still claimable by any worker. Updated affected struct-literal fixtures.

⚠️ Not compiled locally — this environment has no Rust toolchain. Relying on CI for cargo build/cargo test.

…r affinity)

Closes #978.

A follow-up task could be dispatched to a worker on a different ChatGPT
account than the one that owns the conversation, making multi-turn
unusable on any shared / multi-account pool. Pin follow-ups to the
worker/account that created the session; keep fresh tasks competitive.

- Add `owner_worker_label` to OracleSession and `required_worker_label`
  to OracleTask (both serde-optional → legacy docs stay unpinned).
- Stamp `owner_worker_label` on the first successful result
  (worker_submit_result) and on imported transcripts (attach), using a
  `{ owner_worker_label: null }` filter so the first account to answer
  keeps ownership.
- submit_task copies the session owner onto follow-ups as
  `required_worker_label`.
- claim_task filters queued tasks by
  `$or: [ {required_worker_label: null}, {required_worker_label: <worker>} ]`
  so fresh tasks go to any worker and follow-ups only to the owner.
- Add a `{ pool_id, status, required_worker_label, created_at }` index.

Workers and the CDP/userscript clients need no change.
@github-actions

github-actions Bot commented Jun 19, 2026

Copy link
Copy Markdown

📊 Code coverage

Component Lines Threshold Status Δ vs base
Backend (nyxid) 77.84% 73% 🔺 +0.03

Gate: line coverage must stay at or above the threshold. Ratchet plan (W21): Backend → 55%, CLI → 50%, Frontend → 30% by quarter end.

@chrono-kw

Copy link
Copy Markdown
Contributor

Review: solid, well-scoped fix — okay to merge, with two things worth surfacing first

Pulled the branch and read the full claim_task / submit_task / worker_submit_result / worker_payload context, the index change, the TTL/quota model, and cross-checked every struct-literal site. The core logic is correct and genuinely backward-compatible: follow-ups pin to the account that created the conversation, fresh tasks stay competitively load-balanced.

What's right

  • MongoDB semantics are correct. { required_worker_label: null } matches both absent and explicit-null, so fresh tasks (field omitted via skip_serializing_if) and legacy docs all fall into the "claimable by anyone" branch. Same trick on the owner_worker_label: null stamp filter gives idempotent first-writer-wins ownership. The subtle part is done right.
  • No head-of-line blocking. The filter is applied before the created_at sort, so a follow-up pinned to an offline worker doesn't stall other workers — they skip it and claim the next matching task.
  • All 11 struct-literal sites updated (5× OracleSession, 6× OracleTask), checked each against the diff. Should compile despite the "not compiled locally" caveat; CI's cargo build/cargo test remains the real gate.
  • Index {pool_id, status, required_worker_label, created_at} supports both $or branches plus the sort; the old {pool_id, status, created_at} is retained for the other queries.
  • Test coverage is on-point: owner stamping, follow-up pinning, non-owner idling instead of misrouting, and fresh-task load-balancing all asserted.

Two things to flag before/with merge

1. Pinned follow-ups whose owning worker never returns become permanent zombies that consume quota. Chain of facts:

  • db.rs:1695 — queued tasks have no expires_at and are never TTL-expired by Mongo.
  • A queued task is never dispatched to a non-matching worker, so its lease never starts → requeue_expired_leases never touches it.
  • enforce_submit_quotas counts status ∈ {queued, dispatched} against both per_user_max_inflight (per submitter) and max_queue_length (whole pool).

So if account A's worker disappears or reconnects under a different label, A's follow-ups sit queued forever, silently eating the submitter's inflight budget and pool queue capacity with no automatic recovery. The issue explicitly defers "lease/age fallback later," so it's known — suggest tracking it as a follow-up (an age-based unpin that clears required_worker_label, or excluding unclaimable-pinned tasks from quota). The common single-account pool case never triggers this.

2. Affinity keys on a client-chosen worker_label, not a durable account identity. Correctness rests on the operational invariant "one stable label ⇔ one account":

  • Two tabs of the same account under different labels (?nyx=1 vs ?nyx=2) are treated as different accounts → over-pinning (tab_2 could serve it but won't), and exposure to the zombie case if tab_1 closes.
  • If two different accounts ever reuse the same label, you silently get back the exact misroute this PR fixes.

Neither is a regression, but since the fix lives or dies by that invariant, it'd be worth a line in the code/docs.

Minor (non-blocking)

  • On a failed first result, turn_count is incremented and chatgpt_url may be pinned (worker_submit_result), but owner_worker_label intentionally is not (correct). Narrow consequence: a failed turn-1 that still reported a URL leaves a session with a pinned URL but no owner, so the next follow-up is unpinned and may land elsewhere. Acceptable, but untested — worth a test if you want completeness.
  • Owner stamping is a second update_one after the turn-count bump. Fine as-is (the null-guard can't be folded into the unconditional bump), just an extra round-trip.

…iant

Addresses review feedback on the session→worker affinity PR.

- Add release_stale_affinity(): a follow-up pinned to an owning account
  whose worker never returns is unpinned once it has waited a full
  task_timeout_secs window, so it stops leaking the submitter's inflight
  quota and the pool queue cap forever (the "lease/age fallback" the issue
  deferred). Swept in claim_task, which every live worker polls.
- Document the worker_label⇔account invariant that affinity rests on, on
  OracleSession::owner_worker_label.
- Tests: stale_followup_affinity_is_released_after_grace (pinned →
  unclaimable before grace, claimable by any worker after) and
  failed_first_turn_leaves_session_unowned (failed turn 1 counts the turn
  and pins the URL but does not stamp ownership, so the next follow-up
  stays unpinned).
@kaihuei0114

Copy link
Copy Markdown
Collaborator Author

Thanks for the thorough read — addressed all three in bc3353b.

1. Zombie pinned follow-ups consuming quota — implemented the issue's deferred lease/age fallback as release_stale_affinity(): a follow-up pinned to an owning worker that never returns is unpinned once it has waited a full task_timeout_secs window (generous enough to absorb a tab reload / network blip), after which any worker may claim it. A non-owner that picks it up can't reopen the other account's /c/<id> and reports an extraction failure — but that's a terminal state that frees the inflight + queue quota instead of leaking it forever. It's swept in claim_task, which every live worker polls continuously, so recovery runs as long as any worker in the pool is alive. (A fully dead pool already hangs all queued tasks regardless of this PR, so no new behavior there.) Covered by stale_followup_affinity_is_released_after_grace.

2. Affinity keys on a client-chosen worker_label — documented the one-stable-label-⇔-one-account invariant on OracleSession::owner_worker_label, including the same-account-two-labels (over-pin, harmless) and two-accounts-one-label (misroute) failure modes, and noted that label assignment lives in the CDP/userscript clients (?nyx=Ntab_N).

3. Minor — failed turn-1 with a URL but no owner — added failed_first_turn_leaves_session_unowned: confirms a failed first turn counts the turn and pins chatgpt_url but does not stamp owner_worker_label, so the next follow-up stays unpinned.

The {pool_id, status, required_worker_label, created_at} index also serves the new release sweep's filter. Verified locally against MongoDB: 27/27 oracle_task tests pass, cargo build + clippy clean.

@chrono-kw

Copy link
Copy Markdown
Contributor

Re-reviewed bc3353b — all three addressed cleanly, and good to see it compiled + tested locally this round (27/27, clippy clean). 👍

  • Stale-affinity release is the right shape: swept in claim_task alongside requeue_expired_leases, bounded by queue length, idempotent (released tasks no longer match required_worker_label: {$ne: null}), and recovery frees the quota because the resulting failed task gets expires_at and drops out of the inflight/queue counts. Index covers the sweep filter too.
  • Invariant doc on owner_worker_label and the failed-first-turn test both land.

One residual assumption, not a blocker — flagging for the record: when a released follow-up is claimed by a non-owner, worker_payload still resolves conversation_url to the owner's /c/<id>. The quota-recovery story relies on the non-owner client failing to open that URL and returning ERROR: (→ terminal → quota freed). If a client instead silently starts a fresh conversation and answers, the task completes as a context-free "success" on the wrong account. Strictly better than the pre-PR always-misroute and only after the multi-hour grace, so fine to merge — just worth knowing the clean-recovery property depends on that client-side behavior.

LGTM.

@chrono-kw chrono-kw merged commit 71df6b0 into main Jun 19, 2026
26 checks passed
@chrono-kw chrono-kw deleted the feat/oracle-session-worker-affinity branch June 19, 2026 07:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Oracle relay: pin follow-up tasks to the originating worker (session→worker affinity) so multi-account shared pools support correct multi-turn

2 participants