Skip to content

feat(inbox): retain_until_acked + auto_ack_on_pull delivery model#23

Merged
dundas merged 4 commits into
mainfrom
feat/retain-until-acked
Mar 1, 2026
Merged

feat(inbox): retain_until_acked + auto_ack_on_pull delivery model#23
dundas merged 4 commits into
mainfrom
feat/retain-until-acked

Conversation

@dundas
Copy link
Copy Markdown
Owner

@dundas dundas commented Mar 1, 2026

Summary

  • retain_until_acked — new per-message flag (sender-set). When true, the hub requires explicit POST /ack before removing the message, regardless of the recipient's auto_ack_on_pull preference. Work orders and fix requests always set this automatically server-side (by message type), so important work is never silently consumed.

  • auto_ack_on_pull — new per-agent flag (set at registration, default false). When true, the hub immediately acks the message on pull — fire-and-forget delivery for agents that process messages without needing retry guarantees. Existing agents are unaffected (opt-in only).

  • Safety: the auto-ack path in pull() is failure-safe. If the ack storage write throws, the message stays leased and requeues after the visibility timeout — a failed ack is always safer than a silently consumed message.

Files changed

File Change
src/routes/agents.js Accept auto_ack_on_pull in registration, include in response
src/services/agent.service.js Store auto_ack_on_pull: false on agent record
src/routes/inbox.js Accept retain_until_acked in send body
src/services/inbox.service.js Store retain_until_acked in send(); auto-set for work_order/fix_request; honor auto_ack_on_pull in pull() with failure-safe ack

Test plan

  • All existing inbox tests pass (send->pull->ack flow, nack, ephemeral, trust list)
  • 140+ tests pass; only pre-existing outbox failures remain (unrelated to this PR)
  • Manual smoke: register agent with auto_ack_on_pull: true, send a notification, pull — verify auto_acked: true in response
  • Manual smoke: send a work_order, pull with auto_ack_on_pull agent — verify hub still requires explicit ack (retain_until_acked overrides auto-ack)

Related

Completes the explicit ack model work order from decisive.gm. Previous commits:

  • a74282e removed auto-ack from pull loop, cmdAck calls hub
  • f30514c visibility timeout race fix (1h default + _acked/ guard)

🤖 Generated with Claude Code

Agent registration now accepts auto_ack_on_pull (default: false).
When true, the hub immediately acks messages on pull so the agent
does not need to call the ack endpoint. Opt-in -- no behaviour
change for existing agents.

Message send now accepts retain_until_acked (default: false). When
true, the hub ignores the recipient's auto_ack_on_pull preference
and requires explicit ack. Work orders and fix requests always set
this automatically.

Safety: in pull(), if the auto-ack write throws, the message stays
leased and requeues after the visibility timeout rather than being
silently consumed. The auto_acked flag is included in the pull
response so callers know not to ack again.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@claude
Copy link
Copy Markdown

claude Bot commented Mar 1, 2026

PR Review: feat(inbox): retain_until_acked + auto_ack_on_pull

The design intent is sound and the safety rationale (prefer requeue over silent loss) is well-reasoned. A few issues worth addressing before merge:


Bug: auto_acked flag is silently dropped by the pull route

inboxService.pull() returns { ...leased, auto_acked: true } for auto-acked messages, but src/routes/inbox.js only serializes four fixed fields:

res.json({
  message_id: message.id,
  envelope: message.envelope,
  lease_until: message.lease_until,   // stale — message is already acked in storage
  attempts: message.attempts
  // auto_acked never forwarded
});

Clients receiving a pull response have no way to know the message was auto-acked and must not call POST .../ack. If they do, ack() will reject with "Message must be leased before ack (current status: acked)" — a confusing error for a perfectly correct-looking flow. The route should surface auto_acked when present and ideally omit lease_until (or set it to null) for auto-acked messages.


State discrepancy in auto-ack return value

Related to the above: when auto-ack succeeds, the service returns { ...leased, auto_acked: true }. The leased snapshot still carries status: 'leased' and a future lease_until, but the record in storage is status: 'acked'. Any consumer of the service API (test code, future callers) will see inconsistent state. Consider returning an explicitly constructed object with the correct status: 'acked' and lease_until: null.


RETAIN_TYPES set allocated on every send() call

// inbox.service.js, inside send()
const RETAIN_TYPES = new Set(['work_order', 'fix_request']);

This allocates a new Set on every message send. It should be a module-level constant, matching the pattern already used for SAFE_CHARS and VALID_AGENT_URI at the top of the file.


Extra storage round-trip on every pull

const agent = await agentService.getAgent(agentId);
if (agent?.auto_ack_on_pull && !leased.retain_until_acked) { ... }

This adds one unconditional storage read after every lease, for every agent — including the majority that have auto_ack_on_pull: false. For high-throughput inboxes this will be felt. Consider fetching the agent record once at the start of pull() (before the lease), or passing the preference in as an option so the route can provide it without a second lookup.


Missing automated tests (CLAUDE.md requirement)

CLAUDE.md: "Add or update tests for behavior changes."

The PR's own test plan marks the two core new behaviors as manual-only (unchecked). No automated tests cover:

  • auto_ack_on_pull: true agent pulling a normal message → response includes auto_acked: true, message reaches status: 'acked'
  • retain_until_acked: true message pulled by an auto_ack_on_pull agent → auto-ack is skipped, explicit ack required
  • Server-side RETAIN_TYPES promotion (work_order, fix_requestretain_until_acked: true)
  • Failure-safe path: ack write throws → message stays leased, nothing returned

These are exactly the scenarios most likely to regress silently.


Missing documentation update (CLAUDE.md requirement)

CLAUDE.md: "Document behavior changes in docs/AGENT-GUIDE.md and/or README.md."

Neither file is updated. Operators and integrators need to know:

  • How to opt in to auto_ack_on_pull at registration
  • That work_order and fix_request messages are always retained regardless of agent preference
  • The retain_until_acked flag and its interaction with auto-ack

Minor: no input validation on auto_ack_on_pull

The !! coercion in agent.service.js normalizes the value, but the field arrives from user input with no documented type constraint. An explicit boolean check or schema validation would be more consistent with the project's security-defaults posture.


Summary

# Severity Description
1 Bug auto_acked not forwarded in pull route; lease_until is stale for auto-acked messages
2 Bug Service returns status: 'leased' snapshot when storage already has status: 'acked'
3 Medium RETAIN_TYPES Set re-allocated on every send() call
4 Medium Unconditional extra storage read on every pull()
5 Medium No automated tests for the two new delivery paths
6 Medium docs/AGENT-GUIDE.md / README.md not updated per CLAUDE.md
7 Low No type validation on auto_ack_on_pull input

The safety design (retain-on-ack-failure > silent loss) is correct and the retain_until_acked override logic is clean. Items 1–2 are functional bugs that should block merge; 3–6 should be addressed given the project's stated standards.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0d23c7f8cf

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/services/inbox.service.js Outdated
Comment on lines +231 to +234
await storage.updateMessage(message.id, {
status: 'acked',
acked_at: Date.now()
});
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Preserve ephemeral purging when auto-acking pulled messages

When auto_ack_on_pull is enabled, this path writes status: 'acked' directly and bypasses ack(), so ephemeral: true messages no longer have their body purged on ack. In that scenario, sensitive payloads remain stored in envelope.body (potentially indefinitely if no TTL is set), which breaks the documented ephemeral guarantee and changes security behavior for opted-in agents.

Useful? React with 👍 / 👎.

@greptile-apps
Copy link
Copy Markdown

greptile-apps Bot commented Mar 1, 2026

Greptile Summary

Adds two-tier delivery model: retain_until_acked (sender-set flag requiring explicit ack) and auto_ack_on_pull (recipient preference for fire-and-forget). Work orders and fix requests automatically set retain_until_acked server-side to prevent silent data loss.

  • Implementation correctly ensures retain_until_acked overrides auto_ack_on_pull preference
  • Auto-ack path properly throws on storage failure (message stays leased and requeues)
  • Missing auto_acked field in pull endpoint response needs to be added

Confidence Score: 4/5

  • Safe to merge after fixing the missing response field
  • Core logic is sound with proper failure-safe design, but the missing auto_acked field in the API response will cause the manual smoke test to fail
  • src/routes/inbox.js requires the auto_acked field to be added to the pull response

Important Files Changed

Filename Overview
src/routes/inbox.js added retain_until_acked parameter to send endpoint; missing auto_acked field in pull response
src/services/inbox.service.js implemented auto-ack logic with proper override for work_order/fix_request types — failure-safe design

Last reviewed commit: 0d23c7f

Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

4 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

@greptile-apps
Copy link
Copy Markdown

greptile-apps Bot commented Mar 1, 2026

Additional Comments (1)

src/routes/inbox.js
missing auto_acked field in response — the service returns { ...leased, auto_acked: true } but it's not included here

    res.json({
      message_id: message.id,
      envelope: message.envelope,
      lease_until: message.lease_until,
      attempts: message.attempts,
      auto_acked: message.auto_acked
    });

Bug fixes:
- auto_acked flag now forwarded in pull route response; lease_until
  set to null for auto-acked messages so callers know not to ack
- pull() now skips the lease entirely for auto_ack agents and returns
  status:'acked' with auto_acked:true (previously returned stale
  status:'leased' snapshot while storage already held 'acked')

Medium fixes:
- RETAIN_TYPES moved to module-level constant (was re-allocated on
  every send() call)
- Agent lookup moved before the lease in pull() so a failed lookup
  cannot leave the message in a leased state; auto_ack agents now
  skip the lease step entirely (cleaner state, one fewer write)
- 4 automated tests added covering auto_ack_on_pull, retain_until_acked
  override, work_order RETAIN_TYPE auto-promotion, and boolean validation
- docs/AGENT-GUIDE.md updated with retain_until_acked + auto_ack_on_pull
  usage guide per CLAUDE.md documentation requirement

Low fix:
- auto_ack_on_pull validated as boolean at registration (returns 400
  INVALID_INPUT on non-boolean value)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@claude
Copy link
Copy Markdown

claude Bot commented Mar 1, 2026

PR Review: feat(inbox): retain_until_acked + auto_ack_on_pull

The design is well-reasoned and the safety rationale (prefer requeue over silent loss) is solid. Documentation and test coverage are good. A few issues worth addressing before merge:


Bug: Missing input validation for retain_until_acked

auto_ack_on_pull has explicit boolean validation in src/routes/agents.js:

if (auto_ack_on_pull !== undefined && typeof auto_ack_on_pull !== 'boolean') {
  return res.status(400).json({ error: 'INVALID_INPUT', ... });
}

But retain_until_acked in src/routes/inbox.js has no equivalent check -- it is coerced directly:

retain_until_acked: retain_until_acked || false

A caller sending retain_until_acked: "yes" or retain_until_acked: 1 will silently succeed. The truthy string/number propagates into storage and the retain check works by accident (truthy check in if (!message.retain_until_acked)), but the API surface is inconsistent. Add the same pattern used for auto_ack_on_pull:

if (retain_until_acked !== undefined && typeof retain_until_acked !== 'boolean') {
  return res.status(400).json({ error: 'INVALID_INPUT', message: 'retain_until_acked must be a boolean' });
}

Performance: Unconditional agent lookup on every pull

// src/services/inbox.service.js, pull()
const agent = await agentService.getAgent(agentId);
const autoAck = agent?.auto_ack_on_pull ?? false;

This adds one storage read for every pull from every agent, even though the vast majority have auto_ack_on_pull: false. The fetch happens before the inbox read, making it two sequential storage round-trips on the hot path.

Options:

  1. Pass auto_ack_on_pull as an option from the route (which already has the authenticated agent context depending on your auth middleware).
  2. Cache the agent record briefly -- the preference is set at registration and rarely changes.
  3. At minimum, document the trade-off in a comment so a future reviewer does not remove the lookup assuming it is unnecessary.

Missing test for fix_request type in RETAIN_TYPES

server.test.js covers work_order forcing retain_until_acked server-side, but not fix_request. Both are in RETAIN_TYPES. A quick second assertion in the existing test or a separate case would close the gap.


Minor: Redundant !! in agent.service.js

auto_ack_on_pull: !!auto_ack_on_pull

The route already validates auto_ack_on_pull is a boolean before calling the service, so the double-bang is unnecessary. It implies uncertainty about the input type that does not exist at this point in the call stack.


What is done well

  • RETAIN_TYPES is correctly promoted to a module-level constant, matching the SAFE_CHARS pattern already in the file.
  • auto_acked: true is properly surfaced in the pull response, and lease_until is correctly null for auto-acked messages.
  • Agent preference is fetched before any write, preventing partial-state on lookup failure -- the inline comment explains this clearly.
  • The failure-safe behavior (ack write throws => message stays queued, not silently dropped) is correctly implemented and documented.
  • Test coverage for the main paths is solid: auto-ack happy path, retain_until_acked override, work_order server-side promotion, and invalid input rejection.
  • docs/AGENT-GUIDE.md update is clear and includes a useful decision table.

The retain_until_acked validation fix is the most important change before merging. The performance issue is worth tracking even if not blocking for the initial cut.

Bug fix:
- retain_until_acked now validated as boolean in send route (returns
  400 INVALID_INPUT on non-boolean value, matching auto_ack_on_pull)

Performance:
- pull() route passes auto_ack_on_pull from req.agent when the HTTP
  Signature middleware has already populated it, eliminating the
  redundant storage round-trip on the authenticated hot path. Falls
  back to a service-level lookup for legacy auth / test paths

Minor:
- Removed redundant !! coercion in agent.service.js (route validates
  the type before calling the service)
- Added fix_request RETAIN_TYPES test (previously only work_order
  was covered; both types are in RETAIN_TYPES)
- Added test for retain_until_acked non-boolean input rejection

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@claude
Copy link
Copy Markdown

claude Bot commented Mar 1, 2026

PR Review: feat(inbox): retain_until_acked + auto_ack_on_pull

The design is solid and the safety rationale is well-argued. This review focuses on genuine issues I found in the diff; I also note where earlier automated reviews misstated the state of the code.


Correcting prior review comments

Several points raised in the earlier automated reviews are already addressed in this diff and should not block merge:

Claim Reality
"auto_acked not forwarded in pull route" Route correctly spreads ...(message.auto_acked && { auto_acked: true }) and sets lease_until: null for auto-acked messages
"RETAIN_TYPES allocated on every send() call" RETAIN_TYPES is at module level, matching the SAFE_CHARS pattern
"No input validation on retain_until_acked" Explicit boolean check is present in src/routes/inbox.js before the service call
"No automated tests for new paths" Six new test cases cover auto-ack, retain override, work_order, fix_request, and invalid input rejection
"Missing docs update" docs/AGENT-GUIDE.md is updated with a clear explanation and decision table
"Missing fix_request test" test('fix_request type always sets retain_until_acked server-side', ...) is in the diff

Real issues

1. PR description inaccuracy - "stays leased" is wrong

In the auto-ack path, the message is never leased - we skip the lease step entirely. If the updateMessage write throws, the message stays queued (its original state), not leased. The safety property still holds (no silent loss), but the description is misleading and could confuse future maintainers who rely on it to understand the failure mode.

2. Redundant || false guards

Two places apply a || false fallback to values already guaranteed to be booleans or to have boolean defaults:

In agent.service.js: auto_ack_on_pull already defaults to false in the param list, so auto_ack_on_pull || false is redundant.

In inbox.js route: retain_until_acked is validated as boolean-or-undefined just above, so retain_until_acked || false is redundant.

These are not bugs (the coercion produces the right result), but they signal uncertainty about types that was just resolved by explicit validation a few lines earlier.

3. Fallback storage lookup on every pull without HTTP Signature auth

The route only populates req.agent when HTTP Signature auth succeeds. In tests (which do not send a Signature header, as the new test cases demonstrate) and for legacy auth paths, every pull() call incurs an extra storage read even though the result is false for the vast majority of agents.

For high-throughput inboxes this is two sequential storage reads per pull instead of one. Options (not necessarily blocking):

  • Thread the preference through from middleware to route to service so the service never needs to look it up.
  • At minimum, document the expected performance cost with a comment so a future reader does not assume the fallback is rare.

4. No atomicity guard on the auto-ack write

The auto-ack path reads queued messages, picks the oldest, then writes status: 'acked' unconditionally. Two concurrent pulls will both see the same message before either acks it. The standard lease path has the same potential race, so this does not introduce a new problem - but if the storage backend ever gains conditional-update support, the auto-ack path should use it.


What is done well

  • Safety invariant is correct: retain_until_acked always wins; RETAIN_TYPES auto-promotion happens server-side so callers cannot accidentally skip it.
  • Pull route response is correct: auto_acked: true only appears when applicable, and lease_until is null for auto-acked messages - clients cannot misread state.
  • RETAIN_TYPES module-level constant is placed correctly and commented well.
  • Test coverage is thorough: happy paths, override behavior, server-side promotion, and invalid input rejection.
  • AGENT-GUIDE.md update is clear and includes a useful decision table.

Summary

# Severity Item
1 Low PR description says "stays leased" - should say "stays queued"
2 Low Redundant `
3 Medium Fallback getAgent storage read on every pull without HTTP Signature auth
4 Low Auto-ack write has no conditional status guard (same race as standard pull)

None of these are blockers. Items 1-3 are worth cleaning up before merge; item 1 in particular should be corrected to avoid misleading future readers of the failure-safe rationale.

- Remove redundant `|| false` guards on already-validated booleans
  (auto_ack_on_pull has param default; retain_until_acked is validated
  as boolean-or-undefined just above the send() call)
- Expand inline comment on pull() fallback storage lookup to document
  the performance trade-off and future improvement path so the code
  is not incorrectly removed assuming the fallback is rare or free

PR description updated: "stays leased" -> "stays queued" (the auto-ack
path skips the lease step entirely; on ack write failure the message
stays in its original 'queued' state, not 'leased')

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@claude
Copy link
Copy Markdown

claude Bot commented Mar 1, 2026

PR Review: retain_until_acked + auto_ack_on_pull delivery model

Overall this is a well-thought-out feature with good documentation and test coverage. The safety-first design (failed auto-ack keeps the message leased for retry) is exactly right. A few issues worth addressing before merge:


Bug: Auto-ack path skips ephemeral body purge

src/services/inbox.service.js, auto-ack block (~line 240–247)

The regular ack() method handles ephemeral messages specially — it purges the body and sets status to 'purged' instead of 'acked'. The new auto-ack path bypasses this entirely:

if (autoAck && !message.retain_until_acked) {
  const acked = await storage.updateMessage(message.id, {
    status: 'acked',       // sets 'acked' unconditionally
    acked_at: Date.now(),
    attempts: message.attempts + 1
  });
  return { ...acked, auto_acked: true };
}

If a message has ephemeral: true and is pulled by an auto_ack_on_pull agent, the body will never be purged. The whole point of ephemeral messages is that the body is deleted on ack. This needs to mirror the ephemeral branch in ack():

if (message.ephemeral) {
  // purge body, set status 'purged'
} else {
  // set status 'acked'
}

There is no test covering this scenario — adding one (ephemeral: true message sent to an auto_ack_on_pull agent) would catch it.


Minor: Dead code in retainUntilAcked computation

src/services/inbox.service.js, line 140

const retainUntilAcked = options.retain_until_acked || RETAIN_TYPES.has(envelope.type) || false;

RETAIN_TYPES.has(...) already returns boolean, so || false is unreachable and misleading. Can just be:

const retainUntilAcked = options.retain_until_acked || RETAIN_TYPES.has(envelope.type);

Misleading error message for ack after auto-ack

src/services/inbox.service.js, ack() method (~line 279)

When a client calls POST .../ack on an already-auto-acked message, they get:

"Message must be leased before ack (current status: acked)"

This is technically correct but confusing — a caller who missed the auto_acked: true flag in the pull response won't understand why the message is already in 'acked' state without being leased. Referencing auto_acked in the error message would help.


Untested code path: HTTP Signature optimization in route

src/routes/inbox.js, lines 81–83

if (req.agent?.auto_ack_on_pull !== undefined) {
  pullOpts.auto_ack_on_pull = req.agent.auto_ack_on_pull;
}

The tests bypass HTTP Signature auth (the service comment acknowledges "test paths that do not send a Signature header"). This means the fast path where req.agent is populated is never exercised — tests always fall through to the service-level storage lookup. This is a gap worth documenting.


Suggestion: Surface retain_until_acked in status response

src/services/inbox.service.js, getStatus() (~line 430–438)

The status endpoint does not return retain_until_acked. This makes it hard to debug cases where a message is unexpectedly stuck waiting for explicit ack (e.g. a sender didn't realize work_order auto-sets the flag). A low-risk improvement.


Design note: No update path for auto_ack_on_pull

Agents can only set auto_ack_on_pull at registration with no way to change it later without re-registering. If this is intentional (immutable delivery preference), it should be stated explicitly in the agent guide.


Test coverage gaps

The new tests cover the critical cases well. Missing scenarios:

  • Ephemeral + auto-ack (body should be purged, status should be 'purged' not 'acked')
  • HTTP Signature fast path for auto_ack_on_pull (route-level optimization is untested)
  • nack on an auto-acked message (behavior is untested — does it error as expected?)

Overall the design is sound and the safety invariants (retain_until_acked > auto_ack_on_pull, ack failure keeps message leased) are correct. The ephemeral body purge issue is the most important thing to fix before merge.

@dundas dundas merged commit f739599 into main Mar 1, 2026
2 checks passed
@dundas dundas deleted the feat/retain-until-acked branch March 1, 2026 23:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant