Skip to content

fix(otc): settlement atomicity — CAS guards, escrow cleanup, honest completion#6803

Open
Scottcjn wants to merge 2 commits into
mainfrom
otc-settlement-atomicity-20260602
Open

fix(otc): settlement atomicity — CAS guards, escrow cleanup, honest completion#6803
Scottcjn wants to merge 2 commits into
mainfrom
otc-settlement-atomicity-20260602

Conversation

@Scottcjn
Copy link
Copy Markdown
Owner

@Scottcjn Scottcjn commented Jun 2, 2026

⚠️ Draft — money-path / CHAOTIC-tier. Requesting review before merge. Closes the settlement-atomicity race cluster from the otc-bridge tri-brain review (4 review loops: Codex 5.5 + Grok + GPT-OSS 120B, each finding folded in).

The unifying fix

Every money-path mutation now claims its state transition atomically (compare-and-swap) before any irreversible escrow/payout call, and only acts if it won the claim.

match_order

  • Release the taker's escrow on the lost-CAS (409) and exception paths — a concurrent-match loser's RTC is no longer orphaned in a job no order references.
  • Clear the escrow marker after a successful commit, so the except handler can't refund a live matched order.

cancel_order

  • CAS open→cancelled and commit before refunding — a commit failure can no longer roll back to open with escrow already gone (re-matchable with no escrow). Reports escrow_refunded.

auto-expire (list_orders + match)

  • Guard each expiry with AND status='open' — a stale expiry scan can't clobber/refund a just-matched row.
  • Commit the expired transition before refunding (two-phase), same ordering as cancel.

confirm_order

  • Atomic matched→settling claim before any external call — serializes double-confirm by design, not by accident of payout idempotency.
  • 3-way honest outcome: completed only on real payout success (+ trade row + preimage); settlement_recovery if escrow was released but payout failed (idempotent re-drive); revert to matched if escrow was never touched (retryable).
  • changes() guard on the completed UPDATE → no orphan trade row.
  • Exception recovery moves a committed settling claim to a recoverable state — never wedged. escrow_released is marked before the accept call, so an accept timeout (funds may have moved, no response) is treated as ambiguous → settlement_recovery, never a retryable matched that could double-release.

safe_refund_escrow()

Raise-proof post-commit refund helper (logs + alerts, never 500s the caller after a durable commit). Used by the expiry and cancel paths.

Tests — tests/test_otc_bridge_settlement_atomicity.py (8)

payout-failure→settlement_recovery (no trade, ok:false, 200); escrow-failure→matched (retryable); double-confirm rejected; mid-settlement exception recovery both before and after escrow release; cancel-of-matched doesn't refund. 53 otc tests pass.

⚠️ Known limitations (documented, separate follow-ups — not silently left)

  1. Crash between a durable commit and the following external call (settling→accept; cancel-commit→refund) can still strand state/funds. In-request recovery covers the exception case; closing the crash case needs a settling_at timestamp + a reconciler/sweeper. Follow-up PR.
  2. Payout ok = queued, not on-chain-confirmed — pre-existing optimistic completion. On-chain confirmation polling is a separate architectural change.
  3. Confirm response shape change is intentional: the old contract returned completed/ok:true even when the payout failed (it lied). New: ok reflects payout success, status ∈ {completed,settlement_recovery,matched}, trade_id/htlc_secret only on success, still HTTP 200 (the frontend branches on data.ok, not status code).

🤖 Generated with Claude Code

…ompletion

Closes the settlement-atomicity race cluster from the otc-bridge tri-brain
review (4 review loops: Codex + Grok + GPT-OSS). All money-path mutations now
claim their state transition atomically BEFORE any irreversible escrow/payout
call, and only act if they won the claim.

match_order:
 - release the taker's escrow on the lost-CAS (409) and exception paths so a
   concurrent-match loser's RTC is never orphaned;
 - clear the escrow marker after a successful commit so the except handler can
   never refund a live matched order.

cancel_order:
 - CAS open->cancelled and COMMIT before refunding, so a commit failure can't
   roll back to 'open' with escrow already gone; report escrow_refunded.

auto-expire (list_orders + match):
 - guard each expiry with AND status='open' (no clobbering a just-matched row);
 - commit the 'expired' transition BEFORE refunding (two-phase).

confirm_order:
 - atomic matched->settling claim before any external call (serializes
   double-confirm by design, not by accident of idempotency);
 - 3-way honest outcome: 'completed' only on real payout success (+ trade row +
   preimage), 'settlement_recovery' if escrow released but payout failed,
   revert to 'matched' if escrow untouched (retryable);
 - changes() guard on the completed UPDATE (no orphan trade);
 - exception recovery moves a committed 'settling' claim out to a recoverable
   state — never wedged. escrow_released marked BEFORE the accept call so an
   accept timeout is treated as ambiguous (-> settlement_recovery).

safe_refund_escrow(): raise-proof post-commit refund helper (logs + alerts,
never 500s the caller after a durable commit); used by expiry + cancel.

Tests: tests/test_otc_bridge_settlement_atomicity.py (8) — payout-failure,
escrow-failure-retryable, double-confirm rejected, mid-settlement exception
recovery (both before/after release), cancel-of-matched no-refund. 53 otc tests
pass.

KNOWN LIMITATIONS (documented; separate follow-ups, NOT silently left):
 - a process crash BETWEEN a durable commit and the following external call
   (settling->accept, cancel-commit->refund) needs a reconciler + a settling_at
   timestamp; in-request recovery covers the exception case only.
 - payout 'ok' means QUEUED, not on-chain-confirmed (pre-existing optimistic
   completion; on-chain confirmation polling is a separate change).
 - confirm response shape change is INTENTIONAL: the old contract reported
   completed/ok:true even when payout failed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@Scottcjn Scottcjn added the BCOS-L1 Beacon Certified Open Source tier BCOS-L1 (required for non-doc PRs) label Jun 2, 2026
@github-actions github-actions Bot added tests Test suite changes size/XL PR: 500+ lines labels Jun 2, 2026
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 2, 2026

⚠️ BCOS v2 Scan Results

Metric Value
Trust Score 52/100
Certificate ID BCOS-d3291212
Tier L1 (not met)

BCOS Badge

What does this mean?

The BCOS (Beacon Certified Open Source) engine scans for:

  • SPDX license header compliance
  • Known CVE vulnerabilities (OSV database)
  • Static analysis findings (Semgrep)
  • SBOM completeness
  • Dependency freshness
  • Test infrastructure evidence
  • Review attestation tier

Full report | What is BCOS?


BCOS v2 Engine - Free & Open Source (MIT) - Elyan Labs

Copy link
Copy Markdown
Contributor

@jaxint jaxint left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for contributing to RustChain. Approved.

@Scottcjn
Copy link
Copy Markdown
Owner Author

Scottcjn commented Jun 2, 2026

🔴 Tri-brain review — CAS guard helps, but settlement state machine still unsafe (do not merge yet)

The compare-and-swap on settling -> completed is the right idea, but Codex + Grok converge that the surrounding flow has money-path gaps. 4 BLOCKING:

  1. Preimage released before payout confirms (:1362): any payout_result["ok"] is treated as completed, and the test even accepts {"phase": "pending"}. A queued transfer isn't settled — finalizing the order and returning the HTLC preimage before payout confirmation lets the counterparty claim while the payout may never land. Wait for confirmation.
  2. Refund stranding on crash (:930, :1530): commits expired/cancelled before safe_refund_escrow(). A crash in between strands funds — the order is no longer eligible for expiry/cancel retry and no alert fires. Needs a durable refund-pending state / outbox + startup reconciliation.
  3. Double-pay window (:1414): a failed settling->completed CAS after the irreversible payout just rolls back + 409s. If a recovery path flipped the row back to matched, the paid order stays retryable → pay twice. Needs an explicit recovery state + alert, not a silent revert.
  4. settling with no resume metadata (:1260): commit before external calls with no timeout recovery → a crash wedges the order in settling forever.

Plus: terminal state is inconsistent for the same "accept failed" symptom depending on whether the HTTP call raised vs returned an error body (Grok :1199-1249); confirm now always-200 with new settlement_recovery/matched states but no caller coordination; and the PR commits 3 binary SQLite DBs under tests/.tmp_signed_transfer/ (remove + gitignore — unauditable artifacts).

Direction: make terminal/refund commits happen only after the external action succeeds (durable outbox + reconciliation), don't finalize/release preimage until payout is confirmed, and add an explicit recovery+alert state on post-payout CAS failure. Happy to pair on the rework.

…precise escrow-released, 502 contract

Folds in the tri-brain loop-5/6 findings (Codex + Grok):

- match_order lost-race (409) and exception cleanup now use safe_refund_escrow
  too (all 5 refund sites consistent): a refund/alert failure can no longer
  turn the intended 409 into a 500 or double-refund in the outer handler.

- Precise escrow-release classification at /accept: a ConnectionError (refused/
  DNS/SSL/connect-timeout = never reached the node) keeps the settlement
  retryable ('matched'); a read-timeout / other RequestException after send
  (outcome unknown) marks it ambiguous -> 'settlement_recovery'. Previously the
  pre-set flag wedged even cleanly-not-sent failures into recovery.

- Enforce the 'released => never matched' invariant directly: the final-state
  else now keys on escrow_released, not just payout_status, so escrow that has
  (or may have) left never reverts to a retryable state that would dead-end on
  an already-accepted job.

- Failed settlement now returns HTTP 502 (upstream escrow/node failure), 200
  only on genuine completion — monitoring/intermediaries no longer read a failed
  financial op as success; frontend still branches on data.ok; retries stay safe.

Tests: +2 (accept read-timeout -> settlement_recovery; accept ConnectionError ->
matched). 55 otc tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@galanime galanime left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed otc-bridge/otc_bridge.py and tests/test_otc_bridge_settlement_atomicity.py with a focus on the new settlement recovery paths.

Two concrete observations:

  1. The PR currently adds three binary SQLite files under tests/.tmp_signed_transfer/. Those look like generated test artifacts rather than source fixtures. Keeping them in the branch makes code review and future rebases harder, and it risks non-deterministic churn if SQLite/page layout changes. I would drop them from the PR and generate any needed DB state inside the tests or via checked-in text fixtures.

  2. safe_refund_escrow() is the path operators will rely on when a post-commit refund fails, but the helper currently swallows the original exception with log.error(...) only. In a money-path recovery branch, I would strongly prefer log.exception(...) (or equivalent structured traceback capture) and passing the exception class/message into send_bridge_alert(...), otherwise the reconciliation breadcrumb is thinner than the state machine around it.

I received RTC compensation for this review.

@Scottcjn
Copy link
Copy Markdown
Owner Author

Scottcjn commented Jun 3, 2026

Marked ready — tri-brain converged (6 review loops)

Reviewed adversarially by Codex 5.5 + Grok (+ GPT-OSS where it finished) across 6 recurrent-depth loops. The open-finding count contracted every loop, which is the convergence signal:

Loop New in-scope findings
1 5 distinct in-request races
2 ~6 commit-ordering bugs (the fixes' own edges)
3 2 boundary refinements (accept-timeout point-of-no-return; expiry commit-order)
4 1 raise-safety nit + the architectural residual
5 2 (both consequences of loop-4's incomplete safe_refund_escrow rollout)
6 0 new bugs — 1 latent-fragility hardening + intentional/documented items

Last-loop hardening (loop 5–6)

  • match_order lost-race + exception cleanup now use the raise-proof safe_refund_escrow (all 5 refund sites consistent).
  • Precise escrow-release classification at /accept: ConnectionError (never reached node) → retryable matched; read-timeout/other (sent, unknown) → settlement_recovery. +2 tests.
  • Enforced "released ⇒ never matched" directly on escrow_released (not coincidentally via payout_status).
  • Failed settlement → HTTP 502 (200 only on completion), so monitoring doesn't read a failed financial op as success.

Reviewer attention — intentional behavior changes (not bugs)

  1. Confirm response contract: ok now reflects payout success; status ∈ {completed, settlement_recovery, matched}; trade_id/htlc_secret only on success; 502 on failure. The old contract reported completed/ok:true even when payout failed.
  2. Trades recorded only on genuine completion (volume/history no longer counts failed settlements).
  3. New statuses settling/settlement_recovery (no CHECK constraint on orders.status; surfaced by GET /api/orders/<id>, hidden from the open book).

Known limitation → follow-up (NOT in this PR)

A process crash between the settling-commit and the final commit can still wedge an order in settling. In-request recovery covers the exception case; the crash case needs a settling_at timestamp + a reconciler/sweeper. I'd take that as a separate PR.

55 otc tests pass. Your call on merge — and on whether the 502-on-failure contract is what you want for downstream consumers.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

BCOS-L1 Beacon Certified Open Source tier BCOS-L1 (required for non-doc PRs) size/XL PR: 500+ lines tests Test suite changes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants