fix(otc): settlement atomicity — CAS guards, escrow cleanup, honest completion#6803
fix(otc): settlement atomicity — CAS guards, escrow cleanup, honest completion#6803Scottcjn wants to merge 2 commits into
Conversation
…ompletion Closes the settlement-atomicity race cluster from the otc-bridge tri-brain review (4 review loops: Codex + Grok + GPT-OSS). All money-path mutations now claim their state transition atomically BEFORE any irreversible escrow/payout call, and only act if they won the claim. match_order: - release the taker's escrow on the lost-CAS (409) and exception paths so a concurrent-match loser's RTC is never orphaned; - clear the escrow marker after a successful commit so the except handler can never refund a live matched order. cancel_order: - CAS open->cancelled and COMMIT before refunding, so a commit failure can't roll back to 'open' with escrow already gone; report escrow_refunded. auto-expire (list_orders + match): - guard each expiry with AND status='open' (no clobbering a just-matched row); - commit the 'expired' transition BEFORE refunding (two-phase). confirm_order: - atomic matched->settling claim before any external call (serializes double-confirm by design, not by accident of idempotency); - 3-way honest outcome: 'completed' only on real payout success (+ trade row + preimage), 'settlement_recovery' if escrow released but payout failed, revert to 'matched' if escrow untouched (retryable); - changes() guard on the completed UPDATE (no orphan trade); - exception recovery moves a committed 'settling' claim out to a recoverable state — never wedged. escrow_released marked BEFORE the accept call so an accept timeout is treated as ambiguous (-> settlement_recovery). safe_refund_escrow(): raise-proof post-commit refund helper (logs + alerts, never 500s the caller after a durable commit); used by expiry + cancel. Tests: tests/test_otc_bridge_settlement_atomicity.py (8) — payout-failure, escrow-failure-retryable, double-confirm rejected, mid-settlement exception recovery (both before/after release), cancel-of-matched no-refund. 53 otc tests pass. KNOWN LIMITATIONS (documented; separate follow-ups, NOT silently left): - a process crash BETWEEN a durable commit and the following external call (settling->accept, cancel-commit->refund) needs a reconciler + a settling_at timestamp; in-request recovery covers the exception case only. - payout 'ok' means QUEUED, not on-chain-confirmed (pre-existing optimistic completion; on-chain confirmation polling is a separate change). - confirm response shape change is INTENTIONAL: the old contract reported completed/ok:true even when payout failed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
| Metric | Value |
|---|---|
| Trust Score | 52/100 |
| Certificate ID | BCOS-d3291212 |
| Tier | L1 (not met) |
What does this mean?
The BCOS (Beacon Certified Open Source) engine scans for:
- SPDX license header compliance
- Known CVE vulnerabilities (OSV database)
- Static analysis findings (Semgrep)
- SBOM completeness
- Dependency freshness
- Test infrastructure evidence
- Review attestation tier
BCOS v2 Engine - Free & Open Source (MIT) - Elyan Labs
jaxint
left a comment
There was a problem hiding this comment.
LGTM! Thanks for contributing to RustChain. Approved.
🔴 Tri-brain review — CAS guard helps, but settlement state machine still unsafe (do not merge yet)The compare-and-swap on
Plus: terminal state is inconsistent for the same "accept failed" symptom depending on whether the HTTP call raised vs returned an error body (Grok Direction: make terminal/refund commits happen only after the external action succeeds (durable outbox + reconciliation), don't finalize/release preimage until payout is confirmed, and add an explicit recovery+alert state on post-payout CAS failure. Happy to pair on the rework. |
…precise escrow-released, 502 contract
Folds in the tri-brain loop-5/6 findings (Codex + Grok):
- match_order lost-race (409) and exception cleanup now use safe_refund_escrow
too (all 5 refund sites consistent): a refund/alert failure can no longer
turn the intended 409 into a 500 or double-refund in the outer handler.
- Precise escrow-release classification at /accept: a ConnectionError (refused/
DNS/SSL/connect-timeout = never reached the node) keeps the settlement
retryable ('matched'); a read-timeout / other RequestException after send
(outcome unknown) marks it ambiguous -> 'settlement_recovery'. Previously the
pre-set flag wedged even cleanly-not-sent failures into recovery.
- Enforce the 'released => never matched' invariant directly: the final-state
else now keys on escrow_released, not just payout_status, so escrow that has
(or may have) left never reverts to a retryable state that would dead-end on
an already-accepted job.
- Failed settlement now returns HTTP 502 (upstream escrow/node failure), 200
only on genuine completion — monitoring/intermediaries no longer read a failed
financial op as success; frontend still branches on data.ok; retries stay safe.
Tests: +2 (accept read-timeout -> settlement_recovery; accept ConnectionError ->
matched). 55 otc tests pass.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
galanime
left a comment
There was a problem hiding this comment.
Reviewed otc-bridge/otc_bridge.py and tests/test_otc_bridge_settlement_atomicity.py with a focus on the new settlement recovery paths.
Two concrete observations:
-
The PR currently adds three binary SQLite files under
tests/.tmp_signed_transfer/. Those look like generated test artifacts rather than source fixtures. Keeping them in the branch makes code review and future rebases harder, and it risks non-deterministic churn if SQLite/page layout changes. I would drop them from the PR and generate any needed DB state inside the tests or via checked-in text fixtures. -
safe_refund_escrow()is the path operators will rely on when a post-commit refund fails, but the helper currently swallows the original exception withlog.error(...)only. In a money-path recovery branch, I would strongly preferlog.exception(...)(or equivalent structured traceback capture) and passing the exception class/message intosend_bridge_alert(...), otherwise the reconciliation breadcrumb is thinner than the state machine around it.
I received RTC compensation for this review.
Marked ready — tri-brain converged (6 review loops)Reviewed adversarially by Codex 5.5 + Grok (+ GPT-OSS where it finished) across 6 recurrent-depth loops. The open-finding count contracted every loop, which is the convergence signal:
Last-loop hardening (loop 5–6)
Reviewer attention — intentional behavior changes (not bugs)
Known limitation → follow-up (NOT in this PR)A process crash between the settling-commit and the final commit can still wedge an order in 55 otc tests pass. Your call on merge — and on whether the 502-on-failure contract is what you want for downstream consumers. |
The unifying fix
Every money-path mutation now claims its state transition atomically (compare-and-swap) before any irreversible escrow/payout call, and only acts if it won the claim.
match_ordercancel_orderopen→cancelledand commit before refunding — a commit failure can no longer roll back toopenwith escrow already gone (re-matchable with no escrow). Reportsescrow_refunded.auto-expire (
list_orders+match)AND status='open'— a stale expiry scan can't clobber/refund a just-matched row.expiredtransition before refunding (two-phase), same ordering as cancel.confirm_ordermatched→settlingclaim before any external call — serializes double-confirm by design, not by accident of payout idempotency.completedonly on real payout success (+ trade row + preimage);settlement_recoveryif escrow was released but payout failed (idempotent re-drive); revert tomatchedif escrow was never touched (retryable).changes()guard on the completed UPDATE → no orphan trade row.settlingclaim to a recoverable state — never wedged.escrow_releasedis marked before theacceptcall, so an accept timeout (funds may have moved, no response) is treated as ambiguous →settlement_recovery, never a retryablematchedthat could double-release.safe_refund_escrow()Raise-proof post-commit refund helper (logs + alerts, never 500s the caller after a durable commit). Used by the expiry and cancel paths.
Tests —
tests/test_otc_bridge_settlement_atomicity.py(8)payout-failure→
settlement_recovery(no trade,ok:false, 200); escrow-failure→matched(retryable); double-confirm rejected; mid-settlement exception recovery both before and after escrow release; cancel-of-matched doesn't refund. 53 otc tests pass.settling→accept; cancel-commit→refund) can still strand state/funds. In-request recovery covers the exception case; closing the crash case needs asettling_attimestamp + a reconciler/sweeper. Follow-up PR.ok= queued, not on-chain-confirmed — pre-existing optimistic completion. On-chain confirmation polling is a separate architectural change.completed/ok:trueeven when the payout failed (it lied). New:okreflects payout success,status∈ {completed,settlement_recovery,matched},trade_id/htlc_secretonly on success, still HTTP 200 (the frontend branches ondata.ok, not status code).🤖 Generated with Claude Code