Skip to content

fix transaction safety for hypergraph store writes#567

Merged
CassOnMars merged 5 commits into
QuilibriumNetwork:v2.1.0.24from
dazthecorgi:hypergraph-transaction-safety
Jun 15, 2026
Merged

fix transaction safety for hypergraph store writes#567
CassOnMars merged 5 commits into
QuilibriumNetwork:v2.1.0.24from
dazthecorgi:hypergraph-transaction-safety

Conversation

@dazthecorgi

Copy link
Copy Markdown
Contributor

A vertex-data persistence test failure surfaced a broader problem where hypergraph writes could escape their surrounding transaction: persisting for uncommitted frames, leaking via a read path, or leaving the in-memory tree out of sync with the store on retry.

  • fix vertex data roundtrip test and add new one aa5274b
  • use tx when saving vertex data db43992
    • thread the txn through save_vertex_underlying so that vertex data isn't persisted if the tx reverts.
  • make LazyVectorCommitmentTree::commit retry-safe c6c6398:
    • defer clearing dirty state until txn.commit() succeeds, so an aborted txn leaves the tree dirty for a safe retry (covers all staged writes, not just vertex data).
  • make compute_shard_root read-only 61db774:
    • extract compute_root() so root comparison on the hot master-stream path stops leaking writes via the NoopTxn fallback.
  • require RocksTxn for store writes 4b61ab5:
    • remove the silent direct-write fallback that masked the above bug; all writers now use the concrete RocksTxn batch and error loudly on an unrecognized txn.

See individual commit messages for full detail.

dazthecorgi and others added 5 commits June 14, 2026 17:18
Step 1 — Failing regression test (TDD)
Added get_vertex_data_not_visible_when_commit_txn_aborted to
crates/quil-rpc/tests/vertex_data_end_to_end.rs. It drives the real
lazy-commit path into a transaction, aborts it, and asserts GetVertexData
returns nothing. Confirmed it failed on current code before any fix.

Step 2 — save_vertex_underlying is now transaction-aware
- crates/quil-types/src/store.rs: added txn: &dyn Transaction to the trait
  method.
- crates/quil-store/src/hypergraph.rs: added inherent
  save_vertex_underlying_txn / save_tree_blob_txn using the existing
  with_rocks_batch pattern (stage into the batch, fall back to a direct
  write for unrecognized txns); the trait impl delegates to the txn
  variant. Kept the original inherent direct-write methods intact so the
  ~25 concrete-typed callers are untouched.

Step 3 — Threaded txn through the lazy-tree commit
crates/quil-tries/src/lazy_tree.rs: walk_leaves_persist now takes txn,
passes it recursively and into save_vertex_underlying; the call site passes
the txn that commit already receives. Leaf writes now join the same batch
as the tree nodes and shard commit.

Step 4 — Sync-apply path made atomic
crates/quil-rpc/src/hypergraph_sync_probe.rs: extracted a
persist_shard_refresh helper that wraps the tree-blob write and the
per-vertex loop in a single transaction (new_transaction → writes →
commit); on any error the txn is dropped, so partial syncs don't persist.
Used in both the fresh and incremental paths.

Step 5 — Implementors updated & verified
Added the _txn param to the four other HypergraphStore impls (BTreeStore,
MemStore, E2EHgStore, the quil-execution test store).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Make LazyVectorCommitmentTree::commit retry-safe by deferring dirty-state
clearing until the surrounding transaction is durably committed.

Problem: commit drained its dirty-node map and pending_deletions and
cleared dirty_flag before returning — i.e. before the caller's
txn.commit(). If the txn was aborted/failed/never committed, the store
stayed unchanged while the in-memory tree (which is reused across frames)
believed it had persisted, so a retry silently skipped re-staging the
missing writes.

Fix:
- crates/quil-tries/src/lazy_tree.rs: commit now takes a non-draining
  snapshot of the dirty set and pending_deletions to stage into the txn,
  and no longer clears dirty_flag. Removed the dead latest map. Added
  mark_persisted() — the sole place dirty bookkeeping is cleared; must be
  called only after txn.commit() succeeds. Idempotent.
- crates/quil-hypergraph/src/crdt.rs: HypergraphCRDT::commit collects the
  trees it staged and calls mark_persisted() on each only after
  txn.commit()? succeeds. A failed commit returns early, leaving every
  tree dirty for a safe retry.
- Side benefit: compute_shard_root (which commits under a NoopTxn) no
  longer clears dirty state, so the subsequent real commit re-stages those
  nodes transactionally.

Tests:
- crates/quil-tries/tests/golden_lazy_tree.rs — added
  commit_defers_dirty_clear_until_mark_persisted: proves the tree stays
  dirty after commit, that a retry after a discarded txn re-stages node
  writes, and that a clean tree (post mark_persisted) stages nothing.
  Added kv_len/clear_kv test helpers.
- crates/quil-rpc/tests/vertex_data_end_to_end.rs — folded the is_dirty()
  lifecycle into the existing real-RocksDB commit-path tests:
  - success path now asserts dirty after txn.commit(), clean after
    mark_persisted();
  - abort path now asserts the tree stays dirty (retry-safe).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
compute_shard_root claimed to read "without writing anything to the store"
but computed its root via tree.commit(&NoopTxn, ...). Since NoopTxn isn't a
RocksTxn, every write inside commit (tree nodes, deletions, and vertex
underlying blobs) fell through to a direct db.put, persisting data outside
any frame transaction for an uncommitted frame. This runs on the hot
master-stream receive path (maybe_sync_before_global_frame), so a routine
root comparison leaked writes to disk.

- quil-tries: extract LazyVectorCommitmentTree::compute_root(prover), the
  read-only half of commit — recomputes commitments in memory, returns the
  root, writes nothing and touches no dirty/deletion/dirty-flag state.
  Refactor commit to delegate to it.
- quil-hypergraph: compute_shard_root now calls tree.compute_root(); deleted
  the inline NoopTxn shim and the doubled doc comment.
- Add quil-hypergraph test compute_shard_root_is_read_only (with MemStore
  node_count/per_vertex_count accessors and a dirty-state helper): asserts a
  real root is returned, nothing is persisted, the tree stays dirty, and the
  root matches what a real commit then produces.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
With compute_shard_root no longer passing a NoopTxn, no caller ever hands a
non-RocksTxn to a RocksHypergraphStore write, so the silent direct-write
fallback only served to mask bugs like the one compute_shard_root hit.

Remove the with_rocks_batch helper. Replace it with
RocksTxn::from_dyn(&dyn Transaction) -> Result<&RocksTxn>, which downcasts
once and errors loudly on an unrecognized txn instead of writing outside the
transaction. Rewrite all writers (save_tree_blob_txn, save_vertex_underlying_txn,
insert_node, save_root, delete_node, set_shard_commit, set_alt_shard_commit,
track_change, untrack_change) to operate on the concrete RocksTxn batch
directly. The HypergraphStore trait signatures stay &dyn Transaction for
object-safety (five implementors, used as Arc<dyn HypergraphStore>).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment on lines +1063 to +1065
/// transaction. Either everything commits or (on error) nothing does —
/// the txn is dropped, which aborts the batch. Returns the number of
/// per-vertex blobs staged.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note that this is a change in behavior. previously partial failures were tolerated

@CassOnMars CassOnMars merged commit 435e967 into QuilibriumNetwork:v2.1.0.24 Jun 15, 2026
5 checks passed
@dazthecorgi dazthecorgi deleted the hypergraph-transaction-safety branch June 16, 2026 07:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants