Skip to content

test(DON'T MERGE): develop-v2.0.0-rc.1#2592

Draft
shuklaayush wants to merge 106 commits intodevelop-v2.0.0-betafrom
develop-v2.0.0-rc.1
Draft

test(DON'T MERGE): develop-v2.0.0-rc.1#2592
shuklaayush wants to merge 106 commits intodevelop-v2.0.0-betafrom
develop-v2.0.0-rc.1

Conversation

@shuklaayush
Copy link
Copy Markdown
Collaborator

No description provided.

@shuklaayush shuklaayush changed the title Develop v2.0.0 rc.1 test(DON'T MERGE): develop-v2.0.0-rc.1 Mar 19, 2026
@shuklaayush shuklaayush force-pushed the develop-v2.0.0-rc.1 branch from e9d2afb to deff2f2 Compare March 19, 2026 18:20
@branch-rebase-bot branch-rebase-bot Bot force-pushed the develop-v2.0.0-rc.1 branch 2 times, most recently from a6c7835 to 8d51084 Compare March 19, 2026 21:05
@shuklaayush shuklaayush force-pushed the develop-v2.0.0-rc.1 branch from 8d51084 to 743b94e Compare March 19, 2026 22:48
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Mar 19, 2026

Claude finished @shuklaayush's task in 3m 55s —— View job


Code Review for PR #2592

This is a large PR (100 files, +31590/-16895) covering several major changes. Below is a detailed review organized by area.


Overview of Changes

Area Status Summary
SHA2-air (replaces SHA256-air) ✅ Well-designed Generalized SHA-2 family support with clean trait abstraction
ColsRef derive macro ⚠️ Has issues New proc macro with some safety and correctness concerns
Memory system refactoring ✅ Sound Volatile memory + access adapters removed, persistent memory unified
VM architecture changes ✅ Correct Config simplified, constant_trace_height added cleanly
CUDA/GPU changes ✅ Consistent inventory.cu properly replaces access_adapters.cu
CI/benchmarks ℹ️ Note Benchmark CI disabled (intentional TODO)

SHA2-air Crate (replaces SHA256-air)

Verdict: Excellent work. The generalization from SHA-256 to the full SHA-2 family is complete and correct.

  • Config system (crates/circuits/sha2-air/src/config.rs): Clean trait-based abstraction with Sha256Config, Sha512Config, Sha384Config. All NIST FIPS 180-4 constants verified correct (rotation amounts, K values, H values).
  • AIR constraints (crates/circuits/sha2-air/src/air.rs): Proper constraint degree management (≤3), correct carry propagation, endianness handling consistent.
  • SOUNDNESS.md: Excellent documentation of the constraint scheme, particularly the "dummy value" approach to avoid conditional constraints.
  • Trace generation (crates/circuits/sha2-air/src/trace.rs): Two-pass algorithm is well-documented. Index bounds verified correct for all SHA-2 variants.

No critical issues found in this area.


ColsRef Derive Macro

Verdict: Functional but has several issues worth addressing.

Issues found:

  1. Copy-paste bug in panic message (crates/circuits/primitives/derive/src/cols_ref/mod.rs:450):

    panic!("Unsupported field type (in get_mut_cols_ref_fields): {:?}", f.ty);
    // ^ Should say "get_const_cols_ref_fields"
  2. Unsafe pointer casting without alignment verification (lines 341, 506): The generated code casts &[T] to &[AlignedBorrowType] without verifying alignment. If the AlignedBorrow type has alignment requirements different from T, this could cause UB.

  3. width() returns 0 for primitive types (lines 376-378, 541-543): The comment says "calling the width() method is undefined if the struct has a primitive type", but nothing prevents calling it. This would silently corrupt subsequent field parsing.

  4. Significant code duplication (lines 292-454 vs 457-619): get_const_cols_ref_fields and get_mut_cols_ref_fields are ~150 lines each and nearly identical. Consider a macro or shared helper to reduce maintenance burden.

  5. Tests lack assertions (tests/example.rs:24-28): The example test prints values but doesn't verify them:

    println!("{}, {}", test.arr, test.sum);  // no assertions!
  6. Missing edge case test coverage (tests/test_cols_ref.rs): No tests for wrong-sized input slices, misaligned data, or deeply nested arrays (3D+).


Memory System Refactoring

Verdict: Clean and complete refactoring.

  • Volatile memory (crates/vm/src/system/memory/volatile/) and access adapters (crates/vm/src/system/memory/adapter/) fully removed with no dangling references.
  • New PersistentBoundaryAir<CHUNK> (crates/vm/src/system/memory/persistent.rs) with per-block timestamps is well-designed.
  • Memory chunking math verified correct: CHUNK=8, DEFAULT_BLOCK_SIZE=4, BLOCKS_PER_CHUNK=2, all index bounds within range.

One item to verify (persistent.rs:94-98):

// Constrain that an 'initial' row has all timestamp zero.
// Formula: direction * (direction + 1) selects direction == 1
let mut when_initial =
    builder.when(local.expand_direction * (local.expand_direction + AB::F::ONE));

The comment says this selects direction == 1, but 1 * (1+1) = 2 ≠ 0. The formula is zero when direction ∈ {0, -1}. Tests pass so the builder semantics may work differently than expected, but the comment is misleading and should be clarified.


VM Architecture & CUDA Changes

Verdict: All correct and consistent.

  • Config simplification (crates/vm/src/arch/config.rs): block_size removed from config, replaced with DEFAULT_BLOCK_SIZE constant. SystemConfig cleanly unified.
  • constant_trace_height trait method (crates/circuits/primitives/src/chip.rs:14-20): Properly implemented across all chips. MeteredCtx correctly tracks constant vs variable heights to prevent unnecessary segment resets.
  • inventory.cu (crates/vm/cuda/src/system/inventory.cu): Clean two-pass algorithm (build candidates → scatter compact) replacing the old access_adapters approach. Proper Montgomery encoding documented.
  • Testing infrastructure properly supports both CPU and GPU backends.

CI & Benchmark Changes

  • Benchmark CI workflow is intentionally disabled with TODO comment: "re-enable after benchmarks are ported to v2 (native extensions removed)". Only workflow_dispatch and workflow_call remain active.
  • SHA2 rename consistently applied across all benchmark code (openvm_sha2_circuit, openvm_sha2_transpiler, openvm_sha2::Sha256).
  • SDK config properly maps sha256 field to Sha2TranspilerExtension.

Summary

This is a well-executed PR with solid architectural changes. The main areas for improvement are in the ColsRef derive macro (safety, testing, deduplication) and a misleading comment in the persistent memory constraint logic. The SHA2-air generalization and memory system refactoring are particularly well done.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Mar 20, 2026

Claude finished @shuklaayush's task in 6m 42s —— View job


Code Review for PR #2592

This is a large PR (100 files, +31590/-16895) covering several major changes. Below is a detailed review organized by area.


Overview

Area Status Summary
SHA2-air (replaces SHA256-air) ✅ Well-designed Generalized SHA-2 family support with clean trait abstraction
ColsRef derive macro ⚠️ Has issues New proc macro with safety, correctness, and duplication concerns
Memory system refactoring ✅ Sound Volatile memory + access adapters removed, persistent memory unified
VM architecture changes ✅ Correct Config simplified, constant_trace_height added cleanly
CUDA/GPU changes ✅ Consistent inventory.cu properly replaces access_adapters.cu
CI/benchmarks ℹ️ Note Benchmark CI disabled (intentional TODO)

SHA2-air Crate (replaces SHA256-air)

Verdict: Excellent work. No correctness or soundness issues found.

  • Constants (crates/circuits/sha2-air/src/config.rs): All K, H, and rotation/shift values verified correct against FIPS 180-4 for SHA-256, SHA-384, and SHA-512.
  • Constraint degree (crates/circuits/sha2-air/src/air.rs): Properly managed at ≤3 via slack values on non-applicable rows. Well documented in SOUNDNESS.md.
  • Carry propagation (crates/circuits/sha2-air/src/trace.rs:147-160): Correct 2-bit decomposition for 4-term additions with proper accumulation.
  • Array indexing: Verified all index bounds for ROUNDS_PER_ROW=4 and HASH_WORDS=8 — no off-by-one errors.
  • Edge cases: First block initialization, last block handling, padding rows, and block transitions all properly handled.
  • SOUNDNESS.md: Thorough documentation of the constraint scheme. The "dummy value" approach to avoid conditional constraints is well justified.

No issues found in this area.


ColsRef Derive Macro

Verdict: Functional but has several issues worth addressing.

Bug: Copy-paste error in panic message

crates/circuits/primitives/derive/src/cols_ref/mod.rs:450:

panic!("Unsupported field type (in get_mut_cols_ref_fields): {:?}", f.ty);
//                                  ^^^ Should be "get_const_cols_ref_fields"

Fix this →

Safety: Unsafe pointer casting without alignment verification

crates/circuits/primitives/derive/src/cols_ref/mod.rs:341,506: The generated code casts &[T] to &[AlignedBorrowType] without compile-time guarantees that size_of::<T>() is divisible by size_of::<AlignedBorrowType>(). This relies on runtime debug_asserts in AlignedBorrow::borrow(), which are stripped in release builds — misuse could cause UB.

High duplication between const/mut paths

crates/circuits/primitives/derive/src/cols_ref/mod.rs:292-453 vs 457-619: get_const_cols_ref_fields and get_mut_cols_ref_fields are ~160 lines each and ~95% identical. The only differences are split_at vs split_at_mut, &'a vs &'a mut, and ArrayView vs ArrayViewMut. Consider a parameterized helper to reduce maintenance burden.

width() is unsafe for structs with primitive types

crates/circuits/primitives/derive/src/cols_ref/mod.rs:376-378: For primitive type fields, length_expr is set to 0, making width() silently return wrong values. There's no compile-time check preventing this — it's documented in a comment but nothing enforces it.

Test coverage gaps

  • tests/example.rs:24-27: Test prints values but has no assertions
  • tests/test_cols_ref.rs: Missing tests for:
    • from_mut() correctness
    • Wrong-sized input slices
    • 3D+ arrays, empty arrays, single-element arrays
    • Width consistency verification (that width() matches actual slice consumption)

Memory System Refactoring

Verdict: Clean and complete. No soundness issues.

  • Volatile memory and access adapters fully removed — no dangling references found.
  • Direction constraint (persistent.rs:86-89): assert_eq(d, d*d*d) correctly constrains d ∈ {-1, 0, 1}.
  • Initial row timestamp constraint (persistent.rs:94-98): when(d * (d + 1)) correctly enables only when d = 1 (evaluates to 2, which is truthy). For d = -1: (-1) × 0 = 0 (disabled). For d = 0: 0 × 1 = 0 (disabled). This is correct — the previous review's concern was unfounded.
  • Compression bus interaction (persistent.rs:118): d * d correctly enables for both d = 1 and d = -1, disables for d = 0.
  • Memory chunking math: CHUNK=8, DEFAULT_BLOCK_SIZE=4, BLOCKS_PER_CHUNK=2 — all index calculations verified correct with no off-by-one errors.
  • Untouched block handling (persistent.rs:281-293): Correctly retains initial values and timestamp 0 for untouched blocks within partially-touched chunks.

VM Architecture & CUDA Changes

Verdict: All correct and consistent.

  • Config simplification (crates/vm/src/arch/config.rs:111-113): block_size removed, replaced with DEFAULT_BLOCK_SIZE = 4 constant. No lingering dynamic references found.

  • constant_trace_height trait (crates/circuits/primitives/src/chip.rs:14-20): Clean default None implementation. Properly collected in extensions.rs:586-596 with correct reverse ordering to match verifying key AIR order.

  • MeteredCtx (crates/vm/src/arch/execution_mode/metered/ctx.rs:38-48): Correctly unzips constant heights into (trace_heights, is_trace_height_constant) vectors. Segmentation logic properly skips padding checks for constant heights. reset_trace_heights() conditionally subtracts only variable heights.

  • CUDA inventory.cu: Two-pass algorithm (build candidates → prefix sum → scatter compact) is correct:

    • Pass 1 properly sets flags for new blocks and handles consecutive records in same chunk
    • CUB ExclusiveSum properly synchronized with cudaStreamPerThread
    • Pass 2 scatter uses positions from prefix sum, final thread writes output count
    • Montgomery encoding consistently applied and documented
  • Thread safety: All kernel launches properly error-checked with CHECK_KERNEL(). Stream synchronization correct throughout.


CI & Benchmark Changes

  • Benchmark CI workflow intentionally disabled with clear TODO: "re-enable after benchmarks are ported to v2 (native extensions removed)".
  • SHA2 rename consistently applied across benchmark code (openvm_sha2_circuit, openvm_sha2_transpiler, Sha2).
  • Minor typo at benchmarks/execute/benches/execute.rs:91: "Cachce" → should be "Cache".

Summary

This is a well-executed PR with solid architectural changes. The SHA2-air generalization, memory system refactoring, and VM config simplification are all sound and well-designed.

Items to address:

  1. Bug: Fix copy-paste panic message in ColsRef macro (mod.rs:450)
  2. Safety: Consider adding compile-time alignment assertions for unsafe casts in ColsRef macro
  3. Maintenance: Refactor duplicated const/mut paths in ColsRef macro (~300 lines)
  4. Testing: Add assertions and edge case coverage for ColsRef macro tests
  • Gather context and understand changes
  • Review SHA2-air crate
  • Review ColsRef derive macro
  • Review memory system refactoring
  • Review VM architecture changes
  • Review CUDA/GPU changes
  • Review CI/benchmark changes
  • Post final review

@branch-rebase-bot branch-rebase-bot Bot force-pushed the develop-v2.0.0-rc.1 branch 2 times, most recently from 3b2b382 to 4911c15 Compare March 20, 2026 20:13
@jpw-axiom jpw-axiom force-pushed the develop-v2.0.0-rc.1 branch from 4911c15 to cf270ee Compare March 21, 2026 04:02
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Mar 21, 2026

Claude finished @jpw-axiom's task in 3m 2s —— View job


Code Review for PR #2592 (latest sync)

This is a follow-up review after new commits were pushed. The PR now includes 46 commits across 100 files (+31496/-16874). I've reviewed all changes against the develop-v2.0.0-beta base, with particular focus on the 4 new commits since the last review.


New Commits Since Last Review

Commit Summary Status
cf270ee fix: limit cuda threads in xorin ✅ Correct
efa21c6 chore: fix lock file ✅ Housekeeping
666a706 fix: implement constant_trace_height for DeferralCircuitCountChip ✅ Correct
fe91db9 fix: rebase artifacts ✅ Cleanup

New Change: constant_trace_height for DeferralCircuitCountChip

extensions/deferral/circuit/src/count/trace.rs:41-43:

fn constant_trace_height(&self) -> Option<usize> {
    Some(next_power_of_two_or_zero(self.count.len()))
}

Verdict: Correct. The height is deterministic based on the number of deferral circuits (fixed at config time). The returned value matches the height used in generate_proving_ctx (same next_power_of_two_or_zero(self.count.len()) call), ensuring consistency. This correctly integrates with the MeteredCtx system which skips segmentation checks for constant-height chips.


New Change: CUDA Thread Limit in Xorin

extensions/keccak256/circuit/cuda/src/xorin.cu:190:

auto [grid, block] = kernel_launch_params(height, 512);

Verdict: Correct. Limits block size to 512 threads to reduce per-thread stack pressure. Consistent with other high-stack CUDA kernels in the codebase (merkle_tree.cu, eq_3b.cu, symbolic_expression.cu all use 256-512 limits).


New Change: Deferral Adapter Cleanup

extensions/deferral/circuit/src/adapters.rs — Updated SAFETY comments to remove stale "minimum alignment of 1" references, and simplified read/write calls to remove the now-unnecessary alignment const generic parameter. The deleted file extensions/rv32im/circuit/src/adapters/deferral.rs was a duplicate of functionality already in the deferral extension's own adapters.rs.

Verdict: Clean cleanup, no issues.


Outstanding Issues from Full PR (Carried Forward)

These were identified in prior reviews and remain unaddressed:

1. Bug: Copy-paste error in panic message

crates/circuits/primitives/derive/src/cols_ref/mod.rs:450:

panic!("Unsupported field type (in get_mut_cols_ref_fields): {:?}", f.ty);
//                                  ^^^ This is inside get_const_cols_ref_fields — should say "get_const_cols_ref_fields"

Fix this →

2. ColsRef derive macro: high code duplication

get_const_cols_ref_fields (lines 292-454) and get_mut_cols_ref_fields (lines 457-619) are ~160 lines each and ~85% identical. Only differences are split_at vs split_at_mut, &'a vs &'a mut, and ArrayView vs ArrayViewMut. Consider a parameterized helper or macro to reduce maintenance burden.

3. ColsRef derive macro: test gaps

  • tests/example.rs:24-27 uses println! with no assertions
  • tests/test_cols_ref.rs lacks tests for: from_mut() correctness, wrong-sized input slices, deeply nested arrays (3D+)

4. Minor typo

benchmarks/execute/benches/execute.rs:91: "Cachce""Cache"


Areas Verified Sound (No Changes Needed)

  • SHA2-air crate: All FIPS 180-4 constants correct. Constraint degree properly managed at ≤3. SOUNDNESS.md is thorough.
  • Memory system refactoring: Direction constraint logic in persistent.rs:86-98 is correct. d * (d + 1) correctly selects only d == 1 (evaluates to 2 = truthy; for d=-1: 0, d=0: 0).
  • VM config simplification: DEFAULT_BLOCK_SIZE = 4 consistently replaces dynamic block_size.
  • CUDA inventory.cu: Two-pass algorithm (build candidates → prefix sum → scatter compact) is correct with proper Montgomery encoding.
  • MeteredCtx integration: Constant trace heights correctly separated from variable heights in segmentation logic.

Summary

The 4 new commits are all correct and well-targeted fixes. The overall PR remains well-executed with solid architectural changes. The only actionable items are the minor copy-paste bug in the ColsRef panic message, code duplication in the derive macro, and a typo in benchmarks.

  • Gather context on latest changes
  • Review new commits in detail
  • Verify outstanding issues from prior reviews
  • Post final review

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Mar 22, 2026

Claude finished @shuklaayush's task in 3s —— View job


Code Review for PR #2592

This is a review after the latest sync (48 commits, 100 files, +31589/-16894). This review covers all changes against develop-v2.0.0-beta.


Overview

Area Status Summary
SHA2-air (replaces SHA256-air) ✅ Well-designed Generalized SHA-2 family support with clean trait abstraction
ColsRef derive macro ⚠️ Has issues New proc macro with a copy-paste bug, safety concerns, and duplication
Memory system refactoring ✅ Sound Volatile memory + access adapters removed, persistent memory unified
VM architecture changes ✅ Correct Config simplified, constant_trace_height added cleanly
CUDA/GPU changes ✅ Consistent inventory.cu properly replaces access_adapters.cu
CI/benchmarks ℹ️ Disabled Benchmark CI disabled (intentional TODO), guest programs reference missing types

New Commits Since Last Review

Commit Summary Status
ffb1783 chore: bump aws crates ✅ Lock file only
cf270ee fix: limit cuda threads in xorin ✅ Correct (512 thread limit)
efa21c6 chore: fix lock file ✅ Housekeeping
666a706 fix: constant_trace_height for DeferralCircuitCountChip ✅ Correct
fe91db9 fix: rebase artifacts ✅ Cleanup

All new commits are correct. The constant_trace_height implementation in extensions/deferral/circuit/src/count/trace.rs:41-43 returns Some(next_power_of_two_or_zero(self.count.len())), which matches the height computed in generate_proving_ctx (line 47). The xorin CUDA thread limit at extensions/keccak256/circuit/cuda/src/xorin.cu:190 is consistent with other high-stack kernels in the codebase.


SHA2-air Crate

Verdict: Excellent. No correctness or soundness issues found.

  • FIPS 180-4 constants (K, H, rotation/shift values) verified correct for SHA-256, SHA-384, and SHA-512
  • Constraint degree properly managed at ≤3 via slack values
  • SOUNDNESS.md provides thorough documentation of the constraint scheme
  • Carry propagation and index bounds verified correct for all variants

ColsRef Derive Macro

Verdict: Functional but has issues to address.

Bug: Copy-paste error in panic message

crates/circuits/primitives/derive/src/cols_ref/mod.rs:450: This panic is inside get_const_cols_ref_fields (which spans lines 292-454), but the message incorrectly says "get_mut_cols_ref_fields":

panic!("Unsupported field type (in get_mut_cols_ref_fields): {:?}", f.ty);
//                                  ^^^ Should be "get_const_cols_ref_fields"

Fix this →

Safety concern: Unsafe pointer cast without alignment verification

crates/circuits/primitives/derive/src/cols_ref/mod.rs:341: The generated code casts &[T] to &[#elem_type] via raw pointer:

let #slice_var: &[#elem_type] = unsafe { &*(#slice_var as *const [T] as *const [#elem_type]) };

This relies on runtime debug_asserts in the AlignedBorrow trait which are stripped in release builds. Consider adding a compile-time static_assert that size_of::<T>() is compatible with size_of::<#elem_type>().

High code duplication

get_const_cols_ref_fields (lines 292-454) and get_mut_cols_ref_fields (lines 457-619) are ~160 lines each and ~90% identical. The only differences are split_at vs split_at_mut, &'a vs &'a mut, and ArrayView vs ArrayViewMut. A parameterized helper or macro would reduce maintenance burden.

Test has no assertions

crates/circuits/primitives/derive/tests/example.rs:27:

println!("{}, {}", test.arr, test.sum);  // no assertions

At minimum, assert that test.sum == &15 and test.arr.len() == 5.


Memory System Refactoring

Verdict: Clean and complete.

  • Volatile memory and access adapters fully removed with no dangling references
  • Direction constraint (persistent.rs:86-89): assert_eq(d, d³) correctly constrains d ∈ {-1, 0, 1}
  • Initial row selection (persistent.rs:94-95): when(d * (d + 1)) correctly enables only for d = 1 (evaluates to 2, truthy) while disabling for d = -1 (→ 0) and d = 0 (→ 0)
  • Compression bus (persistent.rs): d * d correctly enables for both d = ±1 and disables for d = 0
  • Memory chunking math (CHUNK=8, DEFAULT_BLOCK_SIZE=4, BLOCKS_PER_CHUNK=2) verified correct

VM Architecture & CUDA Changes

Verdict: All correct and consistent.

  • Config simplification: block_size removed from SystemConfig, replaced with DEFAULT_BLOCK_SIZE = 4 constant. No lingering dynamic references
  • constant_trace_height trait (crates/circuits/primitives/src/chip.rs): Clean default None implementation. Correctly collected in extensions.rs with reverse ordering to match verifying key AIR order
  • MeteredCtx (crates/vm/src/arch/execution_mode/metered/ctx.rs): Correctly separates constant vs variable trace heights. Segmentation logic properly skips padding checks for constant heights
  • CUDA inventory.cu: Two-pass algorithm (build candidates → prefix sum → scatter compact) is correct. Proper Montgomery encoding, CUB synchronization, and error checking throughout
  • Deferral adapter cleanup: SAFETY comments properly updated, alignment const generic removed

CI & Benchmark Changes

  • Benchmark crates commented out of workspace members (Cargo.toml:12-15) with TODO to port to v2
  • Benchmark CI workflow disabled, only workflow_dispatch and workflow_call remain
  • Note: Benchmark guest programs (benchmarks/guest/sha256/src/main.rs, benchmarks/guest/sha256_iter/src/main.rs) reference openvm_sha2::Sha256 which doesn't exist in the codebase yet — expected since these are disabled
  • Minor typo at benchmarks/execute/benches/execute.rs:91: "Cachce" → should be "Cache"

Summary

This is a well-executed PR with solid architectural changes. The SHA2-air generalization, memory system refactoring, and VM config simplification are all sound. The new commits since the last review are all correct and well-targeted.

Items to address:

  1. Bug: Fix copy-paste panic message in ColsRef macro (mod.rs:450) — wrong function name in error
  2. Safety: Consider compile-time alignment assertions for unsafe casts in ColsRef macro
  3. Maintenance: Refactor duplicated const/mut paths in ColsRef macro (~300 lines of near-identical code)
  4. Testing: Add assertions to ColsRef example test
  5. Minor: Fix "Cachce" typo in benchmarks (low priority since benchmarks are disabled)
  • Gather context on latest changes
  • Review new commits since last review
  • Review SHA2-air crate changes
  • Review ColsRef derive macro
  • Review memory system refactoring
  • Review VM architecture & CUDA changes
  • Review CI/benchmark changes
  • Post final review

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Mar 22, 2026

Claude finished @shuklaayush's task in 3m 30s —— View job


Code Review for PR #2592

This is a review after the latest sync (48 commits, 100 files, +31609/-17094). All changes reviewed against develop-v2.0.0-beta.


Overview

Area Status Summary
SHA2-air (replaces SHA256-air) ✅ Excellent Generalized SHA-2 family support with clean trait abstraction
ColsRef derive macro ⚠️ Has issues New proc macro with a copy-paste bug, safety concerns, and duplication
Memory system refactoring ✅ Sound Volatile memory + access adapters removed, persistent memory unified
VM architecture changes ✅ Correct Config simplified, constant_trace_height added cleanly
CUDA/GPU changes ✅ Consistent inventory.cu properly replaces access_adapters.cu
CI/benchmarks ℹ️ Disabled Benchmark CI disabled (intentional TODO)

SHA2-air Crate

Verdict: Excellent. No correctness or soundness issues found.

  • FIPS 180-4 constants (config.rs): All K, H, rotation/shift values verified correct for SHA-256, SHA-384, and SHA-512.
  • Constraint degree (air.rs): Properly managed at ≤3 via slack values. Comments at lines 430-431, 473, 564-565 explicitly document degree constraints.
  • Carry propagation (trace.rs:147-160, utils.rs:277-305): Correct 2-bit decomposition for 4-term additions with proper limb-by-limb carry chain.
  • SOUNDNESS.md: Thorough documentation of the constraint scheme. The "dummy value" approach is well justified.

ColsRef Derive Macro

Verdict: Functional but has several issues.

1. Bug: Copy-paste error in panic message

crates/circuits/primitives/derive/src/cols_ref/mod.rs:450: This panic is inside get_const_cols_ref_fields but the message says "get_mut_cols_ref_fields":

panic!("Unsupported field type (in get_mut_cols_ref_fields): {:?}", f.ty);
//                                  ^^^ Should be "get_const_cols_ref_fields"

Fix this →

2. Safety: Unsafe pointer cast without alignment verification

mod.rs:341 (const) and mod.rs:506 (mut): The generated code casts &[T] to &[#elem_type] via raw pointer without compile-time guarantees that size_of::<T>() is compatible with size_of::<#elem_type>(). This relies on runtime debug_asserts in AlignedBorrow::borrow() which are stripped in release builds. Consider adding a compile-time const_assert! for size compatibility.

3. High code duplication

get_const_cols_ref_fields (lines 292-454) and get_mut_cols_ref_fields (lines 457-619) are ~160 lines each and ~90% identical. Only differences are split_at vs split_at_mut, &'a vs &'a mut, and ArrayView vs ArrayViewMut. A parameterized helper or internal macro would cut ~150 lines of duplication.

4. Test gaps

  • tests/example.rs:24-27: Uses println! with no assertions — should at minimum verify expected values
  • tests/test_cols_ref.rs: Good structural coverage but missing tests for from_mut() correctness, wrong-sized input slices, and width consistency verification

Memory System Refactoring

Verdict: Clean and complete.

  • Volatile memory and access adapters fully removed with no dangling references.
  • Direction constraint (persistent.rs:85-89): assert_eq(d, d³) correctly constrains d ∈ {-1, 0, 1}.
  • Initial row selection (persistent.rs:91-98): when(d * (d + 1)) correctly enables only for d = 1 (evaluates to 2, truthy). For d = -1: (-1) × 0 = 0 (disabled). For d = 0: 0 × 1 = 0 (disabled).
  • Compression bus (persistent.rs): d * d correctly enables for both d = ±1 and disables for d = 0.
  • Memory chunking (CHUNK=8, DEFAULT_BLOCK_SIZE=4, BLOCKS_PER_CHUNK=2): All index calculations verified correct — block_idx = (ptr % 8) / 4 ∈ [0, 1], no off-by-one errors.

VM Architecture & CUDA Changes

Verdict: All correct and consistent.

  • Config simplification (config.rs:113): DEFAULT_BLOCK_SIZE = 4 correctly replaces dynamic block_size. No lingering dynamic references.
  • constant_trace_height trait (chip.rs:14-20): Clean default None implementation. Collection in extensions.rs:586-596 correctly reverses chip order to match verifying key AIR order.
  • MeteredCtx (ctx.rs:38-48): Correctly unzips constant heights into (trace_heights, is_trace_height_constant) vectors. Segmentation logic properly skips padding checks for constant heights.
  • DeferralCircuitCountChip (count/trace.rs:41-43): Returns Some(next_power_of_two_or_zero(self.count.len())), matching the height in generate_proving_ctx. Correct integration with MeteredCtx.
  • CUDA inventory.cu: Two-pass algorithm (build candidates → CUB exclusive sum → scatter compact) is correct. Proper Montgomery encoding, flag-based filtering, and CHECK_KERNEL() error checking.
  • CUDA boundary.cu: Poseidon2Buffer uses atomicAdd for thread-safe slot allocation. Each pre-image entry is independent, so non-deterministic ordering is correct.
  • Xorin thread limit (xorin.cu:190): 512-thread limit consistent with other high-stack CUDA kernels.

CI & Benchmark Changes

  • Benchmark crates commented out of workspace members with TODO to port to v2.
  • Benchmark CI workflow intentionally disabled (only workflow_dispatch and workflow_call remain).
  • Minor typo at benchmarks/execute/benches/execute.rs:91: "Cachce" → should be "Cache" (low priority since disabled).

Summary

This is a well-executed PR with solid architectural changes. The SHA2-air generalization, memory system refactoring, and VM config simplification are all sound and well-designed.

Items to address:

  1. Bug: Fix copy-paste panic message in ColsRef macro (mod.rs:450) — wrong function name in error string
  2. Safety: Consider compile-time alignment assertions for unsafe casts in ColsRef macro (mod.rs:341, mod.rs:506)
  3. Maintenance: Refactor duplicated const/mut paths in ColsRef macro (~300 lines of near-identical code)
  4. Testing: Add assertions to ColsRef example test and edge case coverage
  5. Minor: Fix "Cachce" typo in benchmarks (low priority)
  • Gather context on latest changes since last review
  • Review new/changed files
  • Check outstanding issues from prior reviews
  • Post final review

@branch-rebase-bot branch-rebase-bot Bot force-pushed the develop-v2.0.0-rc.1 branch from 71d3584 to f9268e6 Compare March 23, 2026 13:39
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Mar 23, 2026

Claude finished @shuklaayush's task in 2m 34s —— View job


Code Review for PR #2592

This is a review after the latest sync (48 commits, 100 files, +31714/-16972). All changes reviewed against develop-v2.0.0-beta.


Overview

Area Status Summary
SHA2-air (replaces SHA256-air) ✅ Excellent Generalized SHA-2 family support with clean trait abstraction
ColsRef derive macro ⚠️ Has issues New proc macro with a copy-paste bug, safety concerns, and duplication
Memory system refactoring ✅ Sound Volatile memory + access adapters removed, persistent memory unified
VM architecture changes ✅ Correct Config simplified, constant_trace_height added cleanly
CUDA/GPU changes ✅ Excellent inventory.cu, keccakf two-phase optimization
CI/benchmarks ℹ️ Disabled Benchmark CI disabled (intentional TODO)

New Commit: Keccakf CUDA Two-Phase Optimization (#2599)

Verdict: Excellent optimization, no correctness issues found.

The redesign from single-kernel (thread-per-permutation) to two-phase (compute then write) is well-motivated and correctly implemented:

  • Phase 1 (keccakf_perm_phase1, 128 threads/block): One thread per permutation runs all 24 keccak-f rounds via apply_round_in_place, storing 25-lane u64 round-input states to scratch. Low thread count appropriate for the high register pressure (~25 u64 state + loop variables).

  • Phase 2 (keccakf_perm_phase2, 256 threads/block): One thread per row loads cached round state and recomputes trace columns via generate_trace_row_for_round. Adjacent threads write adjacent rows, achieving coalesced column-major stores.

Verified correct:

  • Rho/pi cycle constants (RHO_PI_CYCLE_IDX/RHO_PI_CYCLE_ROT): Spot-checked several entries — cycle[0]=1 (position (1,0), R[1][0]=1 matches rot[23]=1), cycle[1]=6 (position (1,1), R[1][1]=44 matches rot[0]=44), etc. The in-place cycle permutation with single temp correctly applies rho+pi.
  • Chi in-place with 2 temps: Saves [0] and [1], modifies [0..4] in order — each step uses only original values. Standard optimization, verified correct.
  • apply_round_in_place matches generate_trace_row_for_round: Both have identical theta→rho/pi→chi→iota logic; the trace function additionally writes intermediate values to columns.
  • Preimage columns (phase 2, line 114): Correctly reads from round 0 offset (perm_idx * NUM_ROUNDS * 25), which is the permutation input state.
  • Dummy rows: Zero-initialized state processed through keccak-f produces deterministic valid trace data.
  • Scratch buffer sizing: Rust allocates blocks_to_fill * NUM_ROUNDS * 25 u64s; CUDA asserts round_state_words >= blocks_to_fill * 24 * 25. Consistent.
  • __noinline__ on device functions: Good practice to manage register pressure in high-register kernels.

The ~1.9% scratch overhead (4800 bytes/perm vs 252864 bytes/perm for trace) is negligible.


Outstanding Issues (Carried Forward)

These were identified in prior reviews and remain unaddressed:

1. Bug: Copy-paste error in panic message

crates/circuits/primitives/derive/src/cols_ref/mod.rs:450: This panic is inside get_const_cols_ref_fields (lines 292-454) but the message says "get_mut_cols_ref_fields":

panic!("Unsupported field type (in get_mut_cols_ref_fields): {:?}", f.ty);
//                                  ^^^ Should be "get_const_cols_ref_fields"

Fix this →

2. ColsRef derive macro: high code duplication

get_const_cols_ref_fields (lines 292-454) and get_mut_cols_ref_fields (lines 457-619) are ~160 lines each and ~90% identical. Only differences are split_at vs split_at_mut, &'a vs &'a mut, and ArrayView vs ArrayViewMut. A parameterized helper or internal macro would cut ~150 lines of duplication.

3. ColsRef derive macro: test gaps

  • tests/example.rs:24-27 uses println! with no assertions
  • tests/test_cols_ref.rs lacks tests for from_mut() correctness and wrong-sized input slices

4. Minor typo

benchmarks/execute/benches/execute.rs:91: "Cachce" → should be "Cache" (low priority since benchmarks are disabled)


Summary

The new keccakf CUDA two-phase optimization is a well-designed and correctly implemented performance improvement. The architectural separation between compute (phase 1) and trace-write (phase 2) cleanly solves the memory coalescing bottleneck while adding minimal scratch overhead.

The overall PR remains well-executed. The only actionable items are the minor copy-paste bug in the ColsRef panic message, code duplication in the derive macro, and test coverage gaps.

  • Gather context on latest changes
  • Review new keccakf CUDA optimization commit
  • Check outstanding issues from prior reviews
  • Post final review

@shuklaayush shuklaayush force-pushed the develop-v2.0.0-rc.1 branch from 271868d to 7a23ceb Compare March 23, 2026 16:55
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Mar 23, 2026

Claude finished @shuklaayush's task in 2m 1s —— View job


Code Review for PR #2592

This is a review after the latest sync (48 commits, 100 files, +31714/-16983). All changes reviewed against develop-v2.0.0-beta.


Overview

Area Status Summary
SHA2-air (replaces SHA256-air) ✅ Excellent Generalized SHA-2 family support with clean trait abstraction
ColsRef derive macro ⚠️ Has issues New proc macro with a copy-paste bug, safety concerns, and duplication
Memory system refactoring ✅ Sound Volatile memory + access adapters removed, persistent memory unified
VM architecture changes ✅ Correct Config simplified, constant_trace_height added cleanly
CUDA/GPU changes ✅ Excellent inventory.cu, keccakf two-phase optimization
CI/benchmarks ℹ️ Disabled Benchmark CI disabled (intentional TODO)

SHA2-air Crate

Verdict: Excellent. No correctness or soundness issues found. FIPS 180-4 constants verified, constraint degree managed at ≤3, SOUNDNESS.md is thorough.


ColsRef Derive Macro

Verdict: Functional but has several issues.

1. Bug: Copy-paste error in panic message

crates/circuits/primitives/derive/src/cols_ref/mod.rs:450: This panic is inside get_const_cols_ref_fields (lines 292-454) but the message says "get_mut_cols_ref_fields":

panic!("Unsupported field type (in get_mut_cols_ref_fields): {:?}", f.ty);
//                                  ^^^ Should be "get_const_cols_ref_fields"

Fix this →

2. Safety: Unsafe pointer cast without alignment verification

mod.rs:341 (const) and mod.rs:506 (mut): The generated code casts &[T] to &[#elem_type] via raw pointer:

let #slice_var: &[#elem_type] = unsafe { &*(#slice_var as *const [T] as *const [#elem_type]) };

This relies on runtime debug_asserts in AlignedBorrow::borrow() which are stripped in release builds. A compile-time const_assert! on size compatibility would prevent silent UB.

3. High code duplication

get_const_cols_ref_fields (lines 292-454) and get_mut_cols_ref_fields (lines 457-619) are ~160 lines each and ~90% identical. Only differences are split_at vs split_at_mut, &'a vs &'a mut, and ArrayView vs ArrayViewMut. A parameterized helper or internal macro would cut ~150 lines of duplication.

4. Test has no assertions

crates/circuits/primitives/derive/tests/example.rs:27:

println!("{}, {}", test.arr, test.sum);  // no assertions

At minimum, assert that test.sum == &15 and test.arr.len() == 5.


Memory System Refactoring

Verdict: Clean and complete.

  • Volatile memory and access adapters fully removed with no dangling references.
  • Direction constraint (persistent.rs:85-89): assert_eq(d, d³) correctly constrains d ∈ {-1, 0, 1}.
  • Initial row selection (persistent.rs:91-98): when(d * (d + 1)) correctly enables only for d = 1 (evaluates to 2, truthy). For d = -1: (-1) × 0 = 0 (disabled). For d = 0: 0 × 1 = 0 (disabled).
  • Compression bus (persistent.rs:118): d * d correctly enables for both d = ±1 and disables for d = 0.
  • Memory chunking (CHUNK=8, DEFAULT_BLOCK_SIZE=4, BLOCKS_PER_CHUNK=2): All index calculations verified correct.
  • Per-block timestamps (persistent.rs:49-52, persistent.rs:286-293): Untouched blocks within a touched chunk correctly retain initial values and timestamp 0, which balances the boundary bus interaction (init sends at t=0, final receives at t=0).

VM Architecture & CUDA Changes

Verdict: All correct and consistent.

  • constant_trace_height trait (chip.rs:14-20): Clean default None. Collection in extensions.rs:586-596 correctly uses .rev() to match verifying key AIR order.
  • MeteredCtx (ctx.rs:38-48): Correctly unzips constant heights. reset_trace_heights (segment_ctx.rs:415-430) properly skips subtraction for constant-height chips via is_trace_height_constant.
  • CUDA inventory.cu: Two-pass algorithm (build candidates → CUB exclusive sum → scatter compact) is correct. The same_output_block check correctly handles the case where two consecutive input records fall in the same chunk (lines 102-110). Montgomery encoding consistently applied for initial memory reads (line 63).
  • Keccakf two-phase optimization: Phase 1 (128 threads, compute-only) and Phase 2 (256 threads, coalesced trace writes) is a well-designed optimization. Scratch overhead is negligible (~1.9%).

Items to Address

  1. Bug: Fix copy-paste panic message in ColsRef macro (mod.rs:450) — wrong function name
  2. Safety: Consider compile-time alignment assertions for unsafe casts in ColsRef macro
  3. Maintenance: Refactor duplicated const/mut paths in ColsRef macro (~300 lines)
  4. Testing: Add assertions to ColsRef example test (example.rs:27)
  5. Minor: Fix "Cachce" typo at benchmarks/execute/benches/execute.rs:91 (low priority since benchmarks are disabled)

  • Gather context on latest changes
  • Review new/changed files since last review
  • Check outstanding issues from prior reviews
  • Post final review

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Mar 23, 2026

Claude finished @shuklaayush's task in 1m 37s —— View job


Code Review for PR #2592

This is a review after the latest sync (49 commits, 100 files, +31740/-17012). All changes reviewed against develop-v2.0.0-beta.


Overview

Area Status Summary
SHA2-air (replaces SHA256-air) ✅ Excellent Generalized SHA-2 family support with clean trait abstraction
ColsRef derive macro ⚠️ Has issues New proc macro with a copy-paste bug, safety concerns, and duplication
Memory system refactoring ✅ Sound Volatile memory + access adapters removed, persistent memory unified
VM architecture changes ✅ Correct Config simplified, constant_trace_height added cleanly
CUDA/GPU changes ✅ Excellent inventory.cu, keccakf two-phase optimization
Deferral cleanup ✅ Correct Volatile references and MEMORY_OP_SIZE removed
CI/benchmarks ℹ️ Disabled Benchmark CI disabled (intentional TODO)

New Commit: Deferral Volatile and MEMORY_OP_SIZE Cleanup

Verdict: Correct and complete.

Commit 90a9f21 removes MEMORY_OP_SIZE usage and volatile test builder references from deferral circuit tests:

  • MEMORY_OP_SIZEDEFAULT_BLOCK_SIZE (both are 4, consistent with the memory refactoring)
  • VmChipTestBuilder::volatile()VmChipTestBuilder::from_config() (volatile builder removed)
  • GpuChipTestBuilder::volatile()GpuChipTestBuilder::new() (same)

Verified: MEMORY_OP_SIZE is fully removed from the Rust side of deferral (extensions/deferral/circuit/src/). The CUDA headers (def_types.h:16) still define MEMORY_OP_SIZE = 4 which is correct — CUDA code needs its own constant and the value matches DEFAULT_BLOCK_SIZE. No volatile references remain in the test builders.


Outstanding Issues (Carried Forward)

These were identified in prior reviews and remain unaddressed:

1. Bug: Copy-paste error in panic message

crates/circuits/primitives/derive/src/cols_ref/mod.rs:450: This panic is inside get_const_cols_ref_fields (lines 292-454) but the message says "get_mut_cols_ref_fields":

panic!("Unsupported field type (in get_mut_cols_ref_fields): {:?}", f.ty);
//                                  ^^^ Should be "get_const_cols_ref_fields"

Fix this →

2. Safety: Unsafe pointer cast without alignment verification

mod.rs:341 (const) and mod.rs:506 (mut): The generated code casts &[T] to &[#elem_type] via raw pointer without compile-time guarantees. This relies on runtime debug_asserts in AlignedBorrow::borrow() which are stripped in release builds. A compile-time const_assert! on size compatibility would prevent silent UB.

3. High code duplication

get_const_cols_ref_fields (lines 292-454) and get_mut_cols_ref_fields (lines 457-619) are ~160 lines each and ~90% identical. A parameterized helper or internal macro would cut ~150 lines of duplication.

4. ColsRef test has no assertions

crates/circuits/primitives/derive/tests/example.rs:27: Uses println! with no assertions — should at minimum verify expected values.

5. Minor typo

benchmarks/execute/benches/execute.rs:91: "Cachce" → should be "Cache" (low priority since benchmarks are disabled)


Summary

The new deferral cleanup commit is correct and consistent with the broader memory system refactoring. The overall PR remains well-executed with solid architectural changes across SHA2-air, memory system, VM config, and CUDA. The only actionable items are the minor issues in the ColsRef derive macro (copy-paste bug, safety, duplication, test coverage).

  • Gather context on latest changes
  • Review new commit (90a9f21 deferral cleanup)
  • Check outstanding issues from prior reviews
  • Post final review

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Mar 23, 2026

Claude finished @shuklaayush's task in 2m 23s —— View job


Code Review for PR #2592

This is a review after the latest sync (50 commits, 100 files, +31762/-17012). All changes reviewed against develop-v2.0.0-beta.


Overview

Area Status Summary
SHA2-air (replaces SHA256-air) ✅ Excellent Generalized SHA-2 family support with clean trait abstraction
ColsRef derive macro ⚠️ Has issues New proc macro with a copy-paste bug, safety concerns, and duplication
Memory system refactoring ✅ Sound Volatile memory + access adapters removed, persistent memory unified
VM architecture changes ✅ Correct Config simplified, constant_trace_height added cleanly
CUDA/GPU changes ✅ Excellent inventory.cu, keccakf two-phase optimization
Deferral test fixes ✅ Correct Block-size-4 reads and volatile removal
CI/benchmarks ℹ️ Disabled Benchmark CI disabled (intentional TODO)

New Commits Since Last Review

Commit Summary Status
90a9f21 fix(deferral): remove volatile and MEMORY_OP_SIZE ✅ Correct
ce72dd3 fix(deferral): read deferral as in block size of 4 in tests ✅ Correct

Deferral Test Fixes (ce72dd3)

Verdict: Correct. The read methods in both MemoryTester and DeviceMemoryTester now special-case DEFERRAL_AS to read as F instead of u8, since the deferral address space stores field elements directly. The new read_deferral_digest helper properly reads in DEFAULT_BLOCK_SIZE chunks and joins them via join_memory_ops.

The pointer casts output_ptr.to_le_bytes()(output_ptr as u32).to_le_bytes() are correct — output_ptr is usize and these are pointers that fit in 32 bits.

Note on asymmetry: The write method in MemoryTester (crates/vm/src/arch/testing/memory/mod.rs:52-67) and DeviceMemoryTester (cuda.rs:90-99) still always writes as u8 without a DEFERRAL_AS special case. This is fine for now since deferral tests never call tester.write() with DEFERRAL_AS (writes happen through the executor), but if future tests need to write to DEFERRAL_AS via the tester, they'll silently truncate field elements to u8. Consider adding a matching special case to write for consistency, or adding a debug_assert!(addr_space as u32 != DEFERRAL_AS) in the write path to catch misuse.

Deferral Volatile Cleanup (90a9f21)

Verdict: Correct. MEMORY_OP_SIZEDEFAULT_BLOCK_SIZE (both are 4), VmChipTestBuilder::volatile()VmChipTestBuilder::from_config(), GpuChipTestBuilder::volatile()GpuChipTestBuilder::new(). All consistent with the memory system refactoring.


Outstanding Issues (Carried Forward)

These were identified in prior reviews and remain unaddressed:

1. Bug: Copy-paste error in panic message

crates/circuits/primitives/derive/src/cols_ref/mod.rs:450: This panic is inside get_const_cols_ref_fields (lines 292-454) but the message says "get_mut_cols_ref_fields":

panic!("Unsupported field type (in get_mut_cols_ref_fields): {:?}", f.ty);
//                                  ^^^ Should be "get_const_cols_ref_fields"

Fix this →

2. Safety: Unsafe pointer cast without alignment verification

mod.rs:341 (const) and mod.rs:506 (mut): The generated code casts &[T] to &[#elem_type] via raw pointer without compile-time guarantees that size_of::<T>() is compatible with size_of::<#elem_type>(). This relies on runtime debug_asserts in AlignedBorrow::borrow() which are stripped in release builds. A compile-time const_assert! on size compatibility would prevent silent UB.

3. High code duplication

get_const_cols_ref_fields (lines 292-454) and get_mut_cols_ref_fields (lines 457-619) are ~160 lines each and ~90% identical. Only differences are split_at vs split_at_mut, &'a vs &'a mut, and ArrayView vs ArrayViewMut. A parameterized helper or internal macro would cut ~150 lines of duplication.

4. ColsRef test has no assertions

crates/circuits/primitives/derive/tests/example.rs:27: Uses println! with no assertions — should at minimum verify expected values.

5. Minor typo

benchmarks/execute/benches/execute.rs:91: "Cachce" → should be "Cache" (low priority since benchmarks are disabled)


Summary

The 2 new commits are correct and well-targeted fixes for the deferral test infrastructure. The overall PR remains well-executed with solid architectural changes across SHA2-air, memory system, VM config, and CUDA.

Items to address (priority order):

  1. Bug: Fix copy-paste panic message in ColsRef macro (mod.rs:450)
  2. Consistency: Consider adding DEFERRAL_AS special case to write in test memory testers (or a debug assert to prevent misuse)
  3. Safety: Consider compile-time alignment assertions for unsafe casts in ColsRef macro
  4. Maintenance: Refactor duplicated const/mut paths in ColsRef macro (~300 lines)
  5. Testing: Add assertions to ColsRef example test
  6. Minor: Fix "Cachce" typo in benchmarks (low priority)
  • Gather context on latest changes since last review
  • Review new commits
  • Check outstanding issues from prior reviews
  • Post final review

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Mar 23, 2026

Claude finished @shuklaayush's task in 1m 19s —— View job


Code Review for PR #2592

This is a review after the latest sync (51 commits, 100 files, +31766/-17012). All changes reviewed against develop-v2.0.0-beta.


Overview

Area Status Summary
SHA2-air (replaces SHA256-air) ✅ Excellent Generalized SHA-2 family support with clean trait abstraction
ColsRef derive macro ⚠️ Has issues New proc macro with a copy-paste bug, safety concerns, and duplication
Memory system refactoring ✅ Sound Volatile memory + access adapters removed, persistent memory unified
VM architecture changes ✅ Correct Config simplified, constant_trace_height added cleanly
CUDA/GPU changes ✅ Excellent inventory.cu, keccakf two-phase optimization
Deferral fixes ✅ Correct Block-size-4 reads, volatile removal, GPU constant_trace_height
CI/benchmarks ℹ️ Disabled Benchmark CI disabled (intentional TODO)

New Commit: GPU constant_trace_height for DeferralCircuitCountChip

Verdict: Correct. Commit a4f3988 adds constant_trace_height to the GPU version of DeferralCircuitCountChipGpu at extensions/deferral/circuit/src/count/cuda.rs:19-21:

fn constant_trace_height(&self) -> Option<usize> {
    Some(next_power_of_two_or_zero(self.num_deferral_circuits))
}

This mirrors the CPU implementation at extensions/deferral/circuit/src/count/trace.rs:41-43 which uses self.count.len(). Both are consistent — num_deferral_circuits on the GPU side equals count.len() on the CPU side (set at construction time). The returned value matches the height computed in generate_proving_ctx (line 29), ensuring correctness with the MeteredCtx system.


Outstanding Issues (Carried Forward)

These were identified in prior reviews and remain unaddressed:

1. Bug: Copy-paste error in panic message

crates/circuits/primitives/derive/src/cols_ref/mod.rs:450: This panic is inside get_const_cols_ref_fields (lines 292-454) but the message says "get_mut_cols_ref_fields":

panic!("Unsupported field type (in get_mut_cols_ref_fields): {:?}", f.ty);
//                                  ^^^ Should be "get_const_cols_ref_fields"

Fix this →

2. Safety: Unsafe pointer cast without alignment verification

mod.rs:341 (const) and mod.rs:506 (mut): The generated code casts &[T] to &[#elem_type] via raw pointer without compile-time guarantees that size_of::<T>() is compatible with size_of::<#elem_type>(). This relies on runtime debug_asserts in AlignedBorrow::borrow() which are stripped in release builds. A compile-time const_assert! on size compatibility would prevent silent UB.

3. High code duplication

get_const_cols_ref_fields (lines 292-454) and get_mut_cols_ref_fields (lines 457-619) are ~160 lines each and ~90% identical. Only differences are split_at vs split_at_mut, &'a vs &'a mut, and ArrayView vs ArrayViewMut. A parameterized helper or internal macro would cut ~150 lines of duplication.

4. ColsRef test has no assertions

crates/circuits/primitives/derive/tests/example.rs:27: Uses println! with no assertions — should at minimum verify expected values.

5. Minor typo

benchmarks/execute/benches/execute.rs:91: "Cachce" → should be "Cache" (low priority since benchmarks are disabled)


Summary

The new commit correctly adds constant_trace_height to the GPU deferral count chip, completing the CPU/GPU parity for this trait. The overall PR remains well-executed with solid architectural changes across SHA2-air, memory system, VM config, and CUDA.

Items to address (priority order):

  1. Bug: Fix copy-paste panic message in ColsRef macro (mod.rs:450)
  2. Safety: Consider compile-time alignment assertions for unsafe casts in ColsRef macro
  3. Maintenance: Refactor duplicated const/mut paths in ColsRef macro (~300 lines)
  4. Testing: Add assertions to ColsRef example test
  5. Minor: Fix "Cachce" typo in benchmarks (low priority)
  • Gather context on latest changes since last review
  • Review new commits
  • Check outstanding issues from prior reviews
  • Post final review

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Mar 23, 2026

Claude finished @shuklaayush's task in 2m 35s —— View job


Code Review for PR #2592

This is a review after the latest sync (52 commits, 100 files, +31787/-17010). All changes reviewed against develop-v2.0.0-beta.


Overview

Area Status Summary
SHA2-air (replaces SHA256-air) ✅ Excellent Generalized SHA-2 family support with clean trait abstraction
ColsRef derive macro ⚠️ Has issues New proc macro with a copy-paste bug, safety concerns, and duplication
Memory system refactoring ✅ Sound Volatile memory + access adapters removed, persistent memory unified
VM architecture changes ✅ Correct Config simplified, constant_trace_height added cleanly
CUDA/GPU changes ✅ Excellent inventory.cu, keccakf two-phase, deferral field-element fix
CI/benchmarks ℹ️ Disabled Benchmark CI disabled (intentional TODO)

New Commit: Load Deferral AS Elements as Felts in CUDA Memory Kernels (#2607)

Verdict: Correct and necessary fix.

Commit fac6e28 fixes a critical bug where DEFERRAL_AS memory cells (which store field elements at 4 bytes/cell) were being read as individual bytes (1 byte/cell) in both boundary.cu and inventory.cu. This would break LogUp bus interactions because the Montgomery-encoded values would be misinterpreted.

Changes verified:

  • boundary.cu:52-60: For DEFERRAL_AS, uses FpArray<8>::from_raw_array(reinterpret_cast<uint32_t const *>(initial_mem[addr_space_idx]) + record.ptr) — correctly reads 4-byte Montgomery-encoded cells. For other address spaces, continues using FpArray<8>::from_u8_array().
  • inventory.cu:62-76: Same pattern — DEFERRAL_AS reads via reinterpret_cast<uint32_t const *>(mem) + chunk_ptr, others read as u8 and convert to Montgomery form via Fp(mem[byte_offset + i]).asRaw().
  • Both are consistent with the Rust-side configuration where DEFERRAL_AS defaults to MemoryCellType::field32() (config.rs:191) while register/memory spaces use MemoryCellType::U8 (config.rs:193-197).

One concern: The hardcoded DEFERRAL_AS = 4 in both .cu files duplicates the Rust constant (openvm_instructions::DEFERRAL_AS). The TODO comments acknowledge this, but if the value ever changes on the Rust side, the CUDA code would silently break. Consider generating this constant from Rust or adding a static assert in the Rust ↔ CUDA interface layer.


Outstanding Issues (Carried Forward)

These were identified in prior reviews and remain unaddressed:

1. Bug: Copy-paste error in panic message

crates/circuits/primitives/derive/src/cols_ref/mod.rs:450: This panic is inside get_const_cols_ref_fields (lines 292-454) but the message says "get_mut_cols_ref_fields":

panic!("Unsupported field type (in get_mut_cols_ref_fields): {:?}", f.ty);
//                                  ^^^ Should be "get_const_cols_ref_fields"

Fix this →

2. Safety: Unsafe pointer cast without alignment verification

mod.rs:341 (const) and mod.rs:506 (mut): The generated code casts &[T] to &[#elem_type] via raw pointer without compile-time guarantees that size_of::<T>() is compatible with size_of::<#elem_type>(). This relies on runtime debug_asserts in AlignedBorrow::borrow() which are stripped in release builds. A compile-time const_assert! on size compatibility would prevent silent UB.

3. High code duplication

get_const_cols_ref_fields (lines 292-454) and get_mut_cols_ref_fields (lines 457-619) are ~160 lines each and ~90% identical. Only differences are split_at vs split_at_mut, &'a vs &'a mut, and ArrayView vs ArrayViewMut. A parameterized helper or internal macro would cut ~150 lines of duplication.

4. ColsRef test has no assertions

crates/circuits/primitives/derive/tests/example.rs:27: Uses println! with no assertions — should at minimum verify expected values.

5. Minor typo

benchmarks/execute/benches/execute.rs:91: "Cachce" → should be "Cache" (low priority since benchmarks are disabled)


Areas Verified Sound

  • SHA2-air crate: All FIPS 180-4 constants correct. Constraint degree managed at ≤3 via slack values. SOUNDNESS.md is thorough.
  • Memory system refactoring: Direction constraint d = d³ correctly constrains d ∈ {-1, 0, 1}. Initial row selection when(d * (d+1)) correctly enables only for d = 1. Compression bus d * d correctly enables for d = ±1. Memory chunking math (CHUNK=8, DEFAULT_BLOCK_SIZE=4, BLOCKS_PER_CHUNK=2) verified correct.
  • VM config simplification: DEFAULT_BLOCK_SIZE = 4 consistently replaces dynamic block_size.
  • constant_trace_height trait: Clean default None. Collection in extensions.rs correctly reverses order. Both CPU and GPU deferral count chips implement it consistently.
  • MeteredCtx: Correctly separates constant vs variable trace heights. reset_trace_heights properly skips subtraction for constant-height chips.
  • CUDA inventory.cu: Two-pass algorithm (build candidates → CUB exclusive sum → scatter compact) is correct.
  • Keccakf two-phase optimization: Phase 1 (128 threads, compute-only) and Phase 2 (256 threads, coalesced writes) — correct with negligible scratch overhead.

Summary

This is a well-executed PR with solid architectural changes. The latest commit (fac6e28) correctly fixes the critical DEFERRAL_AS field-element handling in CUDA memory kernels. The SHA2-air generalization, memory system refactoring, and VM config simplification are all sound.

Items to address (priority order):

  1. Bug: Fix copy-paste panic message in ColsRef macro (mod.rs:450)
  2. Robustness: Address hardcoded DEFERRAL_AS = 4 in .cu files (add a cross-language assertion or generation)
  3. Safety: Consider compile-time alignment assertions for unsafe casts in ColsRef macro
  4. Maintenance: Refactor duplicated const/mut paths in ColsRef macro (~300 lines)
  5. Testing: Add assertions to ColsRef example test
  6. Minor: Fix "Cachce" typo in benchmarks (low priority)
  • Gather context on latest changes
  • Review new commit (fac6e28 deferral field-element fix)
  • Review key areas (SHA2-air, ColsRef, memory, VM arch, CUDA)
  • Post final review

@branch-rebase-bot branch-rebase-bot Bot force-pushed the develop-v2.0.0-rc.1 branch 4 times, most recently from d4719da to 05f4de6 Compare March 25, 2026 15:16
@shuklaayush shuklaayush force-pushed the develop-v2.0.0-rc.1 branch from 05f4de6 to a0ee68f Compare March 25, 2026 20:28
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Mar 25, 2026

Claude finished @shuklaayush's task in 2m 15s —— View job


Code Review for PR #2592

This is a review after the latest sync (53 commits, 324 files, +31788/-17009). All changes reviewed against develop-v2.0.0-beta.


Overview

Area Status Summary
SHA2-air (replaces SHA256-air) ✅ Excellent Generalized SHA-2 family support with clean trait abstraction
ColsRef derive macro ⚠️ Has issues New proc macro with a copy-paste bug, safety concerns, and duplication
Memory system refactoring ✅ Sound Volatile memory + access adapters removed, persistent memory unified
VM architecture changes ✅ Correct Config simplified, constant_trace_height added cleanly
CUDA/GPU changes ✅ Excellent inventory.cu, keccakf two-phase, deferral field-element fix
Benchmarks/CI ✅ Cleaned up Rebase fixes properly remove stale deps

Latest Commit: a0ee68f chore: rebase fixes

Verdict: Correct. Removes stale dependencies (openvm-continuations, openvm-sdk, bitcode, serde) from benchmark crates and fixes unused import issues in execute.rs. The generate-fixtures feature and its deps are properly removed from benchmarks/utils/Cargo.toml. Clean rebase artifact cleanup.


Outstanding Issues (Carried Forward)

These were identified in prior reviews and remain unaddressed:

1. Bug: Copy-paste error in panic message

crates/circuits/primitives/derive/src/cols_ref/mod.rs:450: This panic is inside get_const_cols_ref_fields (lines 292-454) but the message says "get_mut_cols_ref_fields":

panic!("Unsupported field type (in get_mut_cols_ref_fields): {:?}", f.ty);
//                                  ^^^ Should be "get_const_cols_ref_fields"

Fix this →

2. Safety: Unsafe pointer cast without alignment verification

mod.rs:341 (const) and mod.rs:506 (mut): The generated code casts &[T] to &[#elem_type] via raw pointer without compile-time guarantees that size_of::<T>() is compatible with size_of::<#elem_type>(). This relies on runtime debug_asserts in AlignedBorrow::borrow() which are stripped in release builds. A compile-time const_assert! on size compatibility would prevent silent UB.

3. High code duplication

get_const_cols_ref_fields (lines 292-454) and get_mut_cols_ref_fields (lines 457-619) are ~160 lines each and ~90% identical. Only differences are split_at vs split_at_mut, &'a vs &'a mut, and ArrayView vs ArrayViewMut. A parameterized helper or internal macro would cut ~150 lines of duplication.

4. ColsRef test has no assertions

crates/circuits/primitives/derive/tests/example.rs:27: Uses println! with no assertions — should at minimum verify expected values:

println!("{}, {}", test.arr, test.sum);  // no assertions

5. Hardcoded DEFERRAL_AS = 4 in CUDA files

boundary.cu:9 and inventory.cu:23 both hardcode DEFERRAL_AS = 4, duplicating the Rust constant at crates/toolchain/instructions/src/lib.rs:21. If the value changes on the Rust side, the CUDA code would silently break. Consider generating this constant from Rust or adding a cross-language assertion.


Areas Verified Sound

  • SHA2-air crate: All FIPS 180-4 constants correct. Constraint degree managed at ≤3. SOUNDNESS.md is thorough.
  • Memory system: Direction constraint d = d³ correctly constrains d ∈ {-1, 0, 1}. Initial row selection when(d * (d+1)) correctly enables only for d = 1. Compression bus d * d enables for d = ±1. Memory chunking (CHUNK=8, DEFAULT_BLOCK_SIZE=4, BLOCKS_PER_CHUNK=2) verified correct.
  • VM config: DEFAULT_BLOCK_SIZE = 4 consistently replaces dynamic block_size.
  • constant_trace_height trait: Clean default None. Both CPU and GPU deferral count chips implement it consistently.
  • MeteredCtx: Correctly separates constant vs variable trace heights. Segmentation logic properly skips padding checks for constant heights.
  • CUDA inventory.cu: Two-pass algorithm (build candidates → CUB exclusive sum → scatter compact) is correct.
  • CUDA boundary.cu: DEFERRAL_AS field-element handling correctly reads 4-byte Montgomery-encoded cells.
  • Keccakf two-phase optimization: Phase 1 (128 threads, compute) and Phase 2 (256 threads, coalesced writes) is correct.
  • Benchmarks rebase: "Cachce" typo fixed. Stale deps removed.

Summary

This is a well-executed PR. The latest rebase fixes commit properly cleans up stale benchmark dependencies. The SHA2-air generalization, memory system refactoring, VM config simplification, and CUDA changes are all sound.

Items to address (priority order):

  1. Bug: Fix copy-paste panic message in ColsRef macro (mod.rs:450)
  2. Robustness: Address hardcoded DEFERRAL_AS = 4 in .cu files
  3. Safety: Consider compile-time alignment assertions for unsafe casts in ColsRef macro
  4. Maintenance: Refactor duplicated const/mut paths in ColsRef macro (~300 lines)
  5. Testing: Add assertions to ColsRef example test
  • Gather context on latest changes since last review
  • Review new/changed files
  • Check outstanding issues from prior reviews
  • Post final review

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Mar 25, 2026

Claude finished @jonathanpwang's task in 1m 0s —— View job


Code Review for PR #2592

This is a review after the latest sync (54 commits, 343 files, +31829/-17012). All changes reviewed against develop-v2.0.0-beta.


Overview

Area Status Summary
SHA2-air (replaces SHA256-air) ✅ Excellent Generalized SHA-2 family support with clean trait abstraction
ColsRef derive macro ⚠️ Has issues New proc macro with a copy-paste bug, safety concerns, and duplication
Memory system refactoring ✅ Sound Volatile memory + access adapters removed, persistent memory unified
VM architecture changes ✅ Correct Config simplified, constant_trace_height added cleanly
CUDA/GPU changes ✅ Excellent inventory.cu, keccakf two-phase, deferral field-element fix
Benchmarks/CI ✅ Correct Guest programs updated, stale deps removed, CI properly configured

Latest Commits (since last review)

Commit Summary Status
2ea2f43 chore(v2-rc.1): update benchmark guest programs (#2623) ✅ Correct
a0ee68f chore: rebase fixes ✅ Clean dependency removal

Benchmark guest programs (2ea2f43): SHA2 interface migration is clean — sha256() function calls correctly updated to Sha256::new() / Sha256::digest() API. revm programs properly add openvm-algebra-guest and openvm-ecc-guest dependencies with correct openvm.toml configs for Secp256k1. All ELF binaries regenerated.

Rebase fixes (a0ee68f): Correctly removes stale openvm-continuations, openvm-sdk, bitcode, serde dependencies from benchmark crates. Import deduplication in execute.rs is clean.


SHA2-air Crate

Verdict: Excellent. No correctness or soundness issues found.

  • FIPS 180-4 constants (config.rs): All K, H, rotation/shift values verified correct for SHA-256, SHA-384, and SHA-512.
  • Constraint degree (air.rs): Properly managed at ≤3 for v2. The optimization from commit 755b407 correctly removed when_transition() wrappers and uses dummy values on non-applicable rows instead.
  • Block chaining soundness (air.rs:317-373): Uses private bus for interaction-based self-constraints. Last block wraps to first block correctly via select.
  • Carry propagation (trace.rs:147-160): Correct 16-bit limb carry extraction. Bounds documented: carry_a ≤ 6, carry_e ≤ 5.
  • Two-pass trace generation (trace.rs): Pass 1 generates message schedule and work variables; Pass 2 fills remaining intermediate values. Wrap-around edge case properly handled.
  • SOUNDNESS.md: Thorough documentation of the constraint scheme.

ColsRef Derive Macro

Verdict: Functional but has several issues worth addressing.

1. Bug: Copy-paste error in panic message

crates/circuits/primitives/derive/src/cols_ref/mod.rs:450: This panic is inside get_const_cols_ref_fields (lines 292-454) but the message says "get_mut_cols_ref_fields":

panic!("Unsupported field type (in get_mut_cols_ref_fields): {:?}", f.ty);
//                                  ^^^ Should be "get_const_cols_ref_fields"

Fix this →

2. Safety: Unsafe pointer cast without alignment verification

mod.rs:341 (const) and mod.rs:506 (mut): The generated code casts &[T] to &[#elem_type] via raw pointer without compile-time guarantees that size_of::<T>() is compatible with size_of::<#elem_type>(). This relies on runtime debug_asserts in AlignedBorrow::borrow() which are stripped in release builds. A compile-time const_assert! on size compatibility would prevent silent UB.

3. High code duplication

get_const_cols_ref_fields (lines 292-454) and get_mut_cols_ref_fields (lines 457-619) are ~160 lines each and ~90% identical. Only differences are split_at vs split_at_mut, &'a vs &'a mut, and ArrayView vs ArrayViewMut. A parameterized helper or internal macro would cut ~150 lines of duplication.

4. Test has no assertions

crates/circuits/primitives/derive/tests/example.rs:27: Uses println! with no assertions — should at minimum verify expected values.


Memory System Refactoring

Verdict: Clean and complete. No soundness issues.

  • Direction constraint (persistent.rs:85-89): assert_eq(d, d³) correctly constrains d ∈ {-1, 0, 1}.
  • Initial row selection (persistent.rs:91-98): when(d * (d + 1)) correctly enables only for d = 1 (evaluates to 2, truthy). For d = -1: (-1) × 0 = 0 (disabled). For d = 0: 0 × 1 = 0 (disabled).
  • Compression bus (persistent.rs): d * d correctly enables for both d = ±1 and disables for d = 0.
  • Memory chunking (CHUNK=8, DEFAULT_BLOCK_SIZE=4, BLOCKS_PER_CHUNK=2): All index calculations verified correct.

VM Architecture & CUDA Changes

Verdict: All correct and consistent.

  • Config simplification (config.rs): DEFAULT_BLOCK_SIZE = 4 correctly replaces dynamic block_size. No lingering dynamic references.
  • constant_trace_height trait (chip.rs:14-20): Clean default None. Collection in extensions.rs:586-596 correctly uses .rev() to match verifying key AIR order.
  • MeteredCtx (ctx.rs): Correctly separates constant vs variable trace heights. AIR ID constants (BOUNDARY_AIR_ID, MERKLE_AIR_ID) verified correct with proper debug_assert!s.
  • CUDA DEFERRAL_AS fix (boundary.cu, inventory.cu): Correctly reads field elements as 4-byte Montgomery-encoded cells for DEFERRAL_AS (commit 90c7686).
  • CUDA inventory.cu: Two-pass algorithm (build candidates → CUB exclusive sum → scatter compact) is correct.
  • Keccakf two-phase optimization: Phase 1 (128 threads, compute-only) and Phase 2 (256 threads, coalesced writes) — correct with negligible scratch overhead.

Note: Hardcoded DEFERRAL_AS = 4 in CUDA files

boundary.cu:9 and inventory.cu:23 both hardcode DEFERRAL_AS = 4, duplicating the Rust constant at crates/toolchain/instructions/src/lib.rs:21. Both files have TODO comments acknowledging this. If the value ever changes on the Rust side, the CUDA code would silently break. Consider generating this constant from Rust or adding a cross-language assertion.


Items to Address (Priority Order)

  1. Bug: Fix copy-paste panic message in ColsRef macro (mod.rs:450) — wrong function name in error string
  2. Robustness: Address hardcoded DEFERRAL_AS = 4 in .cu files (add a cross-language assertion or generation)
  3. Safety: Consider compile-time alignment assertions for unsafe casts in ColsRef macro (mod.rs:341, mod.rs:506)
  4. Maintenance: Refactor duplicated const/mut paths in ColsRef macro (~300 lines of near-identical code)
  5. Testing: Add assertions to ColsRef example test (example.rs:27)

  • Gather context on latest changes since last review
  • Review new/changed files
  • Review SHA2-air crate
  • Review ColsRef derive macro
  • Review memory system refactoring
  • Review VM architecture & CUDA changes
  • Review CI/benchmark changes
  • Post final review

@github-actions

This comment has been minimized.

@shuklaayush shuklaayush force-pushed the develop-v2.0.0-rc.1 branch from 2ea2f43 to f2c28ec Compare March 26, 2026 19:42
shuklaayush and others added 28 commits April 24, 2026 19:54
deletes the custom sccache action and uses runson
depends on #2685

resolves int-6953
Co-authored-by: Ayush Shukla <ayush@axiom.xyz>
Closes INT-6464

## CUDA Explicit Stream Migration (OpenVM side)

Companion to
[stark-backend#317](openvm-org/stark-backend#317).
Adds `cudaStream_t stream` to all OpenVM CUDA launchers and FFI
wrappers, injects `DeviceContext` through the VM builder and chip
construction APIs, removes all PTDS references, and lifts `DeviceCtx` to
a first-class associated type on `ProverDevice` — eliminating
`MaybeDeviceContext`, `#[cfg]` duplication, and conditional `device_ctx:
Option<&DeviceContext>` parameters.

**224 files changed, +2902 / -1353**

See
[cuda-stream-migration-design.md](https://github.com/openvm-org/v2-proof-system/blob/test/stream-migration/cuda-stream-migration-design.md)
for full design rationale.

---

## CUDA launchers (~67 `.cu` files + 2 `.cuh` headers)

Every `extern "C"` launcher gains `cudaStream_t stream` as final
parameter. All kernel launches use `<<<grid, block, 0, stream>>>`. All
CUB calls (`DeviceScan`, `DeviceReduce`, `DeviceMergeSort`) pass
explicit `stream`. `scan.cuh` and `affine_scan.cuh` had `=
cudaStreamPerThread` default parameters removed — `stream` is now
required.

---

## FFI wrappers (15 `cuda_abi.rs` / `abi.rs` files)

Every `extern "C"` declaration and safe Rust wrapper gains `stream:
cudaStream_t`. Covers `crates/vm`, `crates/circuits/primitives`,
`crates/circuits/poseidon2-air`, `crates/recursion` (6 files), and all
extensions (`rv32im`, `keccak256`, `bigint`, `sha2`, `deferral`).

---

## VM builder — `DeviceContext` injection

`VmBuilder::create_chip_complex` gains `device: &E::PD` parameter.
`VirtualMachine::new` passes `engine.device()` through. 25
implementations updated (19 CPU impls ignore it, 6 GPU impls extract
`DeviceContext`).

---

## Stateful GPU chip constructors

GPU chips that allocate persistent `DeviceBuffer`s now store
`device_ctx: DeviceContext` and use `_on` allocation variants:

- `BitwiseOperationLookupChipGPU` — histogram accumulator
- `VariableRangeCheckerChipGPU` — range check accumulator
- `RangeTupleCheckerChipGPU` — range tuple accumulator
- `Poseidon2ChipGPU` — shared hash records buffer + index counter
- `MemoryMerkleTree` — merkle tree device state

The `Chip::generate_proving_ctx(&self, records)` trait is **unchanged**
— chips access `self.device_ctx` internally.

---

## VM system CUDA modules (`crates/vm/src/system/cuda/`)

All GPU system modules use `self.device_ctx.stream.as_raw()`:
`poseidon2.rs`, `memory.rs`, `boundary.rs`, `phantom.rs`, `program.rs`,
`merkle_tree/mod.rs`.

---

## `DeviceCtx` as associated type on `ProverDevice` (87 files, -630 net
lines)

The final refactor lifts `DeviceContext` from a `#[cfg(feature =
"cuda")]`-gated parameter to a first-class associated type on
`ProverDevice` (defined in stark-backend#317):

```rust
// In ProverDevice (stark-backend):
type DeviceCtx: Clone + Send + Sync;
fn device_ctx(&self) -> &Self::DeviceCtx;
// GpuDevice: DeviceCtx = DeviceContext
// CpuDevice: DeviceCtx = ()
```

This eliminates:
- `MaybeDeviceContext` trait + all 3 impls + `device_ctx_for_engine()`
helper — **deleted entirely**
- ~6 pairs of `#[cfg(feature = "cuda")]` / `#[cfg(not)]` duplicated
methods in continuations provers (`inner/mod.rs`,
`deferral/hook/mod.rs`, `deferral/inner/mod.rs`) — **unified into single
versions**
- ~30 `#[cfg(feature = "cuda")] device_ctx: Option<&DeviceContext>`
conditional parameters — **replaced with unconditional `&DC` generic**
- cfg-duplicated `VerifierTraceGen` trait methods in
`recursion/src/system/mod.rs` — **unified**

Callers now use `engine.device().device_ctx()` directly. The
`EngineDeviceCtx<E>` type alias avoids spelling out the full associated
type path.

---

## Recursion — `VerifierTraceGen` and module tracegen

`VerifierTraceGen` and `InnerTraceGen` traits gain a `DC` generic
parameter for the device context type. `generate_proving_ctxs` accepts
`&DC` unconditionally. Stream synchronization at proof boundaries via
`device_ctx.stream.synchronize()`.

---

## Continuations + guest verifier

`InnerTraceGen`, `DeferralHookTraceGen`, `DeferralInnerTraceGen`,
`RootTraceGen` — all accept `&DC` and use the engine-owned context.
`agg_prove`, `from_pk`, `new` — single versions, no `#[cfg]`
duplication. Guest verifier circuit trace generation passes device
context through.

---

## Extensions

All extension GPU modules pass `stream` to FFI calls: `rv32im` (all
ALU/branch/load/store cuda modules), `sha2`, `keccak256`, `bigint`,
`deferral`.

---

## Removed

- `--default-stream=per-thread` from `scripts/bin/ptx_details.rs`
- All `cudaStreamPerThread` imports and usages
- All `current_stream_sync()` / `current_stream_id()` usages
- `MaybeDeviceContext` trait + `device_ctx_for_engine()` helper
- Temporary `DeviceContext` escape hatches (fresh streams created
outside engine)
- ~400 lines of `#[cfg]` duplication in continuations provers

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Summary

Refactors the SHA-2 block hasher CUDA tracegen kernel from a monolithic
1-thread-per-block design into a three-phase pipeline for better memory
coalescing on column-major traces.

| Phase | Kernel | Parallelism | Purpose |

|-------|--------------------------|-----------------|----------------------------------------------------------------|
| 1 | `sha2_first_pass_phase1` | 1 thread/block | Compute SHA-2 rounds,
snapshot `{a..h, w_buf}` to scratch |
| 2 | `sha2_first_pass_phase2` | 1 thread/row | Read snapshot, write
trace columns (coalesced) |
| 3 | `sha2_first_pass_phase3` | 1 thread/block | Cross-row dependencies
(`intermed_4/8/12`, carry) |

Also drops unused parameters (`ptr_max_bits`,
`range_checker_ptr`/`num_bins`, `timestamp_max_bits`) from the block
hasher kernel and its Rust-side wrappers, simplifying
`Sha2BlockHasherChipGpu`.

## Changes

- `extensions/sha2/circuit/cuda/src/sha2_hasher.cu`: split first-pass
tracegen into three phases; tightened `CHECK_KERNEL` error propagation
in `launch_sha2_second_pass_dependencies`.
- `extensions/sha2/circuit/src/cuda/{mod.rs,cuda_abi.rs}`: wire the new
scratch buffer and drop unused wrapper arguments.
- `extensions/sha2/circuit/src/extension/cuda.rs`: update
`Sha2BlockHasherChipGpu::new` call sites.
- `extensions/sha2/circuit/src/sha2_chips/tests.rs`: update tests to
match the new constructor signature.

## Correctness notes

- Phase 1 snapshots state before each row's rounds; Phase 2 restores
that snapshot and replays the rounds to compute intermediate trace
values. The round computations are identical across phases.
- Scratch is sized as `num_blocks * ROWS_PER_BLOCK * (8 + BLOCK_WORDS)`
words, exactly matching what's written/read (no oversize).
- Phase 2 launches `num_blocks * ROWS_PER_BLOCK` threads, so every
thread maps to a valid block.
- Padding rows are still handled by the existing
`sha2_fill_first_dummy_row` + `sha2_fill_invalid_rows` kernels.
- Digest row: Phase 1 snapshots the final state (after all rounds) for
`row_in_block == ROUND_ROWS`; Phase 2 reads it to compute `final_hash =
prev_hash + work_vars`.

## Test plan

- [ ] `cargo nextest run --cargo-profile=fast -p openvm-sha2-circuit`
- [ ] CUDA tracegen tests for both SHA-256 and SHA-512 variants (feature
`cuda`)
- [ ] Verify trace matrices match CPU reference for representative
inputs

resolves int-6985
Towards INT-7429.

 ## Summary

- Mark `finalize_ptr`, `finalize`, and `native_keccak256` as `unsafe fn`
in both `zkvm_impl.rs` and `host_impl.rs`. These functions write 32
bytes to an output pointer/slice via `copy_nonoverlapping` without
runtime size checks (only `debug_assert!`), so callers must guarantee
the buffer is at least 32 bytes.
- Add `assert!` in the `tiny_keccak::Hasher` trait impl in `lib.rs`
since the trait method must be safe and cannot propagate `unsafe` to
callers.
- Safe public APIs (`keccak256()`, `set_keccak256()`) are unchanged —
they pass correctly-sized buffers internally.

**TODO:** The `openvm-org/tiny-keccak` fork and
`openvm-org/openvm-primitives` also call `Keccak256::finalize` /
`native_keccak256` and will need matching updates (add `unsafe {}`
blocks and/or `assert!` at their trait boundaries).

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: Jonathan Wang <jonathanpwang@users.noreply.github.com>
…ebra+ecc + update sha2 docs (#2710)

Resolves INT-7439.
Towards INT-7429.

## Summary

- Fix OOB read in `native_xorin` when `len` is not 4-byte aligned:
`AlignedBuf::new` copied `adjusted_len` bytes from source buffers,
reading 1-3 bytes past the valid region. Replace with
`AlignedBuf::uninit` + `copy_nonoverlapping` of only `len` bytes.
- Mark `native_xorin` and `native_keccakf` as `unsafe extern "C" fn`
since they take raw pointers and require callers to guarantee pointer
validity and buffer sizes. Add `unsafe` blocks in callers `xorin` and
`keccakf` with safety comments.
- Update keccak extension docs code example to reflect the new unsafe
signatures.

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Towards INT-7478.

- Replace `assert_eq!` with
`Err(StaticProgramError::InvalidInstruction)` in multiplication executor
`pre_compute_impl` for both bigint (mult.rs) and rv32im
(mul/execution.rs). Functions returning `Result` should use error
returns instead of panics, consistent with other executors in the
codebase.
- Add `/// # Safety` doc comments to all 12 `unsafe extern "C" fn` in
bigint guest externs (externs.rs), documenting pointer validity, buffer
size, and alignment preconditions. Follows precedent set by PRs #2706
(keccak) and #2708 (sha2).
- Document unused branch opcode variants.

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Resolves INT-7541. Issue found and initially corrected by [Daniel
Heim](https://github.com/dghelm).
@branch-rebase-bot branch-rebase-bot Bot force-pushed the develop-v2.0.0-rc.1 branch from dada909 to e4aab20 Compare April 24, 2026 19:54
@github-actions
Copy link
Copy Markdown

group app.proof_time_ms app.cycles leaf.proof_time_ms
fibonacci 3,880 12,000,265 966
keccak 18,541 18,655,329 3,305
sha2_bench 9,094 14,793,960 1,414
regex 1,411 4,137,067 374
ecrecover 646 123,583 272
pairing 905 1,745,757 283
kitchen_sink 2,092 2,579,903 436
fibonacci_e2e 1,864 12,000,265 452
regex_e2e 864 4,137,067 192
ecrecover_e2e 554 123,583 151
pairing_e2e 658 1,745,757 153
kitchen_sink_e2e 2,218 2,579,903 427

Note: cells_used metrics omitted because CUDA tracegen does not expose unpadded trace heights.

Commit: e4aab20

Benchmark Workflow

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

run-benchmark triggers benchmark workflows on the pr run-benchmark-e2e

Projects

None yet

Development

Successfully merging this pull request may close these issues.