Conversation

@Bisht13 commented Feb 7, 2026

Reduce Peak Memory During Prove Step

Problem

After adding public inputs, the prove step for complete_age_check regressed from 1.84 GB to 2.24 GB peak memory.


Changes

Memory optimization (2.24 GB → 1.92 GB, −320 MB)

  • Destructured WhirR1CSCommitment in both the single and dual witness paths to take ownership of the masked/random polynomials, enabling an explicit drop() before entering WHIR's prove_batch / prove (see the sketch after this list)
  • Split public input transcript interaction from weight vector allocation (add_public_inputs_to_transcript + build_public_weights) to defer the 64 MB allocation until after alphas are consumed
  • Dropped program and witness_generator before the prove call since they are only needed during witness generation
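
A minimal sketch of the destructure-and-drop pattern from the first bullet above. The field names, types, and prover call below are illustrative assumptions, not the crate's actual API:

```rust
// Illustrative only: the real WhirR1CSCommitment and prover signatures differ.
// The idea is to take the struct apart by value so the large polynomials can
// be freed before the WHIR prover reaches its own allocation peak.
struct WhirR1CSCommitment {
    commitment: Vec<u8>,   // small, still needed by the prover
    masked_poly: Vec<u64>, // large, only needed while committing
    random_poly: Vec<u64>, // large, only needed while committing
}

fn prove_with_early_free(c: WhirR1CSCommitment) {
    // Destructure by value so each field is owned independently.
    let WhirR1CSCommitment { commitment, masked_poly, random_poly } = c;

    // The masked/random polynomials are dead weight from here on; drop them
    // explicitly so they are not held across the prover's peak allocation.
    drop(masked_poly);
    drop(random_poly);

    whir_prove_batch(&commitment); // stand-in for WHIR's prove_batch / prove
}

fn whir_prove_batch(_commitment: &[u8]) {
    // allocation-heavy prover work happens here
}
```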

Switch default allocator to jemalloc (RSS: 2.39 GB → 1.90 GB, −490 MB)

  • Added feature-gated jemalloc support to ProfilingAllocator, enabled by default (see the sketch after this list)
  • System allocator remains available via --no-default-features --features profiling-allocator
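
A hedged sketch of what a feature-gated jemalloc default typically looks like; the feature name and the tikv-jemallocator crate are assumptions here, and the PR's ProfilingAllocator wraps this idea with its own profiling hooks:

```rust
// Sketch only: crate and feature names are assumed, not taken from this PR.
// With the feature enabled (the default), every allocation goes through
// jemalloc; without it, Rust falls back to the system allocator, matching
// `--no-default-features --features profiling-allocator`.
#[cfg(feature = "jemalloc")]
#[global_allocator]
static GLOBAL: tikv_jemallocator::Jemalloc = tikv_jemallocator::Jemalloc;

fn main() {
    let buffer = vec![0u8; 1 << 20]; // 1 MiB allocation via the active allocator
    println!("allocated {} bytes", buffer.len());
}
```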

Add release-fast build profile

  • cargo build --profile release-fast (profile definition sketched below)
  • 30s build time vs 2.5min for full release
  • Uses codegen-units = 16
  • Uses lto = "thin"
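
In Cargo.toml the profile described above would look roughly like this (the `inherits = "release"` line is an assumption; the rest mirrors the bullets):

```toml
# Sketch of the release-fast profile; exact contents may differ from the PR.
[profile.release-fast]
inherits = "release"   # assumed: start from the release settings
codegen-units = 16     # more parallel codegen units for faster builds
lto = "thin"           # thin LTO instead of full LTO
```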

Benchmark (complete_age_check, 1.1M constraints)

Metric                     Before     After
Profiling peak             2.24 GB    1.92 GB
RSS (system alloc)         2.39 GB    n/a
RSS (jemalloc, default)    n/a        1.90 GB

Allocator Comparison

Allocator     RSS        Duration
System        2.39 GB    3.30s
jemalloc ✅   1.90 GB    3.81s
mimalloc      3.12 GB    3.37s

jemalloc was chosen as the default because it gives the lowest RSS.
mimalloc was evaluated and rejected: it had the worst RSS, despite a wall-clock time close to the fastest (the system allocator).


Root Cause Analysis

The remaining ~80 MB gap from the original 1.84 GB is fully accounted for by the public inputs weight vector:

  • 64 MB in the statement
  • 64 MB cloned inside WHIR's prove_batch (line 279, read-only external crate)

Before public inputs, there were 6 weights; after, 7.

This overhead is inherent to the protocol and cannot be reduced without modifying the WHIR crate or changing the proof transcript structure.

@Bisht13 requested a review from ashpect on February 7, 2026 at 10:48
ashpect commented Feb 9, 2026

@Paradox Can you please rebase this on main? The PR makes quite a few changes that break compatibility, such as:

  1. Change in proof format: transcript to narg_string, which breaks existing proofs
  2. The lazy R1CS still uses the old zstd method
  3. DomainSeparator.instance(&empty) causes a panic in debug mode (I've pushed a commit which fixes this)
  4. Gnark verifier: it still expects the old proof format ('transcript', etc., from 1)

- Destructure WhirR1CSCommitment to drop masked/random polynomials before
  WHIR prove_batch/prove, saving ~256 MB in dual-witness path
- Defer public input weight vector allocation until after alphas are consumed
- Drop program and witness_generator before prove call (~60 MB)
- Add feature-gated jemalloc as default allocator (RSS: 2.39 GB -> 1.90 GB)
- Add release-fast build profile (30s vs 2.5min)

Profiling peak: 2.24 GB -> 1.92 GB
RSS with jemalloc: 1.90 GB (complete_age_check, 1.1M constraints)

Move drop(self.program) and drop(self.witness_generator) immediately
after extracting public input indices, before the NTT-heavy commit
phase. Also drop acir_witness_idx_to_value_map right after its last
use in each branch rather than after both branches.
@Bisht13 force-pushed the px/reduce-prove-memory-jemalloc branch from b914aad to 65b12b0 on February 10, 2026 at 17:03
@Bisht13 force-pushed the px/reduce-prove-memory-jemalloc branch from 65b12b0 to 099ac78 on February 10, 2026 at 17:23
}
}

impl R1CSSolver for LazyR1CS {

The R1CSSolver impl for LazyR1CS is the same as the one for R1CS. Instead of duplicating the code, consider extracting it into a common function, a macro, etc. (toy sketch below).
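
A toy illustration of this suggestion; the trait, types, and fields below are placeholders, not the crate's real R1CSSolver API:

```rust
// Toy example: share one trait impl between an eager and a lazy R1CS type via
// a macro instead of duplicating the impl block. All names are placeholders.
trait SolverLike {
    fn num_constraints(&self) -> usize;
}

macro_rules! impl_solver_like {
    ($ty:ty) => {
        impl SolverLike for $ty {
            fn num_constraints(&self) -> usize {
                self.rows
            }
        }
    };
}

struct EagerR1cs { rows: usize }
struct LazyR1cs { rows: usize }

impl_solver_like!(EagerR1cs);
impl_solver_like!(LazyR1cs);

fn main() {
    let eager = EagerR1cs { rows: 1_100_000 };
    let lazy = LazyR1cs { rows: 1_100_000 };
    println!("{} / {}", eager.num_constraints(), lazy.num_constraints());
}
```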

}
}

fn ensure_decompressed(&self) -> &(Interner, SparseMatrix, SparseMatrix, SparseMatrix) {
@ashpect commented Feb 10, 2026

Consider using Result<&(...)> for better logging

postcard::to_allocvec(&matrices).expect("Failed to serialize R1CS matrices");
let mut compressed = Vec::new();
{
let mut encoder = XzEncoder::new(&mut compressed, 6);

In file/bin.rs the encoding uses xz level 9; it's better to have a global const set to 9 and use it here as well (sketch below).
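
A small sketch of the suggested shared constant (function and constant names are illustrative, assuming the xz2 crate used in the diff above):

```rust
use std::io::Write;
use xz2::write::XzEncoder;

/// Single source of truth for the xz preset, so file/bin.rs and the lazy
/// R1CS path cannot silently diverge. Name and location are illustrative.
pub const R1CS_XZ_LEVEL: u32 = 9;

fn compress_r1cs_blob(serialized: &[u8]) -> std::io::Result<Vec<u8>> {
    let mut compressed = Vec::new();
    {
        let mut encoder = XzEncoder::new(&mut compressed, R1CS_XZ_LEVEL);
        encoder.write_all(serialized)?;
        encoder.finish()?; // flush and finalize the xz stream
    }
    Ok(compressed)
}
```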

zeroize = "1.8.1"
xz2 = "0.1.7"



extra space

/// After the first access the decompressed matrices live in `cached`,
/// so the compressed blob is dead weight. Call this after the first
/// access to reclaim ~10 MB for a typical circuit.
pub fn free_compressed(&mut self) {

Consider adding an assertion in free_compressed() to verify the cache is populated: assert!(self.cached.get().is_some(), "Must access matrices before freeing"); (see the sketch below)
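
A toy sketch combining the snippet above with the suggested assertion; the struct and field names stand in for the real LazyR1CS internals:

```rust
use std::cell::OnceCell;

// Placeholder struct: `compressed` holds the compressed blob, `cached` stands
// in for the lazily decompressed matrices.
struct LazyBlob {
    compressed: Vec<u8>,
    cached: OnceCell<Vec<u8>>,
}

impl LazyBlob {
    /// Reclaim the compressed blob once the decompressed data is cached.
    fn free_compressed(&mut self) {
        assert!(
            self.cached.get().is_some(),
            "Must access matrices before freeing"
        );
        self.compressed = Vec::new(); // drop the ~10 MB compressed blob
    }
}

fn main() {
    let mut blob = LazyBlob { compressed: vec![0u8; 1024], cached: OnceCell::new() };
    blob.cached.set(vec![1, 2, 3]).unwrap();
    blob.free_compressed();
    assert!(blob.compressed.is_empty());
}
```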
