Consolidate vmp/svp traits, improve CRT^-1 by ~x2 #135
Merged
- **poulpy-hal**: remove the `VmpApplyDftToDftAdd` and `SvpApplyDftToDftAdd` traits; merge the additive variant into `VmpApplyDftToDft`/`SvpApplyDftToDft` via a new `limb_offset` parameter (see the trait sketch below). These traits accumulated VMP results directly into a scattered output buffer, causing severe cache misses; writing into a contiguous temporary buffer and folding with `VecZnxDftAddInplace` is ~2× faster. Also remove the corresponding impl traits (`VmpApplyDftToDftAddImpl`, `VmpApplyDftToDftAddTmpBytesImpl`, `SvpApplyDftToDftAddImpl`), delegates, and bench-suite plumbing.
- **poulpy-cpu-ref / poulpy-cpu-avx**: update the `vmp_apply_dft_to_dft` implementations to accept `limb_offset` directly, replacing the separate `_add` codepath.
- `arithmetic_avx.rs`: add `reduce_b_and_apply_crt`, which fuses the CRT multiply into the Barrett reduction pass using new compile-time constants `POW32_CRT` and `POW16_CRT`; apply it in `compact_all_blocks` to cut the instruction count by ~2× (see the scalar model below).
- **poulpy-core**: rework the GLWE external product (`glwe_external_product_internal`) and the GLWE keyswitching inner loops to write intermediate per-digit VMP results into a dedicated temporary buffer before accumulating with `VecZnxDftAddInplace`, avoiding scattered-write cache thrashing (see the loop sketch below); `where` bounds are updated accordingly.
- Add a `bench_suite::keyswitch::gglwe` module and a `keyswitch_glwe` criterion benchmark targeting the NTT120 backend (sketched below); remove the old FFT64-specific `keyswitch_glwe_fft64` benchmark.
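A minimal sketch of the trait consolidation, assuming simplified slice-based signatures; the real poulpy-hal traits carry backend and scratch parameters, and the merged trait keeps the `VmpApplyDftToDft` name (the `Unified` suffix below only avoids a name clash inside the sketch):

```rust
// Hypothetical, simplified shapes; the actual poulpy-hal signatures differ.

// Before: the additive variant duplicated the whole entry point.
pub trait VmpApplyDftToDft {
    fn vmp_apply_dft_to_dft(&self, res: &mut [i64], a: &[i64], pmat: &[i64]);
}
pub trait VmpApplyDftToDftAdd {
    fn vmp_apply_dft_to_dft_add(&self, res: &mut [i64], a: &[i64], pmat: &[i64]);
}

// After: a single trait; the new `limb_offset` parameter selects where in the
// output limb layout the product lands, subsuming the `_add` codepath.
pub trait VmpApplyDftToDftUnified {
    fn vmp_apply_dft_to_dft(
        &self,
        res: &mut [i64],
        a: &[i64],
        pmat: &[i64],
        limb_offset: usize,
    );
}
```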
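A sketch of the reworked poulpy-core inner loop, with invented buffer layouts and a stubbed backend call; only the write-to-scratch-then-fold pattern is the point:

```rust
// Invented function and buffer shapes; the real code operates on VecZnxDft
// values and calls the backend through the traits sketched above.

/// Stand-in for the backend VMP apply: writes (rather than accumulates) the
/// per-digit product into `res`, honoring the new `limb_offset` convention.
fn vmp_apply_dft_to_dft(res: &mut [i64], a: &[i64], _pmat: &[i64], _limb_offset: usize) {
    res.copy_from_slice(a); // real work elided
}

/// Inner-loop shape after the change: each digit's VMP result goes into a
/// contiguous scratch buffer, then is folded into the accumulator with one
/// streaming pass (mirroring VecZnxDftAddInplace) instead of being
/// accumulated straight into the scattered output layout.
fn external_product_digits(
    out_dft: &mut [i64],
    a_digits: &[Vec<i64>],
    pmat: &[i64],
    tmp_dft: &mut [i64],
) {
    for (i, digit) in a_digits.iter().enumerate() {
        vmp_apply_dft_to_dft(tmp_dft, digit, pmat, i);
        for (o, &t) in out_dft.iter_mut().zip(tmp_dft.iter()) {
            *o = o.wrapping_add(t); // linear fold, cache-friendly
        }
    }
}
```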
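A scalar model of the CRT fusion, assuming Shoup/Barrett-style constant multiplication; the modulus and helper names here are placeholders, not the real NTT120 primes or the actual `POW32_CRT`/`POW16_CRT` values, and the real code is vectorized AVX in `arithmetic_avx.rs`:

```rust
const Q: u64 = (1 << 61) - 1; // placeholder prime, not an actual NTT120 modulus

/// Shoup-style multiply-by-constant with reduction: returns a * c mod Q,
/// given the precomputed `c_shoup = floor(c * 2^64 / Q)`.
fn mul_const_mod(a: u64, c: u64, c_shoup: u64) -> u64 {
    let q_hat = ((a as u128 * c_shoup as u128) >> 64) as u64;
    let r = a.wrapping_mul(c).wrapping_sub(q_hat.wrapping_mul(Q));
    if r >= Q { r - Q } else { r }
}

/// Precompute the Shoup companion constant (done at compile time in practice).
fn shoup(c: u64) -> u64 {
    (((c as u128) << 64) / (Q as u128)) as u64
}

/// Unfused: a Barrett-style reduction pass, then a separate CRT multiply.
fn reduce_then_crt(b: u64, crt_c: u64) -> u64 {
    let b_red = mul_const_mod(b, 1, shoup(1)); // pure reduction of b
    mul_const_mod(b_red, crt_c, shoup(crt_c))
}

/// Fused: folding the CRT constant into the reduction multiplier reduces b
/// and applies the CRT factor in one pass -- the shape of
/// `reduce_b_and_apply_crt`, at roughly half the instruction count.
fn reduce_b_and_apply_crt(b: u64, crt_c: u64) -> u64 {
    mul_const_mod(b, crt_c, shoup(crt_c))
}

fn main() {
    let (b, crt_c) = (0xDEAD_BEEF_CAFE_F00D_u64, 123_456_789_u64);
    assert_eq!(reduce_then_crt(b, crt_c), reduce_b_and_apply_crt(b, crt_c));
}
```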
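A sketch of what the new criterion target could look like; the keyswitch body is a stub, since wiring up the real NTT120 backend types is out of scope here:

```rust
use criterion::{criterion_group, criterion_main, Criterion};
use std::hint::black_box;

/// Placeholder for the real GLWE keyswitch on the NTT120 backend.
fn keyswitch_stub(ct: &mut [i64], ksk: &[i64]) {
    for (c, k) in ct.iter_mut().zip(ksk) {
        *c = c.wrapping_add(*k);
    }
}

fn bench_keyswitch_glwe(c: &mut Criterion) {
    let mut ct = vec![1i64; 1 << 12];
    let ksk = vec![3i64; 1 << 12];
    c.bench_function("keyswitch_glwe_ntt120", |b| {
        b.iter(|| keyswitch_stub(black_box(&mut ct), black_box(&ksk)))
    });
}

criterion_group!(benches, bench_keyswitch_glwe);
criterion_main!(benches);
```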