Skip to content

k256: endomorphism-aware wNAF for vartime scalar multiplication#1745

Open
42Pupusas wants to merge 4 commits intoRustCrypto:masterfrom
42Pupusas:k256/schnorr-verify-perf
Open

k256: endomorphism-aware wNAF for vartime scalar multiplication#1745
42Pupusas wants to merge 4 commits intoRustCrypto:masterfrom
42Pupusas:k256/schnorr-verify-perf

Conversation

@42Pupusas
Copy link
Copy Markdown
Contributor

Summary

Replaces the placeholder MulVartime / MulByGeneratorVartime impls
for k256::ProjectivePoint (which fell back to the constant-time path
and had TODOs to match) with a real endomorphism-aware width-5 wNAF,
then folds the combined mul_by_generator_and_mul_add_vartime into a
single shared-doublings ladder over all 4 GLV sub-scalars.

Closes #1725.

What changes

  • Commit 1 — wNAF core. GLV-decompose the scalar into two ~128-bit
    halves, compute a width-5 signed-digit NAF of each magnitude, and run
    a standard left-to-right double-and-add with precomputed odd
    multiples [P, 3P, ..., 15P]. Sign is folded into the precomputed
    points at setup.
  • Commit 2 — share doublings. Extract a small WnafSlot
    (odd-multiples table + digits) and a wnaf_ladder helper. The
    combined a*G + b*P variant runs one ladder over all 4 GLV slots
    (G + βG + P + βP), doing one double() per step instead of two
    independent ladders.
  • Commit 3 — debug_assert. Guard the fixed 130-entry wNAF digit
    buffer; the bound is currently implicit in WNAF_WIDTH = 5, this
    makes it explicit at test time.

Perf (Schnorr verify, default features, x86_64)

Stage time/verify
Master (constant-time fallback) ~62 µs
After commit 1 (wNAF) ~53 µs
After commit 2 (shared ladder) ~50 µs

~19% faster end-to-end. Also speeds up any other user of
MulVartime / MulByGeneratorVartime on the k256 curve.

Test plan

  • cargo test -p k256 --lib --features getrandom — 89 passed
  • New randomized tests for mul_vartime and
    mul_and_mul_add_vartime vs. the constant-time reference
    (32 iterations each with ProjectivePoint::generate() and
    Scalar::generate()).
  • Edge-case tests: scalar = 0, 1, −1, point = identity.
  • cargo bench -p k256 --bench schnorr -- verify on an idle host
    confirms the numbers above (criterion-reported change is stable
    across runs).

Notes

  • Not constant time; SECURITY: comments are on the two vartime impls.
    Only reachable via the MulVartime / MulByGeneratorVartime traits
    that callers opt into for non-secret scalars.
  • Briefly explored several further optimizations (batched-affine
    odd-multiples via Montgomery's trick; static precomputed G tables
    with mixed-add; wider window for the G side). The first two
    regressed perf at this width/scalar size; the last gave ~4% more but
    added a ~6 KB static, const-generics, and a new LazyLock path —
    not worth the complexity for a single-curve specialization. This PR
    sticks to the change that's pure upside.

42Pupusas and others added 4 commits April 23, 2026 18:56
Replaces the placeholder MulVartime / MulByGeneratorVartime impls (which
just called the constant-time path and had TODOs to match) with a width-5
wNAF that uses the GLV endomorphism to split each scalar into two
~128-bit halves.

Schnorr verify: ~62 µs -> ~53 µs (14% faster, no precomputed-tables;
~55 µs with tables). Addresses RustCrypto#1725.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Folds the combined `mul_by_generator_and_mul_add_vartime` into a single
wNAF ladder over all 4 GLV sub-scalars (s1, s2 for G and the
endomorphism; e1, e2 for P and the endomorphism). One `double()` per
step instead of two independent ladders.

Factors out a small `WnafSlot` (odd-multiples table + digits) and a
`wnaf_ladder` helper so the single-point `mul_vartime` and the combined
op share the same loop body.

Schnorr verify: ~53 µs -> ~50 µs (no precomputed-tables; ~51 µs with
tables). Total vs. pre-wNAF baseline: ~62 µs -> ~50 µs (~19% faster).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`wnaf_128` writes into a fixed 130-entry buffer; the bound holds for the
current `WNAF_WIDTH = 5` and the ≤128-bit GLV sub-scalars, but it's
implicit. Add a `debug_assert!` in the loop so that any future change to
`WNAF_WIDTH` that invalidates the bound is caught at test time rather
than silently writing out of bounds in worst-case inputs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`wnaf_128` tracked the residual scalar in two u64 limbs, but a negative
recentered digit adds up to 2^(W-1) − 1 to the value, which can legitimately
overflow past bit 127 when the input is close to 2^128 − 1. The old code
let `hi.wrapping_add(1)` silently wrap, losing the carried bit and
producing a NAF that reconstructs to the wrong value.

The GLV decomposition's `(r1, r2)` each have magnitude strictly less than
2^128, so values in the carry-out window are possible (though
vanishingly rare in random scalars — which is why the existing
randomized tests never caught it).

Fix by carrying the overflow bit into a third limb `top` that is absorbed
back on the next right-shift. Perf impact is in the noise: the `top`
branch is almost never taken and the predictor handles it cleanly.

Add two regression tests:

- `test_wnaf_128_reconstruction_adversarial` — reconstructs the NAF of a
  scalar with low 128 bits = 0xFF..FF and asserts it equals 2^128 − 1.
- `test_mul_vartime_adversarial_scalars` — end-to-end check that
  `mul_vartime(P, k)` matches the constant-time reference when `k`'s low
  128 bits trigger the carry window.

Also add a `debug_assert!` on `idx` in `WnafSlot::apply` to guard the
parallel invariant (`idx < WNAF_TABLE_SIZE`) if `WNAF_WIDTH` is ever
widened without growing the table.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

k256: endomorphism-aware wNAF implementation

1 participant