You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Comprehensive breakdown of time and space complexity for every function in the
library. N, M, P, K denote compile-time template parameters (polynomial degree,
matrix dimensions). da, db denote actual degrees at runtime. B = 8 × sizeof(W)
(bit width of the word type). All operations are constexpr.
1. core.h — Core Types & Primitives
2-Adic Primitives
Function
Time
Space
Notes
v2(W x)
O(1)
O(1)
std::countr_zero → single CPU instruction (tzcnt/bsf)
modinv_odd(W a)
O(log B) ≈ O(1)
O(1)
Newton/Hensel iteration: ⌈log₂(B)⌉ iterations, 2 multiplies per iteration. For uint64: 6 iterations.
exact_divide(W x, W d)
O(log B) ≈ O(1)
O(1)
1× modinv_odd + 1× multiply
div_2k_adic(U val, int k)
O(1)
O(1)
3 branches + shift + optional sign-extension mask
artin_schreier(W x)
O(1)
O(1)
x*x − x
detail::uint128_t (Software 128-bit)
Function
Time
Space
Notes
+, -, *, <<, >>, &, |, ^, ~, comparisons
O(1)
O(1)
1–2 word ops
* (multiply)
O(1)
O(1)
12 half-width multiplies via decomposed 64×64→128 schoolbook
/, % (division)
O(B) = O(128)
O(1)
Schoolbook shift-and-subtract; 128 iterations
Compile-Time Caches
Type
Constructor
Time
Space
Notes
StirlingCache<N,W>
ctor
O(N²) compile-time
O(N²)
Two N×N tables (S₂, s₁); DP recurrence; computed once per (N,W)
PascalCache<N,W>
ctor
O(N²) compile-time
O(N²)
One N×N table (binomial); computed once per (N,W)
Both are inline constexpr globals — zero runtime cost after instantiation.
Carry-chain multiply → poly_mul; NOT the coefficient-wise ring
Carry Chain
Function
Time
Space
Notes
synthetic_divide (Ring)
O(1)
O(1)
Tag dispatch, no computation
synthetic_divide (Polynomial)
O(N)
O(N) in-place
Half-word carry propagation: (I−N)⁻¹; one pass over N coefficients
carry_normalize (all)
O(1)
O(1)
Tag conversion
carry_chain(W* r, Accum* v, int N)
O(N)
O(N)
Single pass: r[i] = low(v[i]+carry), carry = high(v[i]+carry)
carry_chain_word(W hi, W lo)
O(1)
O(1)
Single-word carry
adc / add_overflow / mul_overflow
O(1)
O(1)
Compiler builtins or manual compare
Axiom Operations
Function
Time
Space
Notes
formal_derivative (Monomial)
O(N)
O(N−1)
D(xⁿ) = n·xⁿ⁻¹; single loop, coefficient scaling
forward_difference (Monomial)
O(N²)
O(N−1)
Double nested loop: Σ_{i,j} p[j]·C(j,i) using Pascal cache
exact_formal_derivative
O(N)
O(N−1)
≡ formal_derivative in monomial basis
exact_forward_difference
O(N²)
O(N−1)
≡ forward_difference in monomial basis
Polynomial Multiplication
Function
Time
Space
Notes
poly_mul_unsaturated (detail)
O(na·nb)
O(na+nb)
quad_width accumulators (hw unsigned __int128 for uint64_t when available), #pragma GCC ivdep for auto-vectorization
poly_mul(r, a, na, b, nb)
O(na·nb)
O(1) chunked
CHUNK_COUNT = 256/sizeof(accum_t); for uint64_t with __SIZEOF_INT128__ uses hw unsigned __int128 (~2.7× speedup deg 63 vs software); single-pass + one carry_chain per chunk
poly_mul_cw
O(N·M)
O(N+M)
Coefficient-wise convolution (standard ring); zero-skip for sparsity
verify_divmod_cw
O(N·M)
O(N+M)
One poly_mul_cw + verification loop
2. basis.h — Basis Conversion
Low-Level Conversions
All conversions are triangular matrix-vector products using Stirling number caches.
Function
Time
Space
Pattern
monomial_to_falling
O(N²)
O(N)
Upper-triangular mat-vec multiply by S₂(n,k); dword_t accum
falling_to_monomial
O(N²)
O(N)
Upper-triangular mat-vec multiply by s₁(n,k) (signed, stored unsigned)
monomial_to_taylor
O(N²)
O(N)
S₂(n,k) × k!; factorial accumulated incrementally
taylor_to_monomial
O(N²)
O(N)
s₁(j,k) × p[j]/j!; parity branch: modinv_odd vs div_2k_adic+modinv_odd; quad_width accum
High-Level Dispatch
Function
Time
Space
Notes
change_basis<ToBasis>
O(N²)
O(N)
Template dispatch; FF↔Taylor routes through Monomial (2 conversions)
Basis-Specific Operations
Function
Time
Space
Notes
eval (FallingFactorial)
O(N)
O(1)
Incremental falling product: Π(x−j)
eval (Taylor)
O(N)
O(1)
Incremental binomial coefficient: (x choose i)
formal_derivative (FF)
O(N²)
O(N)
Convert → Monomial D → convert back (no direct FF derivative formula)
formal_derivative (Taylor)
O(N²)
O(N)
Convert → Monomial D → convert back
forward_difference (FF)
O(N)
O(N)
Direct: Δ(FFₙ) = n·FFₙ₋₁ — FF diagonalises Δ
forward_difference (Taylor)
O(N)
O(N)
Direct: Δ(Tₙ) = Tₙ₋₁ — Taylor nearly diagonalises Δ
3. calculus.h — Taylor Shift, Indefinite Sum, Exp/Log
Taylor Shift
Overload
Time
Space
Notes
taylor_shift (Monomial)
O(N²)
O(N)
Double nested loop using Pascal cache and δⁿ⁻ᵏ powers
taylor_shift (FallingFactorial)
O(N²)
O(N)
Uses falling factorial of delta: (δ)₍ₖ₋ᵢ₎
taylor_shift (Taylor)
O(N²)
O(N)
Converts to monomial, shifts, converts back (3× constant factor)
Indefinite Sum (Σ = Δ⁻¹)
Overload
Time
Space
Notes
indefinite_sum (FallingFactorial)
O(N)
O(N)
Σ(FFₙ₋₁) = FFₙ/n; single loop with 2-adic division per term
indefinite_sum (Monomial)
O(N²)
O(N)
Convert → FF sum → convert back
indefinite_sum (Taylor)
O(N²)
O(N)
Convert → Monomial → FF sum → Monomial → Taylor (4× constant)
Power Series Exp/Log
Function
Time
Space
Notes
poly_exp
O(N²)
O(N)
Recurrence: Eₙ = (1/n)·Σ k·Pₖ·Eₙ₋ₖ; dword_t intermediates; precondition P[1] even
Two GhostPowersFull::init + N ghost comps + ghost_recover
witt_add
O(N²)
O(N²)
ghost_op with std::plus
witt_mul
O(N²)
O(N²)
ghost_op with std::multiplies
adams_operation
O(N² + N·n)
O(N²)
Ghost component ^ n powering; early return for n ≤ 1
teichmueller_lift
O(N)
O(N)
Zero-fills after index 0
Witt Log/Exp/Inverse
Function
Time
Space
Notes
p_adic_log_1plus_impl
O(B)
O(1)
T = 2·B ≈ 128–256 terms; Σ (−1)ⁿ⁺¹yⁿ/n; early exit on vanishing. For W=uint64_t with __SIZEOF_INT128__ uses hardware unsigned __int128 multiply-accumulate and modinv_odd_128 (full 128-bit inverse) for ~4× speedup over software detail::uint128_t
p_adic_exp_impl
O(B)
O(1)
T ≤ 2·B terms; Σ xⁿ/n!; valuation-aware budget; early exit. Same unsigned __int128 optimization as p_adic_log
witt_log
O(N² + N·B)
O(N²)
GhostPowersFull::init + N × p_adic_log + ghost_recover; requires a[0] odd
witt_exp
O(N² + N·B)
O(N²)
GhostPowersFull::init + N × p_adic_exp + ghost_recover; requires a[0] ≡ 0 mod 4
witt_inverse
O(N²)
O(N²)
GhostPowersFull::init + N × modinv_odd on ghost + ghost_recover
Verification Helpers
Function
Time
Space
Notes
check_ghost_ring
O(N²)
O(N²)
3× ghost_vector + O(N) loop
check_witt_recovery_precision
O(N²)
O(N²)
2× ghost_vector + ghost_recover
check_witt_inverse
O(N²)
O(N²)
witt_inverse + operator* + O(N) verify
check_witt_exp_log_roundtrip
O(N² + N·B)
O(N²)
witt_log + witt_exp
5. compose.h — Composition & Reversion
Function
Time
Space
Notes
compose
O(N²·M²)
O(N·M)
Naive power accumulation: maintain Qᵏ, accumulate Pₖ·Qᵏ; result degree (N−1)(M−1); guarded ≤ 4095
Bareiss fraction-free elimination: 2×2 determinant recurrence, divides by previous pivot via dword_t; parity sign tracking
inverse() (M==N)
O(M³)
O(M²)
Gauss–Jordan on [A|I] with 2×M row width; modinv_odd row scaling
solve(b) (M==N)
O(M³)
O(M·(N+1))
Gauss–Jordan on [A|b] augmented system
11. pade.h — Padé Approximants
Function
Time
Space
Notes
pade_approximant<M,N>
O(N³ + M·N)
O(N² + M+N)
Builds N×N Toeplitz-like system from series; Gaussian elimination with odd-pivot modinv_odd; back-substitute for denominator q; convolution for numerator p. N=0 special case is O(M).
12. continued_fractions.h — Continued Fractions
Function
Time
Space
Notes
cf_expand(series, max_terms)
O(max_terms·S)
O(S)
Extract s[0], shift, repeat; S = series.size()
cf_convergent(c, n)
O(n²)
O(n)
Classical recurrence: Pₖ = cₖ·Pₖ₋₁ + Pₖ₋₂; degree grows linearly; total work Σk = O(n²)
reversion, Matrix::determinant/inverse/rref/solve/rank/is_singular, pade_approximant (dominates at O(N³)), poly_gcd_cw (worst case)
Key Design Decisions Affecting Complexity
No FFT/NTT over ℤ₂: Carry-chain multiplication is inherently O(N·M). The standard NTT requires a principal root of unity, which does not exist in ℤ/2^Wℤ for W > 1. Chunked convolution with auto-vectorization is the practical limit.
Newton vs Lagrange for reversion: The library uses classical Lagrange inversion (O(N³)) rather than Newton/Rothe–Henry (O(N log² N)). This is a deliberate choice for constexpr simplicity — Newton requires fast multiplication, which would need a sub-quadratic algorithm.
Ghost-op + Newton recovery for Witt ops: All Witt operations go through the ghost map (diagonalise the operation), then Newton-recover the Witt components. This is O(N²) but avoids per-component Hensel lifting.
Bareiss elimination for determinant: The fraction-free Bareiss algorithm avoids division by even pivots (which would fail in ℤ/2^Wℤ). The division by the previous pivot is done via dword_t → 2-adic division → modinv_odd.
Stirling/Pascal caches at compile time: Both are computed at compile time via inline constexpr globals computed once per (N,W). This avoids redundant O(N²) recomputation across multiple basis conversions.
Chunked stack buffer in poly_mul and mul_unsigned: Constant stack space O(1) regardless of polynomial size, using a CHUNK_COUNT-sized buffer (≤ 32 elements for uint64). Eliminates full-size quad_width array allocation for large products.
quad_width<W> accumulator precision: Used for ghost-map accumulation and unsaturated polynomial products. Provides 4× word width headroom, preventing overflow in intermediate sums. On uint64_t hot paths (poly_mul, mul_unsigned, div_unsigned_knuth, p-adic series), unsigned __int128 replaces detail::uint128_t where available for 3-4× hardware-accelerated arithmetic.
All constexpr: No runtime-only code paths. The entire library, including poly_mul, carry chains, Witt arithmetic, and division, supports compile-time evaluation.
Memory Usage Notes
Polynomial storage: Polynomial<N,W> is std::array<W,N> — exactly N × sizeof(W) bytes embedded in the object. No heap allocation.
Compile-time caches: StirlingCache<N,W> = 2 × N × N × sizeof(W) bytes in .rodata. PascalCache<N,W> = N × N × sizeof(W) bytes in .rodata. Both are inline constexpr shared across all TUs.
poly_mul / mul_unsigned chunked buffer: CHUNK_COUNT × sizeof(accum_t) bytes on stack. For uint64_t with unsigned __int128: 32 × 16 = 512 bytes. For smaller word widths: proportionally less.
Witt ghost_recover: Allocates N × N × sizeof(quad_width<W>) on the stack for the power table. For (N=32, W=uint64, quad_width=uint128_t): 32×32×16 = 16 KiB.
compose: Allocates 3 × result_degree × sizeof(W) on the stack. Result degree max 4095, so up to ~96 KiB for uint64_t.
reciprocal_newton: Allocates 4 × NR limbs on the stack, capped at 65536 bytes by static_assert.
DynamicPolynomial: Uses std::vector<W> — heap-allocated. All functions using it inherit vector allocation costs.