Summary
strided_perm::copy_into is 4x slower than Julia's permutedims! when copying a 24-dimensional binary tensor (all dims = 2, 16M elements) with a scattered permutation.
Benchmark (AMD EPYC 7713P, 1T)
Reproducing step 408 of tensornetwork_permutation_light_415 — the single most expensive step (79% of total time).
| Operation |
Rust (strided_perm) |
Julia (permutedims!) |
| Copy B (16M f64, scattered 24-dim perm) |
236 ms |
58 ms |
The tensor has 24 dims of size 2 with a fully scattered permutation:
dims: [2, 2, 2, ..., 2] (24 dims)
strides: [16, 1024, 8388608, 4096, 1048576, 1, 8, 131072, ...]
Reproduction
See micro_bench/step408_bench.rs and micro_bench/step408_fair.jl in the benchmark suite.
Approach: Adopt Julia permutedims! strategy
Julia's Base.permutedims! uses a recursive dimension-stripping approach (see multidimensional.jl). The current strided_perm uses an HPTT-inspired blocked copy with 4×4 micro-kernel, which is optimized for large dims but has high overhead for rank-24 tensors where every dim is size 2.
Proposed fix: Add a specialized path in strided_perm::copy_into for high-rank, small-dim tensors that mirrors Julia's recursive approach — strip one dimension at a time, recurse on the remaining, with cache-friendly traversal order.
Context
This is the dominant bottleneck for tensor network contractions with many binary indices. Step 408 alone accounts for 79% of total contraction time. Related: #115, #116.
Summary
strided_perm::copy_intois 4x slower than Julia'spermutedims!when copying a 24-dimensional binary tensor (all dims = 2, 16M elements) with a scattered permutation.Benchmark (AMD EPYC 7713P, 1T)
Reproducing step 408 of
tensornetwork_permutation_light_415— the single most expensive step (79% of total time).strided_perm)permutedims!)The tensor has 24 dims of size 2 with a fully scattered permutation:
Reproduction
See
micro_bench/step408_bench.rsandmicro_bench/step408_fair.jlin the benchmark suite.Approach: Adopt Julia
permutedims!strategyJulia's
Base.permutedims!uses a recursive dimension-stripping approach (see multidimensional.jl). The currentstrided_permuses an HPTT-inspired blocked copy with 4×4 micro-kernel, which is optimized for large dims but has high overhead for rank-24 tensors where every dim is size 2.Proposed fix: Add a specialized path in
strided_perm::copy_intofor high-rank, small-dim tensors that mirrors Julia's recursive approach — strip one dimension at a time, recurse on the remaining, with cache-friendly traversal order.Context
This is the dominant bottleneck for tensor network contractions with many binary indices. Step 408 alone accounts for 79% of total contraction time. Related: #115, #116.