Skip to content

strided_perm::copy_into is 4x slower than Julia permutedims! for high-rank binary tensors #114

@shinaoka

Description

@shinaoka

Summary

strided_perm::copy_into is 4x slower than Julia's permutedims! when copying a 24-dimensional binary tensor (all dims = 2, 16M elements) with a scattered permutation.

Benchmark (AMD EPYC 7713P, 1T)

Reproducing step 408 of tensornetwork_permutation_light_415 — the single most expensive step (79% of total time).

Operation Rust (strided_perm) Julia (permutedims!)
Copy B (16M f64, scattered 24-dim perm) 236 ms 58 ms

The tensor has 24 dims of size 2 with a fully scattered permutation:

dims:    [2, 2, 2, ..., 2]  (24 dims)
strides: [16, 1024, 8388608, 4096, 1048576, 1, 8, 131072, ...]

Reproduction

See micro_bench/step408_bench.rs and micro_bench/step408_fair.jl in the benchmark suite.

Approach: Adopt Julia permutedims! strategy

Julia's Base.permutedims! uses a recursive dimension-stripping approach (see multidimensional.jl). The current strided_perm uses an HPTT-inspired blocked copy with 4×4 micro-kernel, which is optimized for large dims but has high overhead for rank-24 tensors where every dim is size 2.

Proposed fix: Add a specialized path in strided_perm::copy_into for high-rank, small-dim tensors that mirrors Julia's recursive approach — strip one dimension at a time, recurse on the remaining, with cache-friendly traversal order.

Context

This is the dominant bottleneck for tensor network contractions with many binary indices. Step 408 alone accounts for 79% of total contraction time. Related: #115, #116.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions