
Port tensor4all-aci elementwise API and benchmarks #512

Merged
shinaoka merged 32 commits into main from codex/aci-rust-api
May 22, 2026

@shinaoka shinaoka commented May 22, 2026

Summary

  • Port tensor4all-aci as the Rust counterpart of the elementwise ACI for tensor trains in AlternatingCrossInterpolation.jl.
  • Add the public ACI API surface: elementwise, elementwise_batched, ElementwiseBatch, AciOptions, AciResult, typed errors, README examples, and doctested usage.
  • Wire ACI local updates through tensor4all-simplett, tensor4all-tcicore, and tensor4all-tensorbackend, including deterministic tests against dense oracles.
  • Add Rust/Julia benchmark runners and saved benchmark results for elementwise ACI, local-step bucket timing, MatrixLUCI, and standalone MatrixLU.
  • Align the Rust convergence check with Julia's convergencecriterion shape and reject min_iters = 0 as an invalid option.
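The `min_iters = 0` rejection in the last bullet can be pictured as an up-front validation step that returns a typed error. The following is only a minimal sketch under stated assumptions: `SketchOptions`, `SketchOptionsError`, and `validate` are hypothetical names invented for illustration and are not the actual `AciOptions` API, whose fields and error variants are not shown in this PR description.

```rust
use std::fmt;

/// Hypothetical stand-in for an options type (NOT the crate's `AciOptions`).
/// Only the `min_iters` name is taken from the PR text.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SketchOptions {
    pub min_iters: usize,
}

/// Hypothetical typed error; the real crate's error variants may differ.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum SketchOptionsError {
    /// `min_iters` was zero, which the PR describes as an invalid option.
    ZeroMinIters,
}

impl fmt::Display for SketchOptionsError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            SketchOptionsError::ZeroMinIters => write!(f, "min_iters must be at least 1"),
        }
    }
}

impl std::error::Error for SketchOptionsError {}

impl SketchOptions {
    /// Check the options once, before any sweep is started.
    pub fn validate(&self) -> Result<(), SketchOptionsError> {
        if self.min_iters == 0 {
            return Err(SketchOptionsError::ZeroMinIters);
        }
        Ok(())
    }
}
```

Validating at the API boundary like this keeps the iteration loop free of special cases for a zero minimum; whether the crate performs the check at construction time or at call time is not stated here.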

Performance Work Included

  • Reduce Matrix <-> tenferro TypedTensor conversion overhead with owned matrix and batched-GEMM paths.
  • Add an owned MatrixLUCI factorization path for ACI local matrices.
  • Optimize rrLU hot loops over prevalidated column-major slices, keeping unchecked indexing localized in small helpers.
  • Store the durable dense-loop lesson in REPOSITORY_RULES.md: validate ranges once, then iterate over column-major slices in hot paths.
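The "validate ranges once, then iterate over column-major slices" rule can be illustrated with a rank-1 update of the kind that appears inside LU-style eliminations. This is a hedged sketch, not code from the PR: `rank1_update_col_major` is a hypothetical helper, and it uses safe slice iterators rather than the localized unchecked indexing the PR mentions.

```rust
/// Hypothetical helper (not from the PR): subtract the outer product
/// `col_factor * row_factor^T` from a column-major `rows x cols` matrix.
pub fn rank1_update_col_major(
    data: &mut [f64],
    rows: usize,
    cols: usize,
    col_factor: &[f64],
    row_factor: &[f64],
) -> Result<(), String> {
    // Validate every length exactly once, before entering the hot loop.
    let expected = rows
        .checked_mul(cols)
        .ok_or_else(|| "rows * cols overflows usize".to_string())?;
    if data.len() != expected {
        return Err(format!("expected {} entries, got {}", expected, data.len()));
    }
    if col_factor.len() != rows || row_factor.len() != cols {
        return Err("factor lengths do not match the matrix shape".to_string());
    }
    if rows == 0 {
        // Nothing to update; also avoids a zero chunk size below.
        return Ok(());
    }

    // Each chunk of `rows` contiguous entries is one column in column-major
    // layout, so the inner loop walks memory sequentially.
    for (column, &scale) in data.chunks_exact_mut(rows).zip(row_factor) {
        for (entry, &factor) in column.iter_mut().zip(col_factor) {
            *entry -= factor * scale;
        }
    }
    Ok(())
}
```

Because the shapes are checked before the loops and the loops are driven by zipped iterators, the inner body contains no per-element index arithmetic, which is the property the repository rule is after.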

Benchmark Snapshot

L=16 fixed-sweep local-step medians, Rust built with tenferro-system-blas against Homebrew OpenBLAS:

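| chi | Rust total (ms) | Julia total (ms) | Rust MatrixLUCI (ms) | Julia MatrixLUCI (ms) | rank |
|-----|-----------------|------------------|----------------------|-----------------------|------|
| 16  | 1.841535        | 2.115942         | 1.259810             | 1.233481              | 33   |
| 32  | 3.330730        | 4.085233         | 2.209271             | 2.414964              | 46   |
| 64  | 7.975667        | 9.584417         | 4.708000             | 5.213416              | 63   |
| 128 | 15.889415       | 17.230575        | 7.987332             | 8.276169              | 76   |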

Saved details are in benchmarks/results/2026-05-22-aci-local-step-l16-openblas.md and docs/plans/2026-05-22-aci-rust-api-handoff.md.
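One way to read the benchmark figures above is as Julia-over-Rust time ratios, where a value above 1 means the Rust path was faster for that row. The sketch below only restates the numbers from the snapshot; the `Row` type and the `julia_over_rust` helper are hypothetical names introduced for illustration and are not part of the benchmark runners.

```rust
/// One row of the L=16 benchmark snapshot (all times in milliseconds).
pub struct Row {
    pub chi: u32,
    pub rust_total: f64,
    pub julia_total: f64,
    pub rust_luci: f64,
    pub julia_luci: f64,
}

/// Figures copied verbatim from the benchmark snapshot in this PR.
pub const ROWS: [Row; 4] = [
    Row { chi: 16, rust_total: 1.841535, julia_total: 2.115942, rust_luci: 1.259810, julia_luci: 1.233481 },
    Row { chi: 32, rust_total: 3.330730, julia_total: 4.085233, rust_luci: 2.209271, julia_luci: 2.414964 },
    Row { chi: 64, rust_total: 7.975667, julia_total: 9.584417, rust_luci: 4.708000, julia_luci: 5.213416 },
    Row { chi: 128, rust_total: 15.889415, julia_total: 17.230575, rust_luci: 7.987332, julia_luci: 8.276169 },
];

/// Ratio of Julia time to Rust time; above 1.0 means Rust was faster.
pub fn julia_over_rust(julia_ms: f64, rust_ms: f64) -> f64 {
    julia_ms / rust_ms
}
```

Read this way, the total local-step ratio is above 1 in every row, while the MatrixLUCI ratio is slightly below 1 at chi = 16 and above 1 for the larger chi values.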

Validation

  • git diff --check
  • cargo fmt --all -- --check
  • cargo clippy --workspace --all-targets -- -D warnings
  • cargo test --release -p tensor4all-tensorbackend --lib
  • cargo test --release -p tensor4all-tcicore
  • cargo test --release -p tensor4all-aci
  • cargo test --doc --release -p tensor4all-tensorbackend
  • cargo test --doc --release -p tensor4all-aci
  • cargo test --release -p tensor4all-aci --no-default-features --features tenferro-system-blas with RUSTFLAGS, RUSTDOCFLAGS, and DYLD_LIBRARY_PATH pointed at Homebrew OpenBLAS

Notes

  • Standalone MatrixLU still trails Julia on the saved Hilbert microbenchmark; this is tracked separately because the ACI local MatrixLUCI bucket is now close to Julia's timings under the L=16 fixture.
  • Upstream AlternatingCrossInterpolation.jl currently lists Marc Ritter and contributors in its Project.toml/LICENSE, while the paper citation is Marc Ritter.

@shinaoka shinaoka marked this pull request as ready for review May 22, 2026 04:50
@shinaoka shinaoka enabled auto-merge May 22, 2026 04:50
@shinaoka shinaoka changed the title Optimize ACI local update performance Port tensor4all-aci elementwise API and benchmarks May 22, 2026
@shinaoka shinaoka merged commit de480ba into main May 22, 2026
6 checks passed
@shinaoka shinaoka deleted the codex/aci-rust-api branch May 22, 2026 05:40