One SIMD, multiple lines #20
Reusing the bit mask to find multiple line endings is something I tried too. I also tried splitting the data into aligned chunks with slice::align_to, in the hope that the compiler could generate more efficient code. Something like:

```rust
const LANES: usize = 64;

#[repr(align(64))]
struct AlignedData {
    raw_data: [u8; LANES],
}

// repr(align) doesn't accept constants, so check at compile time that it
// doesn't get out of sync with LANES (assert! rather than debug_assert!
// so the check also runs outside debug builds)
const _: () = assert!(align_of::<AlignedData>() == LANES);

fn get_bitmask(aligned_data: &AlignedData) -> u64 {
    // ...
}

fn one(map: &[u8]) {
    let (start, aligned, end) = unsafe { map.align_to::<AlignedData>() };
    for byte in start {
        // ...
    }
    for chunk in aligned {
        let mut mask = get_bitmask(chunk);
        // ...
    }
    for byte in end {
        // ...
    }
}
```

I didn't test what effect the aligned data had on performance by itself, only combined with the multiple line endings per bit mask, but with both features I got a small speedup like you did.

Whether 64 vs 32 lanes makes a difference to performance probably depends on whether you have AVX-512 (specifically AVX512BW) support. With 256-bit registers (AVX2) your CPU can only process 256/8 = 32 bytes at a time in parallel, but if 64 lanes still provides a speedup on non-AVX-512 CPUs, then perhaps the code the compiler generates to split the work into two 32-byte operations and combine the results is more efficient than the manually written code.

The only source I could find for the percentage of CPUs with AVX2 vs AVX-512 support was the Steam November 2025 hardware & software survey, which gives about 95% AVX2 support and about 20% AVX-512 support (AVX512BW isn't listed, but since it's part of the x86-64-v4 microarchitecture level, https://en.wikipedia.org/wiki/X86-64#Microarchitecture_levels, I assume it's supported on the hardware that supports AVX512F/AVX512CD etc.).
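The `get_bitmask` body was elided above; a minimal portable sketch of what it could do (assuming `b'\n'` is the only line ending, and relying on rustc/LLVM to autovectorize the comparison loop into SIMD compares plus a movemask-style reduction) might be:

```rust
const LANES: usize = 64;

#[repr(align(64))]
struct AlignedData {
    raw_data: [u8; LANES],
}

// Set bit i of the result when byte i of the chunk is a newline.
// Written as a plain loop; on x86-64 the compiler typically lowers
// this to vectorized byte compares and a bitmask extraction.
fn get_bitmask(aligned_data: &AlignedData) -> u64 {
    let mut mask = 0u64;
    for (i, &b) in aligned_data.raw_data.iter().enumerate() {
        mask |= ((b == b'\n') as u64) << i;
    }
    mask
}

fn main() {
    let mut data = AlignedData { raw_data: [b'a'; LANES] };
    data.raw_data[5] = b'\n';
    data.raw_data[63] = b'\n';
    let mask = get_bitmask(&data);
    assert_eq!(mask, (1u64 << 5) | (1u64 << 63));
    println!("{mask:#x}"); // prints 0x8000000000000020
}
```

Whether this matches the hand-written intrinsics version in speed would need benchmarking, but it keeps the code safe and portable.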
I also plan to try the aligned data; good to know you tried it and it works. :) I also tried matching against both the line break and the semicolon in one function, then deriving a separator mask with some bitwise ops. Since the input strictly follows the pattern name, temp, name, temp, that idea also improved the bench by about 5-10%.
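A sketch of that combined separator mask (hypothetical `separator_mask` name; assuming `;` and `\n` are the only separators, as in the name;temp input format) could be:

```rust
// Bit i is set when byte i is a field separator (';') or a line ending ('\n').
// One pass over the chunk produces both kinds of boundaries; because the
// input alternates name;temp\nname;temp\n, the set bits alternate between
// the two separator kinds, so no second scan is needed to tell them apart.
fn separator_mask(chunk: &[u8; 64]) -> u64 {
    let mut mask = 0u64;
    for (i, &b) in chunk.iter().enumerate() {
        mask |= ((b == b';' || b == b'\n') as u64) << i;
    }
    mask
}

fn main() {
    let mut chunk = [b'x'; 64];
    chunk[..13].copy_from_slice(b"Hamburg;12.0\n");
    let mask = separator_mask(&chunk);
    // ';' at index 7, '\n' at index 12
    assert_eq!(mask, (1u64 << 7) | (1u64 << 12));
    println!("{mask:#b}");
}
```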
This modification optimizes find_newline, based on the observation that one SIMD comparison actually finds multiple newlines at once. Previously only one was processed and the others were discarded, so the same data was loaded into SIMD registers again for the next search.
I optimized by consuming every newline position from a bit mask before moving on to the next chunk.
The performance gain is around 5-10% on my box.
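The usual way to consume every line ending from one mask (a sketch, independent of how the mask is produced) is to peel off the lowest set bit with `trailing_zeros` until the mask is empty:

```rust
// Return the position of every set bit in `mask`, lowest first.
// Each set bit corresponds to one newline within a 64-byte chunk,
// so a single SIMD compare serves several lines instead of being
// repeated for each line.
fn newline_positions(mut mask: u64) -> Vec<u32> {
    let mut positions = Vec::new();
    while mask != 0 {
        positions.push(mask.trailing_zeros()); // index of lowest set bit
        mask &= mask - 1;                      // clear lowest set bit
    }
    positions
}

fn main() {
    // Bits set at 3, 10 and 40: e.g. three newlines found in one chunk.
    let mask = (1u64 << 3) | (1 << 10) | (1 << 40);
    assert_eq!(newline_positions(mask), vec![3, 10, 40]);
    println!("{:?}", newline_positions(mask)); // [3, 10, 40]
}
```

In a real parser you would likely iterate the bits in place rather than collect them into a `Vec`, but the bit-clearing loop is the core of processing multiple newlines per mask.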