One simd multiple line #20

Open
shuoli84 wants to merge 2 commits into jonhoo:main from shuoli84:one-simd-multiple-line
Conversation

@shuoli84
This modification optimizes find_newline, based on the observation that one SIMD comparison actually matches multiple newlines. Previously only the first match was processed and the rest were discarded, so the same data was refilled into SIMD registers for the next search.
I optimized this by:

  1. extending the SIMD lanes to 64, which holds about 4 newlines roughly 80% of the time.
  2. then using bit shifts and count-trailing-zeros on the match mask to jump to each newline index faster.

The performance gain is around 5-10% on my box.
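The mask-reuse idea in the steps above can be sketched roughly as follows; this is not the PR's actual code, and the scalar mask construction stands in for the real SIMD compare, but it shows how trailing_zeros plus clearing the lowest set bit visits every newline from one mask without rescanning the chunk:

```rust
// Build a 64-bit mask with one bit set per '\n' in a 64-byte chunk.
// (A stand-in for a 64-lane SIMD equality compare + movemask.)
fn newline_mask(chunk: &[u8; 64]) -> u64 {
    let mut mask = 0u64;
    for (i, &b) in chunk.iter().enumerate() {
        if b == b'\n' {
            mask |= 1u64 << i;
        }
    }
    mask
}

// Consume every newline recorded in the mask, instead of only the first.
fn newline_offsets(chunk: &[u8; 64]) -> Vec<usize> {
    let mut mask = newline_mask(chunk);
    let mut offsets = Vec::new();
    while mask != 0 {
        // index of the next newline = number of trailing zero bits
        offsets.push(mask.trailing_zeros() as usize);
        // clear the lowest set bit and continue with the same mask
        mask &= mask - 1;
    }
    offsets
}
```

The key point is the while loop: the mask produced by one compare is drained bit by bit, so the chunk is loaded and compared only once no matter how many newlines it contains.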

@b-i-z

b-i-z commented Dec 14, 2025

Reusing the bit mask to find multiple line endings is something I tried too. I also tried splitting the data into aligned chunks with slice::align_to, in the hope that the compiler could generate more efficient code. Something like:

const LANES: usize = 64;
#[repr(align(64))]
struct AlignedData {
    raw_data: [u8; LANES],
}

// because repr(align) doesn't accept constants, we check at compile time
// that the alignment doesn't get out of sync with LANES
const _: () = assert!(std::mem::align_of::<AlignedData>() == LANES);

fn get_bitmask(aligned_data: &AlignedData) -> u64 {
    //...
}

fn one(map: &[u8]) {
    let (start, aligned, end) = unsafe { map.align_to::<AlignedData>() };
    for byte in start {
        //...
    }
    for chunk in aligned {
        let mut mask = get_bitmask(chunk);
        //...
    }
    for byte in end {
        //...
    }
}

I didn't test what effect the aligned data had on performance by itself, only combined with the multiple line endings per bit mask, but with both features I got a small speedup like you did.

Whether 64 vs 32 lanes makes a difference to performance probably depends on whether you have AVX-512 (specifically AVX512BW) support. With 256-bit vectors (AVX2), the CPU can only process 256/8 = 32 bytes at a time in parallel; but if 64 lanes still provides a speedup on non-AVX-512 CPUs, then perhaps the code the compiler generates to break the operation into separate 32-byte calls and combine the results is more efficient than the manually written code.

The only source I could find for the percentage of CPUs with AVX2 vs AVX-512 support was the Steam November 2025 hardware & software survey, which reports about 95% AVX2 support and about 20% AVX-512 support (AVX512BW isn't listed, but since it's part of the x86-64-v4 microarchitecture level https://en.wikipedia.org/wiki/X86-64#Microarchitecture_levels I assume it's supported on the hardware that supports AVX512F/AVX512CD etc.).

@shuoli84
Author

I also plan to try the aligned data; good to know you tried it and it works. :) I also tried matching against both the line break and the semicolon in one function, then deriving a separator mask with some bitwise ops. That mask strictly alternates name, temp, name, temp, and the idea also improved the bench by about 5-10%.
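A minimal sketch of that combined-separator idea (the function name is hypothetical and the scalar loop again stands in for the SIMD compares): match both ';' and '\n' in one pass and OR the results into a single mask whose set bits alternate between end-of-name and end-of-temp.

```rust
// Hypothetical sketch: one mask marking both ';' and '\n' in a 64-byte
// chunk. For input shaped like `name;temp\n` repeated, consecutive set
// bits alternate between the end of a name and the end of a temperature,
// so the mask can be drained in pairs.
fn separator_mask(chunk: &[u8; 64]) -> u64 {
    let mut semis = 0u64;
    let mut newlines = 0u64;
    for (i, &b) in chunk.iter().enumerate() {
        if b == b';' {
            semis |= 1u64 << i;
        } else if b == b'\n' {
            newlines |= 1u64 << i;
        }
    }
    // two compares, one combined mask to iterate over
    semis | newlines
}
```

The benefit is the same as with the newline-only mask: both delimiters are found with one pass over the chunk, and the alternation guarantee means the consumer knows which kind of separator each bit is without re-reading the bytes.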
