One simd multiple line #20

Open
shuoli84 wants to merge 2 commits into jonhoo:main from shuoli84:one-simd-multiple-line
Conversation

@shuoli84
This modification optimizes find_newline, based on the observation that one SIMD comparison actually matches multiple newlines. Previously only the first match was processed and the rest were discarded, so the same data was refilled into SIMD registers for the next search.
I optimized this by:

  1. extending the SIMD lanes to 64, which holds about 4 newlines roughly 80% of the time.
  2. then using bit shifts and count-trailing-zeros on the match mask to jump to each newline index faster.

The performance gain is around 5-10% on my box.
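The mask-reuse idea in the steps above can be sketched roughly as follows; this is not the PR's actual code, and the scalar mask construction stands in for the real SIMD compare, but it shows how trailing_zeros plus clearing the lowest set bit visits every newline from one mask without rescanning the chunk:

```rust
// Build a 64-bit mask with one bit set per '\n' in a 64-byte chunk.
// (A stand-in for a 64-lane SIMD equality compare + movemask.)
fn newline_mask(chunk: &[u8; 64]) -> u64 {
    let mut mask = 0u64;
    for (i, &b) in chunk.iter().enumerate() {
        if b == b'\n' {
            mask |= 1u64 << i;
        }
    }
    mask
}

// Consume every newline recorded in the mask, instead of only the first.
fn newline_offsets(chunk: &[u8; 64]) -> Vec<usize> {
    let mut mask = newline_mask(chunk);
    let mut offsets = Vec::new();
    while mask != 0 {
        // index of the next newline = number of trailing zero bits
        offsets.push(mask.trailing_zeros() as usize);
        // clear the lowest set bit and continue with the same mask
        mask &= mask - 1;
    }
    offsets
}
```

The key point is the while loop: the mask produced by one compare is drained bit by bit, so the chunk is loaded and compared only once no matter how many newlines it contains.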

@b-i-z

b-i-z commented Dec 14, 2025

Reusing the bit mask to find multiple line endings is something I tried too. I also tried splitting the data into aligned chunks with slice::align_to, in the hope that the compiler could generate more efficient code. Something like:

const LANES: usize = 64;
#[repr(align(64))]
struct AlignedData {
    raw_data: [u8; LANES],
}

// because repr(align) doesn't accept constants, we check at compile time
// that the alignment doesn't get out of sync with LANES
const _: () = assert!(std::mem::align_of::<AlignedData>() == LANES);

fn get_bitmask(aligned_data: &AlignedData) -> u64 {
    //...
}

fn one(map: &[u8]) {
    let (start, aligned, end) = unsafe { map.align_to::<AlignedData>() };
    for byte in start {
        //...
    }
    for chunk in aligned {
        let mut mask = get_bitmask(chunk);
        //...
    }
    for byte in end {
        //...
    }
}

I didn't test what effect the aligned data had on performance by itself, only combined with the multiple line endings per bit mask, but with both features I got a small speedup like you did.

Whether 64 vs 32 lanes makes a difference to performance probably depends on whether you have AVX-512 (specifically AVX512BW) support. With 256-bit vectors (AVX2), the CPU can only process 256/8 = 32 bytes at a time in parallel; but if 64 lanes still provides a speedup on non-AVX-512 CPUs, then perhaps the code the compiler generates to break the operation into separate 32-byte calls and combine the results is more efficient than the manually written code.

The only source I could find for the percentage of CPUs with AVX2 vs AVX-512 support was the Steam November 2025 hardware & software survey, which reports about 95% AVX2 support and about 20% AVX-512 support (AVX512BW isn't listed, but since it's part of the x86-64-v4 microarchitecture level https://en.wikipedia.org/wiki/X86-64#Microarchitecture_levels I assume it's supported on the hardware that supports AVX512F/AVX512CD etc.).

@shuoli84
Author

I also plan to try the aligned data; good to know you tried it and it works. :) I also tried matching against both the line break and the semicolon in one function, then deriving a separator mask with some bitwise ops. That mask strictly alternates name, temp, name, temp, and the idea also improved the bench by about 5-10%.
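A minimal sketch of that combined-separator idea (the function name is hypothetical and the scalar loop again stands in for the SIMD compares): match both ';' and '\n' in one pass and OR the results into a single mask whose set bits alternate between end-of-name and end-of-temp.

```rust
// Hypothetical sketch: one mask marking both ';' and '\n' in a 64-byte
// chunk. For input shaped like `name;temp\n` repeated, consecutive set
// bits alternate between the end of a name and the end of a temperature,
// so the mask can be drained in pairs.
fn separator_mask(chunk: &[u8; 64]) -> u64 {
    let mut semis = 0u64;
    let mut newlines = 0u64;
    for (i, &b) in chunk.iter().enumerate() {
        if b == b';' {
            semis |= 1u64 << i;
        } else if b == b'\n' {
            newlines |= 1u64 << i;
        }
    }
    // two compares, one combined mask to iterate over
    semis | newlines
}
```

The benefit is the same as with the newline-only mask: both delimiters are found with one pass over the chunk, and the alternation guarantee means the consumer knows which kind of separator each bit is without re-reading the bytes.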
