
Swap rayon for orx-parallel #1413

Closed

TechnoPorg wants to merge 5 commits into wild-linker:main from TechnoPorg:no-rayon

Conversation

@TechnoPorg (Contributor) commented Dec 31, 2025

As discussed in #1072, exploring alternatives to Rayon has the potential to improve performance. Of the options proposed in that discussion, orx-parallel seemed to me like the best fit (although there may be other considerations that I've neglected).

I haven't finished porting everything yet, in particular the graph algorithms and some parallel iterator cases that orx-parallel doesn't currently support, like enumeration. However, I'm already getting some pretty good numbers:

poop "/tmp/wild/ripgrep/6/run-with wild" "/tmp/wild/ripgrep/6/run-with /home/maya/Documents/Programming/wild/target/release/wild"
Benchmark 1 (124 runs): /tmp/wild/ripgrep/6/run-with wild
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          36.1ms ± 3.49ms    15.2ms … 41.9ms          6 ( 5%)        0%
  peak_rss           5.09MB ±  153KB    4.60MB … 5.61MB         30 (24%)        0%
  cpu_cycles          125M  ± 8.43M      107M  …  144M           0 ( 0%)        0%
  instructions        105M  ± 9.17M     80.0M  …  122M           4 ( 3%)        0%
  cache_references   1.58M  ± 68.7K     1.38M  … 1.73M           2 ( 2%)        0%
  cache_misses        142K  ± 24.3K     86.4K  …  208K           1 ( 1%)        0%
  branch_misses       258K  ± 18.7K      212K  …  303K           0 ( 0%)        0%
Benchmark 2 (130 runs): /tmp/wild/ripgrep/6/run-with /home/maya/Documents/Programming/wild/target/release/wild
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          34.9ms ± 4.18ms    18.2ms … 40.8ms         15 (12%)          -  3.3% ±  2.6%
  peak_rss           5.22MB ±  234KB    4.58MB … 5.68MB          1 ( 1%)        💩+  2.4% ±  1.0%
  cpu_cycles         78.0M  ± 4.81M     65.6M  … 90.2M           0 ( 0%)        ⚡- 37.6% ±  1.3%
  instructions       77.2M  ± 8.75M     48.8M  … 95.6M           5 ( 4%)        ⚡- 26.2% ±  2.1%
  cache_references   1.17M  ± 64.5K      966K  … 1.32M           9 ( 7%)        ⚡- 26.0% ±  1.0%
  cache_misses        112K  ± 22.6K     45.8K  …  177K           4 ( 3%)        ⚡- 21.1% ±  4.1%
  branch_misses       200K  ± 20.1K      135K  …  246K           6 ( 5%)        ⚡- 22.4% ±  1.9%

This PR also depends on some changes to orx-parallel:

Let me know whether this is a direction you'd like me to continue working in.

@TechnoPorg TechnoPorg force-pushed the no-rayon branch 2 times, most recently from a06a270 to fc66157 Compare December 31, 2025 12:55
@TechnoPorg (Contributor, Author)

I'm not sure it will be possible to replace rayon for the parallel sorting use case; a quick search hasn't turned up any other crates that provide parallel sorting.

@davidlattimore (Member)

Great to see experimentation on this! What system are you benchmarking on? I tried running some benchmarks myself, and on my systems I see a slowdown: 7% slower at linking a release build of ripgrep, and 11% slower at linking wild and zed. That's on a 16-core, 32-thread Ryzen. On my laptop with 4 cores and 8 threads, I saw a 24% slowdown. I haven't yet looked into what's slowing things down; possibly there's something that can be fixed.

Benchmark 1 (508 runs): /home/david/save/ripgrep/run-with /home/d/wild-builds/2026-01-01 --strip-debug --no-fork
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          19.6ms ± 1.98ms    18.2ms … 63.0ms          4 ( 1%)        0%
  peak_rss           66.1MB ±  685KB    64.4MB … 68.3MB          1 ( 0%)        0%
  cpu_cycles          317M  ± 9.47M      291M  …  348M           4 ( 1%)        0%
  instructions        327M  ± 7.50M      310M  …  353M           2 ( 0%)        0%
  cache_references   10.0M  ±  198K     9.42M  … 10.8M           3 ( 1%)        0%
  cache_misses       1.86M  ± 35.6K     1.76M  … 1.97M           4 ( 1%)        0%
  branch_misses       853K  ± 13.3K      815K  …  909K           5 ( 1%)        0%
Benchmark 2 (473 runs): /home/david/save/ripgrep/run-with target/release/wild --strip-debug --no-fork
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          21.1ms ±  479us    20.0ms … 22.7ms         10 ( 2%)        💩+  7.6% ±  0.9%
  peak_rss           68.9MB ±  727KB    67.0MB … 71.0MB          4 ( 1%)        💩+  4.2% ±  0.1%
  cpu_cycles          209M  ± 7.68M      186M  …  235M           6 ( 1%)        ⚡- 33.9% ±  0.3%
  instructions        264M  ± 5.57M      250M  …  290M           5 ( 1%)        ⚡- 19.3% ±  0.3%
  cache_references   8.55M  ±  172K     8.07M  … 9.09M           3 ( 1%)        ⚡- 14.8% ±  0.2%
  cache_misses       1.61M  ± 28.7K     1.52M  … 1.70M           3 ( 1%)        ⚡- 13.5% ±  0.2%
  branch_misses       770K  ± 11.7K      740K  …  810K           5 ( 1%)        ⚡-  9.8% ±  0.2%

CPU cycles and instruction counts are down, so my guess is that there's less parallelism happening somewhere.

@TechnoPorg (Contributor, Author)

Hmm, that's very interesting. I'm running my benchmarks on an i9 with 24 cores and 32 threads.

I see a 3-4% speedup when linking wild, but a 5% slowdown when linking Zed.

@TechnoPorg (Contributor, Author)

@davidlattimore I've now collected a fair bit more data on this PR, and I'd appreciate your insight into it, since I'm not as familiar with the ins and outs of wild's performance. For now, I've gotten perfetto support working with orx-parallel by using Rayon's thread pool with it, which I thought would be best for enabling a one-to-one comparison.

All the data below is from linking a debug build of zed. It looks like the lion's share of the slowdown is happening when writing the output file.

wild f0650fc --time --no-fork:

Details
└─    1.56 Activate thread pool
┌───   17.96 Open input files
│ ┌───    0.00 Process linker scripts
│ ├───    0.69 Group files
│ ├───   36.48 Read symbols
│ ├───   16.48 Populate symbol map
├─┴─   54.80 Load inputs into symbol DB
├───   40.40 Resolve symbols
├───    1.56 Resolve alternative symbol definitions
│ ┌───   44.61 Resolve sections
│ ├───    5.25 Assign section IDs
│ ├───    0.03 Canonicalise undefined symbols
├─┴─   49.93 Section resolution
│ ┌───    0.25 Build SONAME index
└─   46.08 Merge strings
└─  142.73 Find required sections
│ ├───    0.14 Finalise copy relocations
│ ├───    0.02 Merge dynamic symbol definitions
│ ├───    0.24 Merge GNU property notes
│ ├───    0.25 Merge e_flags
│ ├───    0.15 Merge .riscv.attributes sections
│ ├───   26.02 Finalise per-object sizes
│ ├───    0.17 Apply non-addressable indexes
│ ├───    0.04 Propagate section attributes
│ ├───    0.01 Compute output order
│ ├───    0.09 Compute total section sizes
│ ├───    0.00 Compute segment layouts
│ ├───    0.00 Compute per-alignment offsets
│ ├───    0.12 Compute per-group start offsets
│ ├───    0.00 Compute merged string section start addresses
│ ├───   36.87 Assign symbol addresses
│ ├───    0.00 Update dynamic symbol resolutions
├─┴─  210.14 Layout
│ ┌───    0.00 Wait for output file creation
│ │ ┌───    0.27 Split output buffers by group
│ ├─┴─ 1038.32 Write data to file
│ ├───    7.46 Sort .eh_frame_hdr
│ ├───   34.72 Compute build ID
│ ├───   18.33 Unmap output file
├─┴─ 1098.98 Write output file
│ ┌───    0.43 Verify inputs unchanged
│ ├───    8.08 Drop layout
│ ├───   34.07 Drop inputs
├─┴─   42.63 Shutdown
└─ 1516.49 Link

wild e1c79f2 --time --no-fork:

Details
└─    1.29 Activate thread pool
┌───   16.31 Open input files
│ ┌───    0.00 Process linker scripts
│ ├───    0.78 Group files
│ ├───   38.93 Read symbols
│ ├───   15.87 Populate symbol map
├─┴─   56.57 Load inputs into symbol DB
├───   43.33 Resolve symbols
├───    1.60 Resolve alternative symbol definitions
│ ┌───   43.27 Resolve sections
│ ├───    5.62 Assign section IDs
│ ├───    0.03 Canonicalise undefined symbols
├─┴─   48.99 Section resolution
│ ┌───    0.22 Build SONAME index
└─   57.64 Merge strings
└─  143.07 Find required sections
│ ├───    0.10 Finalise copy relocations
│ ├───    0.01 Merge dynamic symbol definitions
│ ├───    0.32 Merge GNU property notes
│ ├───    0.34 Merge e_flags
│ ├───    0.18 Merge .riscv.attributes sections
│ ├───   25.60 Finalise per-object sizes
│ ├───    0.17 Apply non-addressable indexes
│ ├───    0.04 Propagate section attributes
│ ├───    0.01 Compute output order
│ ├───    0.09 Compute total section sizes
│ ├───    0.00 Compute segment layouts
│ ├───    0.00 Compute per-alignment offsets
│ ├───    0.12 Compute per-group start offsets
│ ├───    0.00 Compute merged string section start addresses
│ ├───   37.17 Assign symbol addresses
│ ├───    0.00 Update dynamic symbol resolutions
├─┴─  210.17 Layout
│ ┌───    0.00 Wait for output file creation
│ │ ┌───    0.37 Split output buffers by group
│ ├─┴─  653.31 Write data to file
│ ├───    5.18 Sort .eh_frame_hdr
│ ├───   34.39 Compute build ID
│ ├───   18.68 Unmap output file
├─┴─  711.72 Write output file
│ ┌───    0.33 Verify inputs unchanged
│ ├───    7.18 Drop layout
│ ├───   30.91 Drop inputs
├─┴─   38.46 Shutdown
└─ 1127.29 Link

Perfetto traces:
Pftraces.zip

@mati865 (Member) commented Jan 15, 2026

What filesystem are you writing the outputs onto? This doesn't look like a tmpfs.

@davidlattimore (Member)

Thanks, that's very helpful for figuring out what's going on.

Here's what the end of the write phase looks like with rayon:

[image: perfetto trace of the end of the write phase with rayon]

The unit of work during the write phase is the group. Note that while some threads finish before others, the threads that keep working longer are just finishing the group they were already processing; no new groups are started after the first threads finish working.

Compare that with the orx-parallel-based write phase:

[image: perfetto trace of the end of the write phase with orx-parallel]

See how some threads start, in some cases, as many as 3 new groups after the earliest threads finish.

That suggests to me that orx-parallel isn't able to redistribute work at the granularity of a single group. Or maybe it doesn't do work redistribution at all? I can't remember which parallelism libraries do work stealing / heartbeat scheduling and which don't.

@davidlattimore (Member)

From these docs, I get the impression that there's probably a shared work queue that the workers take work from and that they take some number of work items at a time. Perhaps set the chunk_size to 1 for the write phase?

@TechnoPorg (Contributor, Author)

> What filesystem are you writing the outputs onto? This doesn't look like a tmpfs.

Whoops, I forgot the tmpfs when benchmarking, so the output ended up on btrfs instead.

@mati865 (Member) commented Jan 16, 2026

Yeah, btrfs has many strengths, but it's the worst FS for parallel writing: https://gist.github.com/mati865/7817cc637f15435f536b81f05575bb21

@TechnoPorg (Contributor, Author)

Closing for now as per the discussion on Zulip. I think this is still worth investigating out-of-tree, but right now the improvements aren't significant enough to move forward with this PR.

@TechnoPorg TechnoPorg closed this Feb 8, 2026