
Swap rayon for orx-parallel #1413

Closed

TechnoPorg wants to merge 5 commits into wild-linker:main from TechnoPorg:no-rayon

Conversation

@TechnoPorg (Contributor) commented Dec 31, 2025

As discussed in #1072, exploring alternatives to Rayon has the potential to improve performance. Of the options proposed in that discussion, orx-parallel seemed to me like the best fit (although there may be other considerations that I've neglected).

I haven't finished porting everything yet, in particular the graph algorithms and some parallel iterator cases that orx-parallel doesn't currently support, like enumeration. However, I'm already getting some pretty good numbers:

poop "/tmp/wild/ripgrep/6/run-with wild" "/tmp/wild/ripgrep/6/run-with /home/maya/Documents/Programming/wild/target/release/wild"
Benchmark 1 (124 runs): /tmp/wild/ripgrep/6/run-with wild
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          36.1ms ± 3.49ms    15.2ms … 41.9ms          6 ( 5%)        0%
  peak_rss           5.09MB ±  153KB    4.60MB … 5.61MB         30 (24%)        0%
  cpu_cycles          125M  ± 8.43M      107M  …  144M           0 ( 0%)        0%
  instructions        105M  ± 9.17M     80.0M  …  122M           4 ( 3%)        0%
  cache_references   1.58M  ± 68.7K     1.38M  … 1.73M           2 ( 2%)        0%
  cache_misses        142K  ± 24.3K     86.4K  …  208K           1 ( 1%)        0%
  branch_misses       258K  ± 18.7K      212K  …  303K           0 ( 0%)        0%
Benchmark 2 (130 runs): /tmp/wild/ripgrep/6/run-with /home/maya/Documents/Programming/wild/target/release/wild
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          34.9ms ± 4.18ms    18.2ms … 40.8ms         15 (12%)          -  3.3% ±  2.6%
  peak_rss           5.22MB ±  234KB    4.58MB … 5.68MB          1 ( 1%)        💩+  2.4% ±  1.0%
  cpu_cycles         78.0M  ± 4.81M     65.6M  … 90.2M           0 ( 0%)        ⚡- 37.6% ±  1.3%
  instructions       77.2M  ± 8.75M     48.8M  … 95.6M           5 ( 4%)        ⚡- 26.2% ±  2.1%
  cache_references   1.17M  ± 64.5K      966K  … 1.32M           9 ( 7%)        ⚡- 26.0% ±  1.0%
  cache_misses        112K  ± 22.6K     45.8K  …  177K           4 ( 3%)        ⚡- 21.1% ±  4.1%
  branch_misses       200K  ± 20.1K      135K  …  246K           6 ( 5%)        ⚡- 22.4% ±  1.9%

This PR also depends on some changes to orx-parallel:

Let me know whether this is a direction you'd like me to continue working in.

@TechnoPorg TechnoPorg force-pushed the no-rayon branch 2 times, most recently from a06a270 to fc66157 Compare December 31, 2025 12:55
@TechnoPorg (Contributor, Author)

I'm not sure it will be possible to replace rayon for the parallel sorting use case; a quick search hasn't turned up any other crates that provide parallel sorting.

@davidlattimore (Member)

Great to see experimentation on this! What system are you benchmarking on? I tried running some benchmarks myself, and on my systems I see a slowdown: 7% slower at linking a release build of ripgrep, and 11% slower at linking wild and zed. That's on a 16-core, 32-thread Ryzen. On my laptop with 4 cores and 8 threads, I saw a 24% slowdown. I haven't yet looked into what's slowing things down; possibly there's something that can be fixed.

Benchmark 1 (508 runs): /home/david/save/ripgrep/run-with /home/d/wild-builds/2026-01-01 --strip-debug --no-fork
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          19.6ms ± 1.98ms    18.2ms … 63.0ms          4 ( 1%)        0%
  peak_rss           66.1MB ±  685KB    64.4MB … 68.3MB          1 ( 0%)        0%
  cpu_cycles          317M  ± 9.47M      291M  …  348M           4 ( 1%)        0%
  instructions        327M  ± 7.50M      310M  …  353M           2 ( 0%)        0%
  cache_references   10.0M  ±  198K     9.42M  … 10.8M           3 ( 1%)        0%
  cache_misses       1.86M  ± 35.6K     1.76M  … 1.97M           4 ( 1%)        0%
  branch_misses       853K  ± 13.3K      815K  …  909K           5 ( 1%)        0%
Benchmark 2 (473 runs): /home/david/save/ripgrep/run-with target/release/wild --strip-debug --no-fork
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          21.1ms ±  479us    20.0ms … 22.7ms         10 ( 2%)        💩+  7.6% ±  0.9%
  peak_rss           68.9MB ±  727KB    67.0MB … 71.0MB          4 ( 1%)        💩+  4.2% ±  0.1%
  cpu_cycles          209M  ± 7.68M      186M  …  235M           6 ( 1%)        ⚡- 33.9% ±  0.3%
  instructions        264M  ± 5.57M      250M  …  290M           5 ( 1%)        ⚡- 19.3% ±  0.3%
  cache_references   8.55M  ±  172K     8.07M  … 9.09M           3 ( 1%)        ⚡- 14.8% ±  0.2%
  cache_misses       1.61M  ± 28.7K     1.52M  … 1.70M           3 ( 1%)        ⚡- 13.5% ±  0.2%
  branch_misses       770K  ± 11.7K      740K  …  810K           5 ( 1%)        ⚡-  9.8% ±  0.2%

CPU cycles and instruction counts are down, so my guess is that there's less parallelism happening somewhere.

@TechnoPorg (Contributor, Author)

Hmm, that's very interesting. I'm running my benchmarks on an i9 with 24 cores and 32 threads.

I see a 3-4% speedup when linking wild, but a 5% slowdown when linking Zed.

@TechnoPorg (Contributor, Author)

@davidlattimore I've now collected a fair bit more data on this PR, and I'd appreciate your insight into it, since I'm not as familiar with the ins and outs of wild's performance. For now, I've gotten perfetto support working with orx-parallel by using Rayon's thread pool with it, which I thought would be best for enabling a one-to-one comparison.

All the data below is from linking a debug build of zed. It looks like the lion's share of the slowdown is happening when writing the output file.

wild f0650fc --time --no-fork:

Details
└─    1.56 Activate thread pool
┌───   17.96 Open input files
│ ┌───    0.00 Process linker scripts
│ ├───    0.69 Group files
│ ├───   36.48 Read symbols
│ ├───   16.48 Populate symbol map
├─┴─   54.80 Load inputs into symbol DB
├───   40.40 Resolve symbols
├───    1.56 Resolve alternative symbol definitions
│ ┌───   44.61 Resolve sections
│ ├───    5.25 Assign section IDs
│ ├───    0.03 Canonicalise undefined symbols
├─┴─   49.93 Section resolution
│ ┌───    0.25 Build SONAME index
└─   46.08 Merge strings
└─  142.73 Find required sections
│ ├───    0.14 Finalise copy relocations
│ ├───    0.02 Merge dynamic symbol definitions
│ ├───    0.24 Merge GNU property notes
│ ├───    0.25 Merge e_flags
│ ├───    0.15 Merge .riscv.attributes sections
│ ├───   26.02 Finalise per-object sizes
│ ├───    0.17 Apply non-addressable indexes
│ ├───    0.04 Propagate section attributes
│ ├───    0.01 Compute output order
│ ├───    0.09 Compute total section sizes
│ ├───    0.00 Compute segment layouts
│ ├───    0.00 Compute per-alignment offsets
│ ├───    0.12 Compute per-group start offsets
│ ├───    0.00 Compute merged string section start addresses
│ ├───   36.87 Assign symbol addresses
│ ├───    0.00 Update dynamic symbol resolutions
├─┴─  210.14 Layout
│ ┌───    0.00 Wait for output file creation
│ │ ┌───    0.27 Split output buffers by group
│ ├─┴─ 1038.32 Write data to file
│ ├───    7.46 Sort .eh_frame_hdr
│ ├───   34.72 Compute build ID
│ ├───   18.33 Unmap output file
├─┴─ 1098.98 Write output file
│ ┌───    0.43 Verify inputs unchanged
│ ├───    8.08 Drop layout
│ ├───   34.07 Drop inputs
├─┴─   42.63 Shutdown
└─ 1516.49 Link

wild e1c79f2 --time --no-fork:

Details
└─    1.29 Activate thread pool
┌───   16.31 Open input files
│ ┌───    0.00 Process linker scripts
│ ├───    0.78 Group files
│ ├───   38.93 Read symbols
│ ├───   15.87 Populate symbol map
├─┴─   56.57 Load inputs into symbol DB
├───   43.33 Resolve symbols
├───    1.60 Resolve alternative symbol definitions
│ ┌───   43.27 Resolve sections
│ ├───    5.62 Assign section IDs
│ ├───    0.03 Canonicalise undefined symbols
├─┴─   48.99 Section resolution
│ ┌───    0.22 Build SONAME index
└─   57.64 Merge strings
└─  143.07 Find required sections
│ ├───    0.10 Finalise copy relocations
│ ├───    0.01 Merge dynamic symbol definitions
│ ├───    0.32 Merge GNU property notes
│ ├───    0.34 Merge e_flags
│ ├───    0.18 Merge .riscv.attributes sections
│ ├───   25.60 Finalise per-object sizes
│ ├───    0.17 Apply non-addressable indexes
│ ├───    0.04 Propagate section attributes
│ ├───    0.01 Compute output order
│ ├───    0.09 Compute total section sizes
│ ├───    0.00 Compute segment layouts
│ ├───    0.00 Compute per-alignment offsets
│ ├───    0.12 Compute per-group start offsets
│ ├───    0.00 Compute merged string section start addresses
│ ├───   37.17 Assign symbol addresses
│ ├───    0.00 Update dynamic symbol resolutions
├─┴─  210.17 Layout
│ ┌───    0.00 Wait for output file creation
│ │ ┌───    0.37 Split output buffers by group
│ ├─┴─  653.31 Write data to file
│ ├───    5.18 Sort .eh_frame_hdr
│ ├───   34.39 Compute build ID
│ ├───   18.68 Unmap output file
├─┴─  711.72 Write output file
│ ┌───    0.33 Verify inputs unchanged
│ ├───    7.18 Drop layout
│ ├───   30.91 Drop inputs
├─┴─   38.46 Shutdown
└─ 1127.29 Link

Perfetto traces:
Pftraces.zip

@mati865 (Member) commented Jan 15, 2026

What filesystem are you writing the outputs onto? This doesn't look like a tmpfs.

@davidlattimore (Member)

Thanks, that's very helpful for figuring out what's going on.

Here's what the end of the write phase looks like with rayon:

[image: perfetto trace of the end of the write phase with rayon]

The unit of work during the write phase is the group. Note that while some threads finish before others, the threads that keep working longer are just finishing the group they were already processing; no new groups are started after the first threads finish working.

Compare that with the orx-parallel-based write phase:

[image: perfetto trace of the end of the write phase with orx-parallel]

See how some threads start, in some cases, as many as 3 new groups after the earliest threads finish.

That suggests to me that orx-parallel isn't able to redistribute work at the granularity of a single group. Or maybe it doesn't do work redistribution at all? I can't remember which parallelism libraries do work stealing / heartbeat scheduling and which don't.

@davidlattimore (Member)

From these docs, I get the impression that there's probably a shared work queue that the workers take work from and that they take some number of work items at a time. Perhaps set the chunk_size to 1 for the write phase?

@TechnoPorg (Contributor, Author)

> What filesystem are you writing the outputs onto? This doesn't look like a tmpfs.

Whoops, I forgot the tmpfs when benchmarking, so the output ended up on btrfs instead.

@mati865 (Member) commented Jan 16, 2026

Yeah, btrfs has many strengths, but it's the worst FS for parallel writing: https://gist.github.com/mati865/7817cc637f15435f536b81f05575bb21

@TechnoPorg (Contributor, Author)

Closing for now as per the discussion on Zulip. I think this is still worth investigating out-of-tree, but right now the improvements aren't significant enough to move forward with this PR.

@TechnoPorg TechnoPorg closed this Feb 8, 2026