Conversation
(branch force-pushed from a06a270 to fc66157)
I'm not sure it will be possible to replace rayon for the parallel sorting use case, as a quick search hasn't turned up any other crates that offer an equivalent.
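To make the gap concrete: rayon's `par_sort_unstable` splits the slice and sorts pieces across the whole thread pool with work-stealing. A minimal stand-in without rayon (the function name `two_way_parallel_sort` is hypothetical, and this sketch only uses two threads, so it illustrates what would need replacing rather than being a drop-in replacement):

```rust
use std::thread;

// Hedged sketch: sort two halves on separate threads, then merge.
// rayon's `par_sort_unstable` does far better, splitting recursively
// with work-stealing across all cores.
fn two_way_parallel_sort(data: &mut Vec<u64>) {
    let mid = data.len() / 2;
    let mut right = data.split_off(mid);
    let mut left = std::mem::take(data);
    thread::scope(|s| {
        // Sort the left half on a scoped thread, the right half here.
        s.spawn(|| left.sort_unstable());
        right.sort_unstable();
    });
    // Merge the two sorted halves back into `data`.
    data.reserve(left.len() + right.len());
    let (mut i, mut j) = (0, 0);
    while i < left.len() && j < right.len() {
        if left[i] <= right[j] {
            data.push(left[i]);
            i += 1;
        } else {
            data.push(right[j]);
            j += 1;
        }
    }
    data.extend_from_slice(&left[i..]);
    data.extend_from_slice(&right[j..]);
}

fn main() {
    let mut v = vec![5u64, 3, 9, 1, 7, 2, 8, 4, 6, 0];
    two_way_parallel_sort(&mut v);
    assert!(v.windows(2).all(|w| w[0] <= w[1]));
    println!("{v:?}");
}
```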
Great to see experimentation on this! What system are you benchmarking on? I tried running some benchmarks myself, and on my systems I see a slowdown: 7% slower linking a release build of ripgrep, and 11% slower linking wild and zed. That's on a 16-core, 32-thread Ryzen. On my laptop with 4 cores and 8 threads, I saw a 24% slowdown. I haven't yet looked into what's causing it; possibly there's something that can be fixed. CPU cycles and instruction counts are down, so my guess is that there's less parallelism happening somewhere.
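When percentage differences this small are being compared across machines, repeated timed runs help separate signal from noise. A minimal sketch, assuming a Linux shell with GNU `date` (the `CMD` placeholder stands in for the actual link invocation; tools like `hyperfine` or `perf stat` give better statistics, including the cycle and instruction counts mentioned above):

```shell
# Repeat the command a few times and report wall-clock time per run.
# CMD is a placeholder; substitute the real `wild ...` command line.
CMD=${CMD:-"sleep 0.1"}
for run in 1 2 3; do
  start=$(date +%s%N)
  sh -c "$CMD" > /dev/null
  end=$(date +%s%N)
  echo "run $run: $(( (end - start) / 1000000 )) ms"
done
```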
Hmm, that's very interesting. I'm running my benchmarks on an i9 with 24 cores and 32 threads. I see a 3-4% speedup when linking wild, but a 5% slowdown when linking Zed.
@davidlattimore I've collected a fair bit more data on this PR now, and I'd appreciate your insight into it, since I'm not as familiar with the ins and outs of wild performance. For now, I've gotten perfetto support on orx-parallel working by using Rayon's thread pool with it, which I thought would be best for enabling a one-to-one comparison. All the data below is from linking a debug build of zed. It looks like the lion's share of the slowdown is happening when writing the output file.

wild f0650fc --time --no-fork: [collapsed details]
wild e1c79f2 --time --no-fork: [collapsed details]
Perfetto traces:
What filesystem are you writing the outputs onto? This doesn't look like a tmpfs.
From these docs, I get the impression that there's probably a shared work queue that the workers take work from, and that they take some number of work items at a time. Perhaps set the […]
Whoops, I forgot to use the tmpfs when benchmarking, so the output ended up on btrfs instead.
Yeah, btrfs has many strengths, but it's the worst FS for parallel writing: https://gist.github.com/mati865/7817cc637f15435f536b81f05575bb21
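A cheap guard against repeating this mistake is to check the filesystem type of the output directory before benchmarking. A sketch using GNU coreutils `df` (`OUT_DIR` is a placeholder for wherever the linker writes its output):

```shell
# Warn when the benchmark output directory is not on tmpfs.
OUT_DIR=${OUT_DIR:-/tmp}
FSTYPE=$(df --output=fstype "$OUT_DIR" | tail -n 1)
echo "output dir $OUT_DIR is on: $FSTYPE"
if [ "$FSTYPE" != "tmpfs" ]; then
  echo "warning: $OUT_DIR is on $FSTYPE, not tmpfs; write-path timings will include filesystem effects" >&2
fi
```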
Closing for now as per the discussion on Zulip. I think this is still worth investigating out-of-tree, but right now the improvements aren't significant enough to move forward with this PR. |


As discussed in #1072, exploring Rayon alternatives has the potential to help performance. Of the options proposed in that discussion, orx-parallel seemed to me like the best fit (although there may be other considerations that I've neglected).
I haven't finished porting everything, in particular the graph algorithms and some specific parallel iterator cases that orx-parallel doesn't support right now, like enumeration; however, I'm getting some pretty good numbers already:
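For the missing enumeration support, one possible workaround is to recover global indices from chunk offsets rather than relying on the iterator adapter. A hedged sketch using only `std::thread` (the function `indexed_lengths`, the fixed chunk count, and the example data are all illustrative, not part of this PR):

```rust
use std::thread;

// Workaround sketch for a missing parallel `enumerate`: split the
// input into chunks, process chunks on separate threads, and derive
// each item's global index from its chunk's starting offset.
fn indexed_lengths(words: &[&str]) -> Vec<(usize, usize)> {
    if words.is_empty() {
        return Vec::new();
    }
    let n_chunks = 4;
    let chunk_len = words.len().div_ceil(n_chunks);
    let mut out = Vec::new();
    thread::scope(|s| {
        let handles: Vec<_> = words
            .chunks(chunk_len)
            .enumerate()
            .map(|(ci, chunk)| {
                s.spawn(move || {
                    chunk
                        .iter()
                        .enumerate()
                        // Global index = chunk offset + local index.
                        .map(|(i, w)| (ci * chunk_len + i, w.len()))
                        .collect::<Vec<_>>()
                })
            })
            .collect();
        // Joining in chunk order keeps the output in input order.
        for h in handles {
            out.extend(h.join().unwrap());
        }
    });
    out
}

fn main() {
    let words = ["alpha", "be", "c"];
    println!("{:?}", indexed_lengths(&words));
}
```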
This PR also depends on some changes to orx-parallel:
Let me know whether this is a direction you'd like me to continue working in.