Conversation
noir-r1cs/src/sparse_matrix.rs
Outdated
    );

    let intermediate_multiplication =
        LockFreeArray::new(vec![(0, FieldElement::zero()); rhs.matrix.num_entries()]);
It's unfortunate that it requires allocating a temporary vector larger than the output.
noir-r1cs/src/sparse_matrix.rs
Outdated
    // wouldn't know the row a value belongs to. That's why the rows drive the
    // iterator below.
    // - Acquiring a mutex per column in the result was too expensive (even with
    //   parking_lot)
What I had in mind was splitting the work over the mutable output vector, and creating an iter_col method on the matrix (which is substantially more complicated than iter_row, but should be doable).
Can we explore this option?
recmo
left a comment
LGTM after comments addressed.
        result[j] += value * self[i];
    }

    let chunk_size = result.len().div_ceil(num_threads);
If the perf delta is small (e.g. <5%) I think I prefer a solution that lets rayon decide the chunk size. I don't like hard dependencies on the number of available threads; the thread count says little on its own, since we might be doing other work in parallel.
Instead we should pick the workload size so that it amortizes the per-chunk overhead while still allowing parallelization for large problem sizes. To approximate this I like to pick 'whatever subproblem fits in L1 cache' as the problem size, and this in turn is approximated with the workload_size::<F>(result.len()) function.
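A rough sketch of the kind of heuristic being described, assuming an L1 size of 32 KiB and a clamp to the input length (both the constant and the exact formula are assumptions; the real workload_size::<F> in the codebase may differ):

```rust
use std::mem::size_of;

// Assumed L1 data-cache size; real hardware varies (32-64 KiB is common).
const L1_CACHE_BYTES: usize = 32 * 1024;

/// Largest chunk (in elements of T) that still fits in L1 cache, so
/// per-chunk scheduling overhead is amortized while large inputs still
/// split into many independently schedulable chunks.
fn workload_size<T>(len: usize) -> usize {
    let per_chunk = L1_CACHE_BYTES / size_of::<T>().max(1);
    per_chunk.clamp(1, len.max(1))
}

fn main() {
    // With 8-byte elements, one chunk holds 32768 / 8 = 4096 elements.
    assert_eq!(workload_size::<u64>(1_000_000), 4096);
    // Small inputs collapse to a single chunk instead of splitting pointlessly.
    assert_eq!(workload_size::<u64>(100), 100);
    println!("chunk sizes look sane");
}
```

The appeal is that the chunk size depends only on the data type and problem size, not on how many threads happen to be free, which composes better when other work is running in parallel.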
    let chunk_size = result.len().div_ceil(num_threads);

    // In microbenchmarks par_iter_mut.chunks outperforms par_chunks_mut slightly.
If it is a small difference this might be a compiler quirk. I'd stick with par_chunks_mut for simplicity and maintainability.
If .par_iter_mut().chunks(..) is faster, then that is basically a bug somewhere, as par_chunks_mut provides the library/compiler with strictly more information (contiguous mutable slices rather than chunks of individual references). This bug is likely to be fixed at some point.
        .enumerate()
        .for_each(|(chunk_number, mut chunk)| {
            let base = chunk_number * chunk_size;
            let col_range = base..base + chunk_size;
let col_range = base..base + chunk.len();
Otherwise col_range will be out of bounds when chunk_size doesn't divide result.len(). You are protected from this bug here because col will only ever be in range, but better to do it right.
let base = chunk_number * chunk_size should still be correct, as only the last chunk can be shorter than chunk_size (but please confirm with the rayon docs, and maybe leave a comment explaining correctness).