add faster implementations of Mean and MCKP estimators#458
Open
caelen00000 wants to merge 19 commits into
Open
add faster implementations of Mean and MCKP estimators#458caelen00000 wants to merge 19 commits into
caelen00000 wants to merge 19 commits into
Conversation
…into fast-mckp
…into fast-mckp
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Note: I accidentally hit the merge button when looking at the diff, so this whole thing is somewhat half-cooked and should definitely not be merged. However, I would like to gauge interest in the changes I've made, so I'm leaving it up.
Summary
This PR is a proof-of-concept rewrite of the Mean and MultipleChoiceKnapsack EstimationMethods. On my system ( using the --cuda and --estimator-multiple-cpu arguments) the time taken to estimate noise targets and compute denoised counts is greatly reduced. When applied to the heart10k dataset, this implementation takes ~8 seconds, compared to the main branch at ~60s, and the sf-modernization-estimation branch at ~40s.
I will note that most of the speedup vs the sf-modernization-estimation branch comes from the improved mean estimator, whereas MCKP is only slightly faster here.
Mean Estimator
I use the inverse indices returned by torch.unique(noise_log_prob_coo.row, return_inverse=True) to group the indices into noise_log_prob_coo according to the m-index value. Then torch.bincounts gets a weighted sum of probability * c within each group.
This can be applied to the entire posterior, without chunking or densifying, and works seamlessly with either CPU (runtime 1.5s) or GPU (runtime 0.68s) tensors. Compare with the original at ~30s.
As far as correctness goes, it needs more testing for sure, but tentatively looks OK. All tests pass, and I compared the sum of all estimated counts and found a difference on the order of 10-6 percent, presumably due to floating-point arithmetic differences.
MCKP Estimator
My goal here was to remove the pandas dataframes. As soon as I saw them, I had some flashbacks to the time I tried to implement XGBoost in dataframes, which, performance-wise, did not go well 🫠. Unfortunately, I hadn't seen the work already done in the modernization branch, which is probably a better solution. But, if you want to avoid adding dependencies, I have an alternative.
Algorithm Overview
Similarly to the mean estimator, torch.unique inverse indices are used to group the posterior COO, but this time according the the gene. Genes with zero probability are excluded, then genes are iterated on as follows:
Here is one of the original output plots:
This loop can be run on the GPU, but is slow. As such, I only run the initial grouping step on the GPU.
Some Thoughts on Multiprocessing
If use_multiple_processes=True, this iterative method is easy enough to chunk. The challenge lies in efficient inter-process communication. In fact, the original MultipleChoiceKnapsack.estimate_noise has a comment that mentions multiprocessing causes a slowdown (which on my PC is not true, but it is still slower than it could be). This is because each argument to _mckp_chunk_estimate_noise must be pickled and copied into each child process.
The solution to this is to load the arguments into shared memory. For arbitrary objects, this is tedious but doable, and leads to moderate speedups. Check out my early commits if you want to see an admittedly very ugly example.
Luckily, torch.multiprocessing exists: its a wrapper around the python multiprocessing library that automatically handles tensor shared memory management. In a single process, I see MCKP estimation taking ~10s, but with 14 processes, it takes ~3s.
Finding the optimal number of processes and threads per process to use still needs work and currently will require individual tuning via some hardcoded variables. By default, torch uses the number of physical cores, which in my case is 14. If you launch 6 processes, that is 84 threads competing for resources. In my testing, when using multiple processes, it it best to limit each to a single thread.
Determining the total number of processes is also an issue. On linux, I find that using the number of physical cores is best, but on Windows, using approx. half as many works better. I'm assuming this is due to increased overhead from spawning processes on Windows compared to forking on linux.
Closing Notes
Like I said earlier, I didn't intended for the world to see this code in its current state, so please don't judge it too harshly. Hopefully at least some of it will prove useful. Either way, I fun writing it and learned a lot in the process! Let me know if you want me to clean any of it up and I can submit a new PR.