Dask version of Validation Object Table for DC2 Run 2.2i - U/wmwv/dr6 dask refactor#159
Dask version of Validation Object Table for DC2 Run 2.2i - U/wmwv/dr6 dask refactor#159wmwv wants to merge 85 commits into
Conversation
This works but the documentation says is a little crude.
|
I've tagged @cwwalter @nsevilla @SimonKrughoff for reviewers. I have some more specific questions for each of you to focus on:
|
|
@yymao @heather999 I of course always welcome your feedback and suggestions. Please feel free to check this out, but no obligation. |
|
@boutigny If you're interested, here's a fuller-fledged Dask example beyond the RA, Dec Notebook you tried out the other month. If you have a chance to run this at IN2P3, I would be interested in your experience. But no obligation. And if you know more on the specific technical level about the use of the MALLOC_TRIM_THRESHOLD memory management specification, I would be most happy to learn more. |
Pretty complete and thorough intro to using Dask for visualization/V&V, thanks. Comments:
To your concrete questions: A. I was able to run it once on Friday, on desc-python-bleed (great!) but then I ran into a missing libthrift.so.0.14.1 issue I mentioned above. I was able to run with desc-python though. B. I think, given that you decided to go full on introducing many concepts, it would be great to point to resources explaining how Dask works at NERSC. The things I have to do according to this tutorial seem to me a bit like black magic at times, maybe a couple of sentences on what the commands are doing, if you think it is appropriate here. Also, things like: what happens to the nodes I spin up to run with Dask, after I stop using them? Are they still assigned to me the following day, do I have to allocate them again? I guess these are more 'NERSC' questions that I should know, but still. C. We could definitely use this as basis for runs on complete, final or quasi-final data releases. For fast turnaround, I think there are quick ways that don't involve having the dask infrastructure (unless the overhead of having dask and compiled parquet files is compensated by this quick response). I have concrete suggestions or comments about the tests themselves, TBD at some other time. |
|
@wmwv I was looking at your question about MALLOC_TRIM_THRESHOLD. https://distributed.dask.org/en/latest/worker.html#memory-not-released-back-to-the-os Is really interesting; this was not in the documentation in January when I was trying to solve my issues. I have been doing some tests with my code again and it seems I might not be having the issue anymore. I'm still testing but this could be a change at cori, but also I wrote the healpixel into a copy of the the files so I didn't have to do in-place repartitions which was a huge memory and resource issue. Perhaps the problem was coming from there. Anyway, I will find out. But, the main thing I wanted to ask was is it you have any evidence you really need this? I had never had a problem in the past and only had this leak issue when calculating 2pt-functions on the entire skysim 5000 data set. I never saw a possible need for it before. Are you just doing this out of safety, or because you saw an issue? We should avoid it if possible since it is a bit confusing and it will degrade performance. |
|
Sadly
is not true. When I bumped the sizes up I started seeing the same thing again. Later versions of Dask's dashboard now do a better job of showing you how much unmanaged memory you have. I can see I have a lot, but I tried the "one time debugging" fix in the MALLOC_TRIM_THRESHOLD section and it didn't kill most of it. So, I have isolated the problem more, but that doesn't seem to be it. I'm also puzzling a little over how we would get that env variable set on the worker nodes on the other machines. I'll keep working.. |
Yes, my testing explicitly showed that without this set, I hit memory limits and workers were killed very often (specifically at certain steps the generated large temporary arrays). With this set, memory limits were not hit, reported memory usage was 3-4x less and the Notebook ran fine.
It is deeply confusing and I really hate having to do it. As @nsevilla says, this looks like deep black magic.
I'm not deeply concerned about this. In principle yes, but in practice I don't think the breaks to kernel to release memory are impacting performance that much. I don't see any evidence in the speed tests with and without. |
Thank you. I would be most grateful if you can figure out how to run all of this through more civilized options. I would really like to eliminate that ridiculously awkward cell with the various parts to cut and paste in different terminals and different loaded environments. |
|
OK good news is I think I see how we can easily set this in the wrapper setup script. We are already setting some other variables that are set on each worker. Still struggling with my problem.. |
This PR introduces a Dask-based Notebook to do some top-level validation of the DC2 Run2.2i DR6 Object Table.
This is based on the existing
validation/validate_dc2_run2.2i_object_table.ipynbNotebook, which uses Pandas DataFrames and is limited to the available memory on one machine, which doesn't fit all of the needed data from DC2 DR6 Object Table.Dask both allows the use of the full dataset and is actually faster for some operations as it makes better use of the available CPUs.