Demonstrate how to use Dask+Holoviews to show RA, Dec density map by wmwv · Pull Request #158 · LSSTDESC/DC2-analysis

wmwv · 2021-05-14T18:14:33Z

Uses Dask+HoloViews to visualize RA, Dec using a local Dask cluster.

Can you

- try this out and confirm that it works for you
- offer suggestions for anything to do this better

wmwv · 2021-05-28T17:58:20Z

@cwwalter I added some more pedagogical material giving more details of Dask vs. Pandas DataFrame comparisons with discussions of persist, memory usage, disk I/O, and computation speeds.

I rewrote the star truth table comparison to no longer use GCRCatalogs and instead just directly read the star truth table parquet file into a Pandas DataFrame.

wmwv · 2021-06-09T17:10:31Z

@cwwalter Will you have a chance to look at this PR in the next week?

cwwalter

Looks good. There are specific pedagogical / code comments below.

I think my only general comment is that this notebook mixes two (or three) semi-complicated things to learn together: sky projection, using HoloViews for interactive plotting/rasterizing, and Dask. So here are some things you might think about; take them or leave them as you will.

I think that the part you need to spend most of your time on to work through the notebook is the non-Dask code and, because of that, some of what you are doing with Dask is a bit subsumed. I'm not sure people will really come away with a handle on the dashboard, persist, etc. Especially with some of the added sections at the end, the focus is more on those other parts.

It might be worth considering building up to this in three notebooks:

1. Basic Dask intro
2. Basic HoloViews plotting (not necessarily using Dask)
3. Sky projection etc. using Dask and HoloViews

cwwalter · 2021-06-15T17:24:03Z

+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "client = Client()"


I suggest doing this to be explicit about how many cores and the amount of memory you are using.

from dask.distributed import LocalCluster
cluster = LocalCluster(n_workers=6, threads_per_worker=1, memory_limit='6Gb')
client = Client(cluster)

Note you then need to tell people how to access the dashboard.

Related: you tell people to watch the Dask Dashboard "when you zoom in and out", but never explicitly tell them how to start it up.

I actually don't know how to access the Dask Dashboard for "Local" Dask cluster running on JupyterHub at NERSC. The proxy service URL seems to only work for a scheduler running on a compute Node, not on the JupyterHub Node.

Do you know?

Yes, I put it in my notes for using Dask. It took me a bit to figure out, but say you were on cori16 (check by starting a terminal). Then it would be

cori16-224.nersc.gov:8787

Replace 16 with whichever machine you are on.

Also, when you put just a client line there, the cell output shows a dashboard address starting with 127.0.0.1. You are replacing the 127 part. If a bunch of people are using Dask, then the port might increment too.

You don't want/need to use the proxy command thing in that case BTW.

That URL gets me the link to the Notebook, but not the Dashboard.

Specifically, I am on cori16 and the Dashboard was launched on 8787, but when I go to https://cori16-224.nersc.gov:8787/status I get (after a long pause) the Notebook I'm looking at in a new window. But that only works once. Future clicking hangs and never loads.

Hmm... I just looked at it :)

I think maybe it is not https: try just http:

Woah... Yes http:// works. Thank you! I would never have figured that out.

Great. FWIW, clue was the http://127.0.0.1 in the cell output.
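The fix worked out in this thread can be sketched as a small helper: take the local link shown in the cell output, swap the loopback address for the login node's external name, and force plain http. This is only a sketch. `external_dashboard_url` and its `domain_suffix` default are hypothetical names, and the `-224.nersc.gov` pattern is taken from the cori16 example above rather than from any NERSC documentation.

```python
from urllib.parse import urlsplit, urlunsplit


def external_dashboard_url(local_link, node, domain_suffix="-224.nersc.gov"):
    """Rewrite a loopback dashboard link so it points at the login node.

    local_link: the address from the cell output, e.g. http://127.0.0.1:8787/status
    node: short name of the machine you are on, e.g. "cori16"
    domain_suffix: assumption based on the example in this thread
    """
    parts = urlsplit(local_link)
    # Keep whatever port the scheduler picked; it may not be 8787
    # if several people are running Dask on the same node.
    port = parts.port or 8787
    host = f"{node}{domain_suffix}"
    # Force plain http, since https did not work in this thread.
    return urlunsplit(("http", f"{host}:{port}", parts.path, "", ""))


print(external_dashboard_url("http://127.0.0.1:8787/status", "cori16"))
```

In a live session the first argument would presumably come from `client.dashboard_link`, and the node name from a terminal or `socket.gethostname()`.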

cwwalter · 2021-06-15T17:25:03Z

+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "ddf = dd.read_parquet(data_path, columns=columns, engine='pyarrow', kwargs={'dataset': {'use_legacy_dataset': False}})"


I'm not sure what is going on with the kwargs here. Is it really necessary? Maybe explain that?

For pedagogical purposes something more like columns=selected_columns might be clearer.

You're right. It's not necessary here. The kwargs dict with "use_legacy_dataset": False tells pyarrow to use the newer Arrow API, which has improvements related to filtering: it can now push filtering down to the read, both saving memory and allowing for potentially faster reads if you're reading a small subset. But we're not really using this here.
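For readers curious what that pushdown would look like, here is a hedged sketch using the `filters` argument of `dd.read_parquet`. The column names and the magnitude cut are invented for illustration, and `read_bright_subset` is a hypothetical helper. Whether the filter prunes whole files, row groups, or individual rows depends on the Dask and pyarrow versions in use.

```python
# Hypothetical columns and cut, chosen only for illustration;
# they are not taken from the notebook under review.
SELECTED_COLUMNS = ["ra", "dec", "mag_r"]
BRIGHT_FILTER = [("mag_r", "<", 20.0)]


def read_bright_subset(data_path):
    """Lazily read only the selected columns, asking the reader to apply the cut."""
    # Imported here so the sketch can be defined even where Dask is not installed.
    import dask.dataframe as dd

    return dd.read_parquet(
        data_path,
        columns=SELECTED_COLUMNS,
        engine="pyarrow",
        filters=BRIGHT_FILTER,  # pushed down to the Parquet read where supported
    )
```

Calling `read_bright_subset(data_path)` returns a lazy Dask DataFrame; nothing is read until a computation such as `.compute()` or `.persist()` is triggered.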

cwwalter · 2021-06-15T17:28:14Z

+   "outputs": [],
+   "source": [
+    "# Save a PNG of the plot\n",
+    "hv.save(ra_dec, 'DC2_Run2.2i_DR6_ra_dec.png', fmt='png')"


Just a note that this didn't work for me since I didn't have the "selenium" package installed.

Yeah, this fails on NERSC. It works fine for me locally. But, in general, actually saving plots from HoloViews+Bokeh is frustratingly fragile.
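One way to keep the notebook running cleanly where PNG export is unavailable is to check for the dependency before calling `hv.save`. A minimal sketch, assuming the missing `selenium` package is the only obstacle (Bokeh's PNG export also needs a browser driver, which this does not check). `has_module` and `save_png_if_possible` are illustrative names, not part of HoloViews.

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(name) is not None


def save_png_if_possible(plot, filename):
    """Save a HoloViews plot as PNG, or skip with a message if selenium is missing."""
    if not has_module("selenium"):
        print(f"Skipping {filename}: the selenium package is not installed.")
        return False
    # Imported lazily so the check above works even without HoloViews.
    import holoviews as hv

    hv.save(plot, filename, fmt="png")
    return True
```

In the notebook this would replace the bare `hv.save(ra_dec, ...)` call, so the cell reports the skipped export instead of raising.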

cwwalter · 2021-06-15T17:30:56Z

+    "Cool!  We did the projection, it ran across our workers and we can even zoom in/and out and it will dynamically rebin.\n",
+    "\n",
+    "But watch what happens with the Dask Dashboard when you zoom in and out.\n",
+    "\n",


Actually, for me, I saw nothing happen on the dashboard, and wasn't even sure it was redoing anything. I realized later this was described above as not working in the JupyterLab environment, so I would change this text to say 'if...' or explain why they might not be seeing it.

cwwalter · 2021-06-15T17:33:32Z

+   "metadata": {},
+   "source": [
+    "Ouch!  I got 48 seconds on my 3GHz 8-core Xeon E5 desktop.\n",
+    "\n",


At NERSC this was 1 min 11 seconds. Since you used %%timeit instead of %%time, it did that 7 times. Ouch indeed! Because of this and all the following similar timing cases it takes a really long time to run through the notebook. Consider using just %%time.
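To make the difference concrete outside of a notebook, here is a rough standard-library analogue. It assumes `%%time` behaves like a single timed run and `%%timeit` like `timeit.repeat` with seven repeats of one loop each, which matches the seven runs reported above for a slow cell. `slow_step` is a made-up stand-in for the expensive computation.

```python
import time
import timeit

calls = 0


def slow_step():
    """Stand-in for an expensive cell; counts how often it is executed."""
    global calls
    calls += 1
    time.sleep(0.01)


# Analogue of %%time: run once and report the elapsed wall time.
start = time.perf_counter()
slow_step()
single_run = time.perf_counter() - start
calls_after_single = calls

# Analogue of %%timeit on a slow cell: seven repeats of one loop each.
runs = timeit.repeat(slow_step, repeat=7, number=1)

print(f"single run: {single_run:.3f} s after {calls_after_single} call")
print(f"repeated: best of {len(runs)} is {min(runs):.3f} s, {calls} calls in total")
```

The repeated measurement gives a more stable number, but it costs roughly seven times the wall time, which is what makes the notebook slow to run end to end.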

cwwalter · 2021-06-15T17:34:41Z

+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "So, 36 seconds is a little better, but this is still _much_ slower than the 2.5 seconds when using all of the available cores with Dask."


I got 57 seconds at NERSC (compared to 1:11), so even less of a difference.

cwwalter · 2021-06-15T17:35:25Z

+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Are the divots above the bright stars?\n",


I think the meaning of "divot" might be lost on many.

cwwalter · 2021-06-15T17:36:20Z

+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "cat.list_all_quantities()"


cat isn't defined here.

Oops! Thanks.

cwwalter · 2021-06-15T17:37:02Z

+   "source": [
+    "# We can use the same xticks, yticks <-> RA, Dec mapping as from above.\n",
+    "bright_stars_ra_dec = bright_stars_ra_dec.opts(xlabel='RA', ylabel='Dec',\n",
+    "                                               xticks=major_ticks_and_labels_x, yticks=major_ticks_and_labels_y)"


I get:

NameError: name 'major_ticks_and_labels_x' is not defined

Aargh. Sorry for not ensuring that this Notebook ran clean.

Demonstrate how to use Dask+Holoviews to show RA, Dec density map

77c68cf

wmwv requested a review from cwwalter May 14, 2021 18:14

wmwv added 8 commits May 20, 2021 15:34

Fix up scipy.interpolate import.

c3260dc

Better prepare data object for holoviews.Points.

1f46bef

Add exploration of Dask DF vs. simple Pandas DF performance.

6726be5

Tweak root_dir def to match GCRCatalogs rootdir.

0e549b3

Fill out more discussion.

be92743

Wrap RA, Dec plot tick mark replacement in a function.

6aa1ff0

Read star truth table from Parquet file.

22d8a9e

Add Learning Objectives.

f721981

Explain projection calculation more.

f29b62b

cwwalter reviewed Jun 15, 2021


wmwv added 10 commits June 18, 2021 11:39

Explain Dask Dashboard. Calculate link on NERSC JupyterHub LOCAL.

3993881

Comment out saving figures, which doesn't work.

3c43ce9

Use %%time instead of %%timeit.

bb31fcb

Small documentation edits.

485a570

Fix unused quantities.

d3a4879

Update Jupyter kernel to 3.8.10 for latest desc-python-bleed.

b20b7eb

Explain how to specify workers and memory usage in launching client.

268cd07

Remove use_legacy_dataset from kwargs. We don't use the new API.

6374eef

Clean up some text.

94099be

Verify notebook still runs.

f681720
