diff --git a/contributed/ra_dec_dc2_run2.2i_dask.ipynb b/contributed/ra_dec_dc2_run2.2i_dask.ipynb new file mode 100644 index 00000000..8059de2b --- /dev/null +++ b/contributed/ra_dec_dc2_run2.2i_dask.ipynb @@ -0,0 +1,1449 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# RA, Dec with Dask+Holoviews for DC2 Run 2.2i DR6 Object Table\n", + "### Michael Wood-Vasey (@wmwv)\n", + "### Last Verified to Run: 2021-07-26 by MWV\n", + "\n", + "After completing this Notebook you should be able to \n", + "1. Inspect the Run 2.2i DR6 RA, Dec distribution using the Object Table Parquet files and Dask. \n", + "2. Describe examples of when using Dask is key, when it's merely useful, and when there's no clear benefit.\n", + "3. Learn to adapt a function from a library to be Dask aware.\n", + "4. Use HoloViews to construct a density plot with on-the-fly re-binning.\n", + "5. Understand what computations are legitimately slow (trig) and which are trivial to compute on the fly (polynomials).\n", + "6. Read the star truth table from Parquet file and overlay on RA, Dec distribution of processed data.\n", + "\n", + "#### Run 2.2i DR6d as of 2020-08-12 includes \n", + " * 166 tracts\n", + " * 147 million objects\n", + "\n", + "Logistics:\n", + "\n", + "1. These tests were conducted on NERSC through the https://jupyter.nersc.gov interface. \n", + "Note: To enable re-rastering when zooming, use the JupyterLab Classic interface.\n", + "You can launch this from an active JupyterHub Notebook by selecting \"Help->Launch Classic Notebook\".\n", + " * Assuming that you are currently reading this Notebook in JupyterHub and have an active kernel.\n", + " * You can select the \"Running\" tab and then select the Notebook you want.\n", + " * You could instead browse through the full filesystem path under the \"Files\" tab to find your Notebook, but that's a lot more clicking. You may want to take this aproach to launch some other Notebook that's not currently running under JupyterHub.\n", + "\n", + "2. Requires:\n", + "```\n", + "healpy\n", + "holoviews\n", + "datashader\n", + "bokeh\n", + "pyarrow >= 0.13.1\n", + "```\n", + "\n", + "Up-to-date versions of each of these are available in `desc-python-bleed` kernel\n", + "\n", + "3. This was run using the `desc-python-bleed` kernel\n", + "\n", + "We directly use the DPDD Parquet files." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Import Needed Modules" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import math\n", + "import os\n", + "\n", + "import numpy as np\n", + "import pandas as pd\n", + "\n", + "import astropy.units as u\n", + "import healpy as hp\n", + "\n", + "import scipy.interpolate" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import colorcet\n", + "\n", + "import dask\n", + "from dask.distributed import Client\n", + "\n", + "from bokeh.models import HoverTool\n", + "import dask.array as da\n", + "import dask.dataframe as dd\n", + "import datashader as ds\n", + "import holoviews as hv\n", + "from holoviews.operation import histogram\n", + "from holoviews.operation.datashader import datashade, shade, dynspread, rasterize\n", + "from holoviews.plotting.util import process_cmap\n", + "from holoviews.streams import RangeXY" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "hv.extension('bokeh')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "cmap = 'viridis'" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Start our Dask Cluster\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We're only going to load the RA, Dec, so we don't need that much memory for the final product.\n", + "There's a 42 GB per-Notebook memory limit directly in the NERSC JupyterHub environment on a shared node, which we will suceed in staying under.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Start a local Dask Cluster" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can specify the number of workers, threads, and memory when we initialize the Dask client\n", + "You can also just call `Client()` and it will figure out how to use all of the available resources. However, that's not what we want if we're running the Dask cluster directly on a NERSC JupyterHub shared node. So we specify 6 workers with a memory limit of 6 GB per worker to keep under our 42 GB cap." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "client = Client(n_workers=6, memory_limit=\"6Gb\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "client" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "There is a dashboard for the Dask cluster. This dashboard provides a view into the activity of the workers, task graphs, memory usage, and some big-picture performance data. [More information about Dask Dashboards](https://docs.dask.org/en/latest/diagnostics-distributed.html)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here's a bit of magic to figure out what the URL of the Dask Dashboard is at NERSC. This is only for this use of a LOCAL Dask cluster on the JupyterHub node. We set the Dask configuration link format string to printout the current node we're on plus the `{port}` from the `client` object. Then when you `repr` the `client` you get the URL string.\n", + "\n", + "If you're running on:\n", + "* Your own copy of the data on your local machine then click on the \"Dashboard\" URL above.\n", + "* NERSC on the JupyterHub node and local cluster, click the \"Dashboard\" URL below." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "hostname = f\"{os.getenv('HOSTNAME')}-224.nersc.gov\"\n", + "dask.config.config[\"distributed\"][\"dashboard\"][\"link\"] = \"http://\" + hostname + \":{port}/status\"\n", + "client" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Load Data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# This is the central root for all data\n", + "# You can look this up in GCRCatalogs.site_config.site_rootdir.yaml for NERSC and IN2P3\n", + "# As of 2021-05-28 on NERSC this is:\n", + "desc_data_dir = \"/global/cfs/cdirs/lsst/shared\"\n", + "\n", + "# If you have a local copy of the data you can set your own base directory here.\n", + "# You will need a set of the DPDD Object Table parquet files (~112 GB) for the main part of this Notebook\n", + "# And then the star truth table (1.3 GB) for the final part at the end.\n", + "# desc_data_dir = \"\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "data_release = \"dr6\"\n", + "\n", + "run_data_dir = f\"DC2-prod/Run2.2i/dpdd/Run2.2i-{data_release}/object_dpdd_only\"\n", + "data_path = os.path.join(desc_data_dir, run_data_dir)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(data_path)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "columns = ['ra', 'dec']" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "ddf = dd.read_parquet(data_path, columns=columns, engine='pyarrow')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Object Density in RA, Dec" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "DC2 Run 2.x WFD and DDF regions\n", + "https://docs.google.com/document/d/18nNVImxGioQ3tcLFMRr67G_jpOzCIOdar9bjqChueQg/view\n", + "https://github.com/LSSTDESC/DC2_visitList/blob/master/DC2visitGen/notebooks/DC2_Run2_regionCoords_WFD.ipynb\n", + "\n", + "| Location | RA (degrees) | Dec (degrees) | RA (degrees) | Dec (degrees) |\n", + "|:----------------- |:------------ |:------------- |:------------ |:------------- |\n", + "| Region | WFD | WFD | DDF | DDF |\n", + "| Center | 61.856114 | -35.79 | 53.125 | -28.100 |\n", + "| North-East Corner | 71.462228 | -27.25 | 53.764 | -27.533 |\n", + "| North-West Corner | 52.250000 | -27.25 | 52.486 | -27.533 |\n", + "| South-West Corner | 49.917517 | -44.33 | 52.479 | -28.667 |\n", + "| South-East Corner | 73.794710 | -44.33 | 53.771 | -28.667 |\n", + "\n", + "(Note that the order of the rows above is different than in the DC2 papers. The order of the rows above goes around the perimeter in order.)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "dc2_run2x_wfd = [[71.462228, -27.25], [52.250000, -27.25], [49.917517, -44.33], [73.794710, -44.33]]\n", + "dc2_run2x_ddf = [[53.764, -27.533], [52.486, -27.533], [52.479, -28.667], [53.771, -28.667]]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "dc2_run2x_wfd_df = pd.DataFrame({'ra': [coord[0] for coord in dc2_run2x_wfd] + [dc2_run2x_wfd[0][0]],\n", + " 'dec': [coord[1] for coord in dc2_run2x_wfd] + [dc2_run2x_wfd[0][1]]})\n", + "dc2_run2x_ddf_df = pd.DataFrame({'ra': [coord[0] for coord in dc2_run2x_ddf] + [dc2_run2x_ddf[0][0]],\n", + " 'dec': [coord[1] for coord in dc2_run2x_ddf] + [dc2_run2x_ddf[0][1]]})" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def overlay_dc2_region(ra_dec, dc2_run2x_wfd_df=dc2_run2x_wfd_df, dc2_run2x_ddf_df=dc2_run2x_ddf_df):\n", + " # This region isn't quite a polygon. The sides should be curved.\n", + " wfd_region = hv.Path(dc2_run2x_wfd_df).opts(color='red')\n", + " ddf_region = hv.Path(dc2_run2x_ddf_df).opts(color='orange')\n", + " ra_dec = ra_dec * wfd_region * ddf_region\n", + "\n", + " max_delta_ra = dc2_run2x_wfd_df['ra'][3] - dc2_run2x_wfd_df['ra'][2]\n", + " delta_dec = dc2_run2x_wfd_df['dec'][1] - dc2_run2x_wfd_df['dec'][3]\n", + " grow_buffer = 0.05\n", + "\n", + " # Notice that these are specified in increasing RA left->right\n", + " # We rely on the invert_xaxis True above to flip this in the display\n", + " # It's important to get this right because these ranges are used for data selection\n", + " # and then the range is flipped in the display.\n", + " ra_dec.opts(xlim=(dc2_run2x_wfd_df['ra'][2] - max_delta_ra * grow_buffer,\n", + " dc2_run2x_wfd_df['ra'][3] + max_delta_ra * grow_buffer))\n", + " ra_dec.opts(ylim=(dc2_run2x_wfd_df['dec'][3] - delta_dec * grow_buffer,\n", + " dc2_run2x_wfd_df['dec'][1] + delta_dec * grow_buffer))\n", + "\n", + " return ra_dec" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def overlay_dc2_region_ddf(ra_dec, dc2_run2x_wfd_df=dc2_run2x_wfd_df, dc2_run2x_ddf_df=dc2_run2x_ddf_df):\n", + " # This region isn't quite a polygon. The sides should be curved.\n", + " ddf_region = hv.Path(dc2_run2x_ddf_df).opts(color='orange')\n", + " ra_dec = ra_dec * ddf_region\n", + "\n", + " max_delta_ra = dc2_run2x_wfd_df['ra'][3] - dc2_run2x_wfd_df['ra'][2]\n", + " delta_dec = dc2_run2x_wfd_df['dec'][1] - dc2_run2x_wfd_df['dec'][3]\n", + " grow_buffer = 0.05\n", + "\n", + " # Notice that these are specified in increasing RA left->right\n", + " # We rely on the invert_xaxis True above to flip this in the display\n", + " # It's important to get this right because these ranges are used for data selection\n", + " # and then the range is flipped in the display.\n", + " ra_dec.opts(xlim=(dc2_run2x_wfd_df['ra'][2] - max_delta_ra * grow_buffer,\n", + " dc2_run2x_wfd_df['ra'][3] + max_delta_ra * grow_buffer))\n", + " ra_dec.opts(ylim=(dc2_run2x_wfd_df['dec'][3] - delta_dec * grow_buffer,\n", + " dc2_run2x_wfd_df['dec'][1] + delta_dec * grow_buffer))\n", + "\n", + " return ra_dec" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def plot_ra_dec(df, dc2_run2x_wfd_df=dc2_run2x_wfd_df, dc2_run2x_ddf_df=dc2_run2x_ddf_df,\n", + " show_dc2_region=True, cmap=\"bmy\", bins=100, cmin=10):\n", + " \"\"\"Show rasterized RA, Dec object density.\n", + " \n", + " We're just doing this on a rectilinear grid\n", + " The distortion is noticeable from the lowest to highest Dec in the change in density due to the change in area.\"\"\"\n", + " points_ra_dec = hv.Points(df, kdims=[hv.Dimension('ra', soft_range=(dc2_run2x_wfd[2][0], dc2_run2x_wfd[3][0])),\n", + " hv.Dimension('dec', soft_range=(dc2_run2x_wfd[3][1], dc2_run2x_wfd[1][1]))])\n", + " # We have to define the colormap here now, because the opts aren't passed through the datashade->Points.\n", + " # See, e.g., https://github.com/holoviz/holoviews/issues/4125\n", + " ra_dec = datashade(points_ra_dec, cmap=process_cmap(cmap, provider=\"colorcet\"), precompute=True)\n", + " ra_dec = ra_dec.opts(invert_xaxis=True) # Flip to East left\n", + "# ra_dec = ra_dec.opts(precompute=True)\n", + " if show_dc2_region:\n", + " ra_dec = overlay_dc2_region(ra_dec, dc2_run2x_wfd_df=dc2_run2x_wfd_df, dc2_run2x_ddf_df=dc2_run2x_ddf_df)\n", + " \n", + " return ra_dec" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "ra_dec = plot_ra_dec(ddf)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "ra_dec.opts(width=800, height=700)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# In principle the following should save a PNG of the plot\n", + "# But in practice I find that I can't get it to reliably work at NERSC.\n", + "# It requires the `selenium` package and specific web browers that it can drive from the command line.\n", + "\n", + "# hv.save(ra_dec, 'DC2_Run2.2i_DR6_ra_dec.png', fmt='png')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For exaples of specifying hover-over tools in Bokeh, see:\n", + "\n", + "https://holoviews.org/user_guide/Plotting_with_Bokeh.html\n", + "\n", + "https://docs.bokeh.org/en/latest/docs/user_guide/tools.html\n", + "\n", + "https://holoviz.org/tutorial/Large_Data.html" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The overall object density distribution looks good.\n", + "\n", + "Notes:\n", + "* If you are viewing this through a direct JupyterLab connection (Jupyter Classic Notebook, or separately on your own machine or setup), the plot will re-raster as you zoom in and out. This functionality is not available within the JupyterHub environment. JupyterHub doesn't allow the JavaScript callbacks in the browser back to the server that are necessary to do the re-rastering.\n", + "* We explicitly excluded the tracts that overlap the DDF region (orange square upper-right corner).\n", + "* There are also a few patches that failed within the main region.\n", + "* The saved files are significant cropped. I don't understand what's going on.\n", + "\n", + "See the input visit coverage map here: \n", + "https://github.com/LSSTDESC/ImageProcessingPipelines/issues/97#issuecomment-498303504\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Mollweide Projection (Equal-Area)\n", + "\n", + "We next will set up an equal-area projection so that our object density plot doesn't have a `1/cos(dec)` gradient.\n", + "\n", + "We could do this with GeoViews, but that has some significant dependencies and it's still not easy to get the full Points adjustable zoom/reraster behavior we want. Instead we're going to define the projections ourself, with a little help from HealPy (which is already part of the `desc-python` environment) to get the rotation matrices. HealPy itself isn't Dask-aware so we have to rewrite the equivalent functions to be compatibile with delayed execution." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Define the projection to be at the center of the WFD region:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "dc2_run2x_wfd_center = [(dc2_run2x_wfd_df['ra'][0] + dc2_run2x_wfd_df['ra'][1])/2,\n", + " (dc2_run2x_wfd_df['dec'][0] + dc2_run2x_wfd_df['dec'][2])/2] \n", + "mollweide = hp.projector.MollweideProj(rot=(dc2_run2x_wfd_center[0], dc2_run2x_wfd_center[1]))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The projection from RA, Dec -> projected X, Y involves root finding, which is somewhat expensive. But the actual solution is smoothly varying so we set up an interpolation routine built on the set of points in the HealyPy MollweideProj object:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def make_linear_interpolator(mollweide):\n", + " X, Y = mollweide._MollweideProj__molldata\n", + " return scipy.interpolate.interp1d(X, Y, bounds_error=False, fill_value=(Y[0], Y[-1]))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Define our linear interpolating function and rotation matrix" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "lininterp = make_linear_interpolator(mollweide)\n", + "rotmat = mollweide.rotator._matrix" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# dir2vec from healpy.rotator\n", + "# Rewritten to work for Dask\n", + "def dir2vec(theta, phi):\n", + " lon, lat = theta, phi\n", + " theta, phi = np.pi / 2 - da.radians(lat), da.radians(lon)\n", + " ct, st, cp, sp = da.cos(theta), da.sin(theta), da.cos(phi), da.sin(phi)\n", + " vx, vy, vz = st * cp, st * sp, ct\n", + " return vx, vy, vz\n", + "\n", + "def vec2dir(vx, vy, vz):\n", + " r = da.sqrt(vx ** 2 + vy ** 2 + vz ** 2)\n", + " theta = da.arccos(vz / r)\n", + " phi = da.arctan2(vy, vx)\n", + " \n", + " return theta, phi\n", + "\n", + "def vec2xy(vx, vy, vz, mollweide):\n", + " rotmat = mollweide.rotator._matrix\n", + " \n", + " # MWV: I think we have to trigger this computing the lengths to get the partitions aligned\n", + " # because while the Dask Series contain the number of partitions,\n", + " # it does not have the size of the partitions.\n", + " coords = da.stack([vx.to_dask_array(lengths=True),\n", + " vy.to_dask_array(lengths=True),\n", + " vz.to_dask_array(lengths=True)])\n", + " vxp, vyp, vzp = da.tensordot(rotmat, coords, axes=(1, 0))\n", + "\n", + " theta, phi = vec2dir(vxp, vyp, vzp)\n", + " \n", + " phi = (phi + np.pi) % (2 * np.pi) - np.pi\n", + " lat = (np.pi / 2) - theta\n", + " \n", + " phi = phi.to_dask_dataframe(index=vx.index)\n", + " lat = lat.to_dask_dataframe(index=vx.index)\n", + "\n", + " # Wrap the result of the SciPy interpolation function as a Dask Array\n", + " A = dd.map_partitions(lininterp, lat, meta=('A', 'float64'))\n", + " \n", + " flip = mollweide._flip\n", + "\n", + " x = flip * (2 / np.pi) * phi * da.cos(A)\n", + " y = da.sin(A)\n", + "\n", + " return x, y\n", + " \n", + "def moll_ang2xy(theta, phi, mollweide):\n", + " vx, vy, vz = dir2vec(theta, phi)\n", + " return vec2xy(vx, vy, vz, mollweide)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We could do some fancy polygon shape math to get the shape of the regions corrected in the Mollweide projection, but instead we'll just do the simple thing and sample lots of points around the edges and then reproject those points to create our WFD and DDF outline overlays." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def fill_in_sides_from_corners(x, y, n=100):\n", + " edges_x = []\n", + " edges_y = []\n", + " for start, end in zip(x[:-1], x[1:]):\n", + " edges_x.extend(np.linspace(start, end, n))\n", + " for start, end in zip(y[:-1], y[1:]):\n", + " edges_y.extend(np.linspace(start, end, n))\n", + " \n", + " return edges_x, edges_y" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def get_overlay(df, mollweide, color='red', **kwargs):\n", + " df['ra'], df['dec']\n", + " edges_ra, edges_dec = fill_in_sides_from_corners(df['ra'], df['dec'], **kwargs)\n", + " edges_x, edges_y = mollweide.ang2xy(edges_ra, edges_dec, lonlat=True)\n", + " \n", + " return hv.Path((edges_x, edges_y))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "wfd_outline = get_overlay(dc2_run2x_wfd_df, mollweide).opts(color='red')\n", + "ddf_outline = get_overlay(dc2_run2x_ddf_df, mollweide).opts(color='orange')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def plot_ra_dec_mollweide(df, dc2_run2x_wfd_df=dc2_run2x_wfd_df, dc2_run2x_ddf_df=dc2_run2x_ddf_df,\n", + " show_dc2_region=True, cmap=\"bmy\", cmin=10,\n", + " also_return_mollweide=False):\n", + " \"\"\"Use a Mollweide projection to get equal-area densities in the aggregation.\n", + " \n", + " also_return_mollweide: [bool] Return both the holoviews map and the mollweide projection object as a tuple\n", + " \"\"\"\n", + "\n", + " dc2_run2x_wfd_center = [(dc2_run2x_wfd_df['ra'][0] + dc2_run2x_wfd_df['ra'][1])/2,\n", + " (dc2_run2x_wfd_df['dec'][0] + dc2_run2x_wfd_df['dec'][2])/2] \n", + " mollweide = hp.projector.MollweideProj(rot=(dc2_run2x_wfd_center[0], dc2_run2x_wfd_center[1]))\n", + " x, y = moll_ang2xy(df['ra'], df['dec'], mollweide=mollweide)\n", + " ddf= df.assign(x=x, y=y)\n", + " \n", + " points_ra_dec = hv.Points(ddf, ['x', 'y'])\n", + "\n", + " # We have to define the colormap here now, because the opts aren't passed through the datashade->Points.\n", + " # See, e.g., https://github.com/holoviz/holoviews/issues/4125\n", + "# ra_dec = datashade(points_ra_dec, cmap=process_cmap(cmap, provider=\"colorcet\"))\n", + " ra_dec = rasterize(points_ra_dec, width=1080, height=1080)\n", + " \n", + " if also_return_mollweide:\n", + " return ra_dec, mollweide\n", + " else:\n", + " return ra_dec" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Use the mollweide projection we defined above to be at the center of the DC2 WFD region\n", + "ra_dec_moll, mollweide = plot_ra_dec_mollweide(ddf, also_return_mollweide=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here we're going to invent some guiding RA, Dec ticks instead of the Mollweide projected-space x, y.\n", + "These aren't rigid gridelines, but they help guide us around the outside of the region." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def generate_ra_dec_tick_labels(mollweide, n_ra=14, n_dec=10, ra_range=(74, 48), dec_range=(-45, -27)):\n", + " major_ticks_ra = np.linspace(ra_range[0], ra_range[1], n_ra)\n", + " major_ticks_dec = np.linspace(dec_range[0], dec_range[1], n_dec)\n", + " # If you set the Dec to the be the rotation center you get Delta x steps that are constant\n", + " left_ra = np.zeros(n_dec) + dc2_run2x_wfd_df['ra'][0] # It doesn't matter what this is, because it doesn't affect Dec.\n", + " bottom_dec = np.zeros(n_ra) + dc2_run2x_wfd_df['dec'][2]\n", + " major_ticks_x, _ = mollweide.ang2xy(major_ticks_ra, bottom_dec, lonlat=True)\n", + " # RA doesn't matter for Dec\n", + " _, major_ticks_y = mollweide.ang2xy(left_ra, major_ticks_dec, lonlat=True)\n", + "\n", + " major_ticks_and_labels_x = [(x, f\"{ra:0.0f}\") for x, ra in zip(major_ticks_x, major_ticks_ra)]\n", + " major_ticks_and_labels_y = [(y, f\"{dec:0.0f}\") for y, dec in zip(major_ticks_y, major_ticks_dec)]\n", + " \n", + " return major_ticks_and_labels_x, major_ticks_and_labels_y\n", + "\n", + "\n", + "def decorate_ra_dec_plot(ra_dec_plot, mollweide=mollweide):\n", + " \"\"\"\"\n", + " Set up the color bar, color range, clipping and color map and set the size of our plot\n", + " \"\"\"\n", + " major_ticks_and_labels_x, major_ticks_and_labels_y = generate_ra_dec_tick_labels(mollweide)\n", + " ra_dec_plot = ra_dec_plot.opts(xlabel='RA', ylabel='Dec', xticks=major_ticks_and_labels_x, yticks=major_ticks_and_labels_y)\n", + " ra_dec_plot = ra_dec_plot.opts(hv.opts.Image(colorbar=True, clim=(10, None), clipping_colors={'min': 'gray'},\n", + " cmap=process_cmap(\"viridis\", provider=\"matplotlib\")))\n", + " ra_dec_plot = ra_dec_plot.opts(width=480, height=400)\n", + "\n", + " return ra_dec_plot" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "ra_dec_moll = decorate_ra_dec_plot(ra_dec_moll, mollweide)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "dc2_ra_dec_coverage = ra_dec_moll * wfd_outline * ddf_outline" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "dc2_ra_dec_coverage" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Doesn't work on NERSC\n", + "# hv.save(dc2_ra_dec_coverage, \"dc2_ra_dec_coverage.html\", backend=\"bokeh\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Cool! We did the projection, it ran across our workers and we can even zoom in/and out.\n", + "\n", + "If you're running through the classic Jupyter Notebook then javascript can make callbacks and zooming in and out will dynamically rebin. You can watch what happens with the Dask Dashboard when you zoom in and out.\n", + "\n", + "...\n", + "\n", + "So if you're watching the Dashboard you will see that `read_parquet` is called across 166 partition. It's re-reading the data each time! Now, this is all may be in various network file system or local caches and it isn't as painful as one might fear, but still, this seems very much not what we wanted.\n", + "\n", + "The key is that Dask doesn't know that you're not done. It assumed that when you showed the plot (or slightly more specifically, once `rasterize` calculated the aggregate values) you were done and didn't need the data any more so it just dropped it. Zooming around triggers a re-read." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "So next let's persist the data in memory to avoid the re-reading." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "ddf = ddf.persist()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The `persist` is non-blocking. You should see blocks of `read_parquet` tasks executing in the Dask Dashboard, but you can keep running this Notebook." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "new_ra_dec_moll, molleweide = plot_ra_dec_mollweide(ddf, also_return_mollweide=True)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "new_ra_dec_moll = decorate_ra_dec_plot(new_ra_dec_moll, mollweide)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "new_ra_dec_moll" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Huh. Zooming in and out is not particularly faster. If we look at the Dask processing, we see that's because there's still a lot of computation in the geometric projection.\n", + "\n", + "Let's see if pulling out and persisting those columns makes things faster." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "x, y = moll_ang2xy(ddf['ra'], ddf['dec'], mollweide=mollweide)\n", + "ddf = ddf.assign(x=x, y=y)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "ddf = ddf.persist()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "ddf" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "points_ra_dec_moll = hv.Points(ddf, ['x', 'y'])\n", + "persisted_ra_dec_moll = rasterize(points_ra_dec_moll, width=1080, height=1080)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "persisted_ra_dec_moll = decorate_ra_dec_plot(persisted_ra_dec_moll, mollweide)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "persisted_ra_dec_moll" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "And now the zooming is much faster because everything is in memory. There's only a blip of activity in Dask to do the aggregation (rebinnng). The only really notable thing is that with an extra two columns ('x', 'y') we now have 16 GB of memory used across the workers instead of 8 GB for ('ra', 'dec')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Wait, as long as we're just persisting all of the data from the 'ra', 'dec' columns and then the 'x', 'y' projected columns to memory, do we even need to use Dask?\n", + "\n", + "YES. The projection calculation is compute-intensive enough that Dask does a better job of using the available CPUs to do the calculation. If you do a calculation directly on a Pandas DataFrame (instead of a Dask DataFrame), it will be slower. This is because Pandas is calling `numpy`, which only uses at most a few threads effectively in its calls to the the backend OpenBLAS/Intel MKL/ATLAS that is actually doing the calculation. Whereas Dask can distribute separate Python processes across the workers, each of which could use the same level of multi-threading as the single process.\n", + "\n", + "And, yes, the calculation itself that we're doing can probably be better optimized to allow `numpy` to figure out what are the array operations. But the more general point is that it's easy to have functions that work fine on 1-4 threads. Whereas diving in to the full details of what functions in `numpy` are parallelized across how many threads and to what level in different compilations of underlying libraries is a rich explorational experience. But the overhead to use multiple processes to do this in parallel is generally small; Dask is one easy way to build and schedule the work graph to do this work." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%time\n", + "x, y = moll_ang2xy(ddf['ra'], ddf['dec'], mollweide=mollweide)\n", + "print(x.mean(), y.mean()) # Trigger the computation" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If we do this all just in straight Pandas we would do:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%time\n", + "df = pd.read_parquet(data_path, columns=columns, engine='pyarrow')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The read is fast enough because `pyarrow` automatically uses the available CPUs to do parallel reads. It looks up how many available cores it has with `pyarrow.cpu_count()` and uses all of them.\n", + "\n", + "However, if either `OMP_NUM_THREADS` or `OMP_THREADS_LIMIT` is set, it will use that value. So if you are already set up to be in a multi-processing setup, you might find that these have been already set to as low as 1 and this `pandas.read_parquet` may in fact be somewhat slower." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import pyarrow\n", + "pyarrow.cpu_count()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "df" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's try the projection calculation:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%time\n", + "df['x'], df['y'] = mollweide.ang2xy(df['ra'], df['dec'], lonlat=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Ouch! I got 48 seconds on my 3GHz 8-core Xeon E5 desktop and 1 min 50 seconds on NERSC.\n", + "\n", + "Well, maybe that's just because the HealPy projection function isn't optimized well for such large arrays. Well, this both seems somewhat unlikely as that's essentially a key HealPy usecase, but also we already know that that's not obviously true because we've just read that code above. The Dask-aware transformation code above is essentially the code taken from HealPy and it's all reasonable good array-based numpy. In particularly, `numpy.tensordot` is _really_ optimized in most underlying OpenBLAS/MKL implementations.\n", + "\n", + "Now trig functions can be expensive, and there are some more compact ways of writing some of the operations that would reduce the number of trig function calls, but there's nothing really particularly obviously wrong with the original HealPy code.\n", + "\n", + "But still, we should compare apple to apples and write out the transformation functions just like we did above for the Dask-aware case." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# dir2vec from healpy.rotator\n", + "# Rewritten to work for Dask\n", + "def pd_dir2vec(theta, phi):\n", + " lon, lat = theta, phi\n", + " theta, phi = np.pi / 2 - np.radians(lat), np.radians(lon)\n", + " ct, st, cp, sp = np.cos(theta), np.sin(theta), np.cos(phi), np.sin(phi)\n", + " vx, vy, vz = st * cp, st * sp, ct\n", + " return vx, vy, vz\n", + "\n", + "def pd_vec2dir(vx, vy, vz):\n", + " r = np.sqrt(vx ** 2 + vy ** 2 + vz ** 2)\n", + " theta = np.arccos(vz / r)\n", + " phi = np.arctan2(vy, vx)\n", + " \n", + " return theta, phi\n", + "\n", + "def pd_vec2xy(vx, vy, vz, mollweide):\n", + " rotmat = mollweide.rotator._matrix\n", + " \n", + " # MWV: I think we have to trigger this computing the lengths to get the partitions aligned\n", + " # because while the Dask Series contain the number of partitions,\n", + " # it does not have the size of the partitions.\n", + " coords = [vx, vy, vz]\n", + " vxp, vyp, vzp = np.tensordot(rotmat, coords, axes=(1, 0))\n", + "\n", + " theta, phi = pd_vec2dir(vxp, vyp, vzp)\n", + " \n", + " phi = (phi + np.pi) % (2 * np.pi) - np.pi\n", + " lat = (np.pi / 2) - theta\n", + " \n", + " # Wrap the result of the SciPy interpolation function as a Dask Array\n", + " A = lininterp(lat)\n", + " \n", + " flip = mollweide._flip\n", + "\n", + " x = flip * (2 / np.pi) * phi * np.cos(A)\n", + " y = np.sin(A)\n", + "\n", + " return x, y\n", + " \n", + "def pd_moll_ang2xy(theta, phi, mollweide):\n", + " vx, vy, vz = pd_dir2vec(theta, phi)\n", + " return pd_vec2xy(vx, vy, vz, mollweide)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%time\n", + "df['x'], df['y'] = pd_moll_ang2xy(df['ra'], df['dec'], mollweide=mollweide)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "I got 36 seconds on my 3GHz 8-core Xeon E5 desktop and 1 min 35 seconds on NERSC through the JupyterHub shared node.\n", + "\n", + "So, this is a little better, but this is still _much_ slower than the 2.5 seconds when using all of the available cores with Dask." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "And it really is this calculation that is the dominant time. The HoloViews and datashader `Points` and `rasterize` are trivially fast (10 - 50 ms) because they don't actually do the aggregation/binning calculation yet." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%time\n", + "pd_points_ra_dec_moll = hv.Points(df, ['x', 'y'])\n", + "pd_ra_dec_moll = rasterize(pd_points_ra_dec_moll, width=1080, height=1080)\n", + "pd_ra_dec_moll = decorate_ra_dec_plot(pd_ra_dec_moll, mollweide)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "And then the actual aggregation happens when we display or zoom the plot and only takes a few seconds." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "pd_ra_dec_moll" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Dask is somewhat obviously absolutely necessary when you're dealing with a dataset that doesn't fit in memory.\n", + "\n", + "But it's also a really useful way of easily constructing and executing work across multiple cores, even if we're just on one machine." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Calculate the Area Covered" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def calculate_area(df, threshold=0.25, nside=1024, verbose=False):\n", + " \"\"\"Calculate the area covered by a catalog with 'ra', 'dec'\n", + " \n", + " Parameters:\n", + " --\n", + " cat: DataFrame, dict-like with 'ra', 'dec', keys\n", + " threshold: float\n", + " Fraction of median value required to count a pixel.\n", + " nside: int\n", + " Healpix NSIDE. NSIDE=1024 is ~12 sq arcmin/pixel, NSIDE=4096 is 0.74 sq. arcmin/pixel\n", + " Increasing nside will decrease calculated area as holes become better resolved \n", + " and relative Poisson fluctuations in number counts become more significant.\n", + " verbose: bool\n", + " Print details on nside, number of significant pixels, and area/pixel.\n", + " \n", + " Returns:\n", + " --\n", + " area: Astropy Quantity.\n", + " \"\"\"\n", + " import healpy as hp\n", + "\n", + " # MWV: The following line of code makes me sad, but \n", + " # We need to make a matching DataFrame for nside to satisfy the conservative\n", + " # Pandas 1.2.4 requirement that all ufuncs have arguments of the same type.\n", + " # `ang2pix` takes `nside`, `ra`, `dec` and so each of those need to be of the same type.\n", + " # That means we need to take our simple `int` nside and convert it to a Dask Series.\n", + " # We explicitly base it off the df['ra'] so that the partitions are automatically propagated\n", + " # And then set the value with 'nside' and cast to int.\n", + " nside_ds = (nside + 0 * df['ra']).astype(int)\n", + "\n", + " indices = hp.ang2pix(nside_ds, df['ra'], df['dec'], lonlat=True)\n", + " idx, counts = np.unique(indices, return_counts=True)\n", + " \n", + " # Take the 25% of the median value of the non-zero counts/pixel\n", + " threshold_counts = threshold * np.median(counts)\n", + "\n", + " if verbose:\n", + " print(f'Median {np.median(counts)} objects/pixel')\n", + " print(f'Only count pixels with more than {threshold_counts} objects')\n", + "\n", + " significant_pixels, = np.where(counts > threshold_counts)\n", + " area_pixel = hp.nside2pixarea(nside, degrees=True) * u.deg**2\n", + "\n", + " if verbose:\n", + " print(f'Pixel size ~ {hp.nside2resol(nside, arcmin=True) * u.arcmin:0.2g}')\n", + " print(f'nside: {nside}, area/pixel: {area_pixel:0.4g}, num significant pixels: {len(significant_pixels)}')\n", + "\n", + " area = len(significant_pixels) * area_pixel\n", + "\n", + " if verbose:\n", + " print(f'Total area: {area:0.7g}')\n", + " \n", + " return area" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This is, in fact, a somewhat expensive calculation. It takes about 20 seconds on my desktop and 50 seconds on the shared NERSC JupyterHub node." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%time\n", + "area_dc2 = calculate_area(ddf)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%time\n", + "area_dc2 = calculate_area(df)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "There's no real difference between the `ddf` and the `df` because the HealPy `ang2pix` call itself is not Dask aware. So all of the processing is single-processor in either case.\n", + "\n", + "In some runs I saw worse performance with `ddf`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(f'DC2 Run 2.2i area: {area_dc2:0.2f}')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(f'Average density: {len(df)/area_dc2.to(\"arcmin**2\")}')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Are the low-density points coincident with the bright stars?\n", + "\n", + "There are clear pixels in the above map that have a notably lower density. Are these from bright stars in those regions that are masking out a number of objects? Not just saturated stars, but really bright ones. This would be at a smaller resolution than the pixel size of the density map.\n", + "\n", + "Let's check the truth catalog with GCR Catalogs:\n", + "`dc2_truth_run2.2i_star_truth_summary`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "star_truth_catalog_name = \"dc2_truth_run2.2i_star_truth_summary.yaml\"\n", + "\n", + "star_truth_dir = os.path.join(desc_data_dir, \"DC2-prod\", \"Run2.2i\", \"truth\", \"startruth\")\n", + "star_truth_file = \"star_truth_summary.parquet\"\n", + "star_truth_filepath = os.path.join(star_truth_dir, star_truth_file)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def convert_nanoJansky_to_mag(flux):\n", + " \"\"\"Convert calibrated nanoJansky flux to AB mag.\n", + " \"\"\"\n", + " #pylint: disable=C0103\n", + " AB_mag_zp_wrt_Jansky = 8.90 # Definition of AB\n", + " # 9 is from nano=10**(-9)\n", + " #pylint: disable=C0103\n", + " AB_mag_zp_wrt_nanoJansky = 2.5 * 9 + AB_mag_zp_wrt_Jansky\n", + "\n", + " return -2.5 * np.log10(flux) + AB_mag_zp_wrt_nanoJansky" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Check that the conversion makes sense:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "convert_nanoJansky_to_mag(1e6)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The entire Star Truth summary file is only 1.3 GB and we're just pulling out 5 columns. So we can just do this directly in a Pandas DataFrame." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "all_stars = pd.read_parquet(star_truth_filepath,\n", + " columns=['ra', 'dec', 'flux_g', 'flux_r', 'flux_i'])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "for f in 'g', 'r', 'i':\n", + " all_stars[f\"mag_{f}\"] = convert_nanoJansky_to_mag(all_stars[f\"flux_{f}\"])\n", + " \n", + "all_stars[\"g-r\"] = all_stars[\"mag_g\"] - all_stars[\"mag_r\"]\n", + "all_stars[\"r-i\"] = all_stars[\"mag_r\"] - all_stars[\"mag_i\"]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "color_color_points = hv.Points(all_stars, kdims=[\"g-r\", \"r-i\"])\n", + "color_color = datashade(color_color_points)\n", + "\n", + "color_mag_points = hv.Points(all_stars, kdims=[\"g-r\", \"mag_r\"])\n", + "color_mag = datashade(color_mag_points)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "color_color + color_mag" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "There are two surprising features of the left plot:\n", + "1. The streaks up and to the right.\n", + " These are the finite M-stars models that are then reddened by different amounts of dust, leading to the streaking. This same reddening affects the other stars as well, but the warmer and hotter stars are distributed along a line parallel to the reddening vector (as an interesting piece of astrophysics that's the subject of a lecture in Astro classes).\n", + "2. The curving downward.\n", + " These are various cool white dwarf models, some a bit more theoretical than observed." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Crude simple selection of stars in area.\n", + "# We're being generous instead of doing the geometry precisely\n", + "\n", + "min_ra, max_ra = dc2_run2x_wfd[2][0], dc2_run2x_wfd[3][0]\n", + "min_dec, max_dec = dc2_run2x_wfd[2][1], dc2_run2x_wfd[0][1]\n", + "\n", + "stars = all_stars[(min_ra < all_stars[\"ra\"]) & (all_stars[\"ra\"] < max_ra) &\n", + " (min_dec < all_stars[\"dec\"]) & (all_stars[\"dec\"] < max_dec)]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "bright_stars = stars[stars[\"mag_r\"] < 10]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "len(bright_stars)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "x, y = pd_moll_ang2xy(bright_stars['ra'], bright_stars['dec'], mollweide=mollweide)\n", + "bright_stars_ra_dec = hv.Points((x, y))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# We can use the same xticks, yticks <-> RA, Dec mapping as from above.\n", + "bright_stars_ra_dec = decorate_ra_dec_plot(bright_stars_ra_dec)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "bright_stars_ra_dec.opts(color='red')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can again use composition in HoloViews to display these stars on top of our previous RA, Dec coverage map. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "persisted_ra_dec_moll * bright_stars_ra_dec" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can see how the star selection was a bit generous because we were selecting easy-to-specify limits in RA, Dec instead of getting the geometry exactly right." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Navigating around with the Bokeh UI tools, I conclude that the bright stars aren't obviously responsible for the major visible divots at the large scale.\n", + "\n", + "But it was useful to think about how to extract items from the truth catalog and overlay them on the density plot." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "desc-python-bleed", + "language": "python", + "name": "desc-python-bleed" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.10" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +}