diff --git a/validation/validate_dc2_run2.2i_object_table_dask.ipynb b/validation/validate_dc2_run2.2i_object_table_dask.ipynb new file mode 100644 index 00000000..523f27ab --- /dev/null +++ b/validation/validate_dc2_run2.2i_object_table_dask.ipynb @@ -0,0 +1,2490 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Inspection of DC2 Run 2.2i DR6 Object Table with Dask\n", + "### Michael Wood-Vasey (@wmwv)\n", + "### Last Verified to Run: 2021-06-17 by MWV\n", + "\n", + "Learning Objectives: After successfully running and reading this Notebook, the reader should be able to\n", + "1. Inspect the Run 2.2i DR6 Object Catalog through visual validation of key data quality checks.\n", + "2. Use a Dask Cluster with two NERSC nodes as the backend.\n", + "3. Learn some details of successfully using Dask\n", + " * Managing memory\n", + " * Structuring computations\n", + "\n", + "#### Run 2.2i DR6\n", + " * 166 tracts\n", + " * 147 million objects \n", + "\n", + "Logistics:\n", + "\n", + "1. These tests were conducted on NERSC through the https://jupyter.nersc.gov interface. \n", + "Note: To enable re-rastering when zooming, use the JupyterLab Classic interface.\n", + "You can launch this from an active JupyterHub Notebook by selecting \"Help->Launch Classic Notebook\".\n", + " * You can select the \"Running\" tab and then select the Notebook you want.\n", + " * You could instead browse through the full filesystem path under the \"Files\" tab to find your Notebook, but that's a lot more clicking. You may want to take this aproach to launch some other Notebook that's not currently running under JupyterHub.\n", + "\n", + "2. Requires:\n", + "```\n", + "dask\n", + "dask.distributed\n", + "holoviews\n", + "datashader\n", + "bokeh\n", + "pyarrow >= 0.13.1\n", + "```\n", + "\n", + "and `graphviz` if you want to see the Dask task graphs.\n", + "\n", + "Up-to-date versions of each of these are available in `desc-python-bleed` kernel\n", + "\n", + "3. This was run using the `desc-python-bleed` kernel\n", + "\n", + "We directly use the DPDD Parquet files.\n", + "\n", + "4. We use Dask, HoloViews, and Datashader to read Parquet files. For more on each of these, see:\n", + "\n", + "References: \n", + " https://dask.org \n", + " https://datashader.org \n", + " https://parquet.apache.org \n", + " https://holoviews.org \n", + " https://holoviz.org\n", + " \n", + "In brief:\n", + "\n", + "### Parquet\n", + "\n", + "Parquet is a column-based storage format that's part of a wider Arrow project to provide standardized, high-performance data representations in memory and on disk. It's commonly used in current data science and large data volume processing, and is the current selected standard for Rubin Observatory LSST Data Management on-disk representations of output data catalogs. The DESC Data Access Team is thus similarly using Parquet as the default underlying data format for representations of DC2 data as processed by the LSST DM Science Pipelines.\n", + "\n", + "### Dask\n", + "\n", + "Dask allows us to do processing by dividing tasks into individual workers. These workers allow us to take fuller use of available memory and processors, including those on other machines.\n", + "\n", + "Dask is solving the needs to:\n", + "\n", + "1. Load more data than fit into memory. You can delay this in either time or space.\n", + " * Delaying in time would be if you running on a memory-limited machine, then Dask will be able to chunk through the work units without simultaneously needing the full amount of memory to hold all of the data at once.\n", + " * Delaying in space means spinning up additional machines. This is often particularly powerful when connecting your front-end machine (e.g., a NERSC JupyterHub job is limited to 42 GB memory), to several full cluster compute nodes (e.g., a NERSC Haswell node is 32 real cores, 128 GB). \n", + "\n", + "2. Distributing work across multiple processors. Python and numpy/scipy are not naturally parallel or easily parallelizable. One of the common things we will do with large datasets is aggregate for both analysis and visualization. Being able to do this aggregation in parallel is a significant gain.\n", + "\n", + "\n", + "### HoloViews\n", + "\n", + "HoloViews is Dask aware and can provide Dask the correct information to build a Task Graph that effectively parallelizes the requisite data loading and computation. HoloViews can use either bokeh or matplotlib backends. If you directly use the matplotlib backend with a Dask DataFrame it will not appropriately parallelize across the workers and instead do lots of stuff in serial. Bokeh also gives some nice interactive capabilities and HoloViews knows how to appropriate set up the linking and call backs to enable coordinated zooming and selection." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Approach.\n", + "\n", + "Our limitation will be available memory and disk I/O. There's a trade-off between them. The more times we're willing to load from disk, the less we have to keep in memory.\n", + "\n", + "1. Load enough data to get an index of `good` objects. \n", + " 1. We need 'good' column and 'magerr_{SNRfilter}' column.\n", + " 2. Also define `star` and `galaxy` indices.\n", + " 3. Persist these indexes.\n", + "2. Go through each sections exploring different columns of the database\n", + " 1. RA, Dec\n", + " 2. color, color\n", + " 3. number density of galaxies\n", + " 4. extendedness\n", + " 5. blendedness\n", + " 6. shape (moments)\n", + " 7. PSF" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "TODO\n", + "1. Plot colorbars for data-shaded plots\n", + "2. Control order of samples in Overlays. Overlay currently alphabetizes in the legends. I would like to learn how to force the order so the match the heights of the contributions, e.g. \"good\", \"galaxy\", \"star\"." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Import Needed Modules" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import math\n", + "import os\n", + "\n", + "import numpy as np\n", + "import pandas as pd\n", + "from scipy.interpolate import interp1d\n", + "\n", + "import astropy.units as u\n", + "import healpy as hp" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import colorcet\n", + "\n", + "import dask\n", + "from dask.distributed import Client\n", + "\n", + "from bokeh.models import HoverTool\n", + "import dask.array as da\n", + "import dask.dataframe as dd\n", + "import datashader as ds\n", + "import holoviews as hv\n", + "from holoviews.operation import histogram\n", + "from holoviews.operation.datashader import datashade, shade, dynspread, rasterize\n", + "from holoviews.plotting.util import process_cmap\n", + "from holoviews.streams import RangeXY" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "hv.extension('bokeh')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "cmap = 'viridis'" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If you set LOCAL_DASK to True and are running on the shared JupyterHub nodes, then select a more limited set of data, likely one Tract would be good. \n", + "There's a 42 GB limit on memory directly in the JupyterHub environment at NERSC, \n", + "which you will hit pretty quickly if you try to read in all of the data.\n", + "\n", + "If you want to run the full data set you'll want two NERSC nodes through the interactive queue." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# LOCAL_DASK = True\n", + "LOCAL_DASK = False" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Start our Dask Cluster\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For simple testing and illustration of how to use dask, holoview, and datashader here you can run locally on just one tract.\n", + "To run on the full set of DR6, you'll need to set up a a set of nodes to support Dask distributed. Basically you just machine that can hold the data in memory on the reader." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Start a local Dask Cluster" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "if LOCAL_DASK:\n", + " client = Client()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Start a Dask Cluster on an Interactive Nodes\n", + "\n", + "So instead, in a separate Terminal on Cori, ask for a pair of Nodes from the `interactive` queue. This generally completes in seconds. We ask for 2 Nodes because we'd like the full ~256 GB of available memory to store the data and the intermediate copies that often get made in some of the plots below:\n", + "\n", + "```\n", + "salloc -N 2 -C haswell --qos=interactive -t 04:00:00 -L SCRATCH\n", + "```\n", + "\n", + "And then once on the first Node, where you'll get put after the `salloc` completes, load the right Python environment.\n", + "```\n", + "python /global/common/software/lsst/common/miniconda/start-kernel-cli.py desc-python-bleed\n", + "```\n", + "\n", + "And then start up the Dask Cluster\n", + "```\n", + "SCHEDULER_FILE=${CSCRATCH}/scheduler.json\n", + "rm -rf ${SCHEDULER_FILE}\n", + "dask-scheduler --scheduler-file ${SCHEDULER_FILE} &\n", + "NUM_WORKERS=16\n", + "export MALLOC_TRIM_THRESHOLD_=131072 # 128*1024\n", + "dask-worker --nprocs ${NUM_WORKERS} --scheduler-file ${SCHEDULER_FILE} --local-directory /tmp &\n", + "```\n", + "\n", + "Then exit the environment and go to the second Node.\n", + "\n", + "```\n", + "python /global/common/software/lsst/common/miniconda/start-kernel-cli.py desc-python-bleed\n", + "```\n", + "\n", + "```\n", + "SCHEDULER_FILE=${CSCRATCH}/scheduler.json\n", + "NUM_WORKERS=16\n", + "export MALLOC_TRIM_THRESHOLD_=131072 # 128*1024\n", + "dask-worker --nprocs ${NUM_WORKERS} --scheduler-file ${SCHEDULER_FILE} --local-directory /tmp &\n", + "```\n", + "\n", + "The nodes will be printed out when the `salloc` launches. And if you forget, you can look them up under the `SLURM_NODELIST` environment variable.\n", + "\n", + "We used `CSCRATCH` above instead of `SCRATCH`, when specifying the location of the scheduler file, because `CSCRATCH` will be consistent across the nodes, whereas `SCRATCH` will be empty on the second compute node (the first one gets some special environment passing through because it's an interactive session)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Memory Management in Dask\n", + "\n", + "The details on how memory is managed is a little subtle and has a big surprise for our use case.\n", + "\n", + "The workers are each a process. Thus they are long-running processes. If you watch the memory usage you will often see it increase over time. This will look like some sort of memory leak or lack of reclaiming memory. However, it's not a memory leak in the program logic. Rather it's the process not returning memory to the kernel because it thinks it might need it again. This is generally expected behavior on unix systems and applications that expect to directly manage their own memory, even if it's surprising in this particular case.\n", + "\n", + "As one general principle, most programs running through `glibc` do not release memory back to the kernel until it's really clearly necessary. The idea is that if they used some memory, then they'll probably want to use that memory again so they'll just keep it. However, our program model here does not match that well because we have long-running processes (the workers) that open large files, just get some of the data, potentially use some temporary arrays to reformat, and move on. This often leads to factors of two (or more) in ununsed memory that's nevertheless still reserved by the worker. The Dask worker management doesn't really know about this and just uses whatever memory usage the OS reports when it asks what the memory used by a given process is. This often leads to workers getting paused or terminated even when it seems like the data actively being used should definitely fit.\n", + "\n", + "For bit more on memory management:\n", + "https://distributed.dask.org/en/latest/worker.html#memory-not-released-back-to-the-os\n", + "\n", + "To change this behavior, we have to explictly change the memory manager paramters. That's what the \n", + "\n", + "```\n", + "export MALLOC_TRIM_THRESHOLD_=131072 # 128*1024\n", + "```\n", + "\n", + "does in the code to start the workers above. It resets this malloc free to the documented default for malloc, because the default setting on NERSC is apparently much larger. This controll of `malloc` works on Linux+glibc, but is not more generally portable." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Expected messages from the Dask Workers\n", + "\n", + "Dask Notes: \n", + " * If you watch the terminal in which you launched the interactive node jobs you will start to see INFO message such as \n", + " ```\n", + " distributed.core - INFO - Event loop was unresponsive in Worker for 5.34s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.\n", + " ``` \n", + " This is expected and normal. The workers are reading data, which can take a noticeable amount of time and the Python process holds the GIL during the read." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We connect to this Dask cluster through a shared agreement on where the `SCHEDULER_FILE` is." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We then configure the dashboard URL to use the JupyterHub proxy service.\n", + "We here set the formatting string template to the correct value.\n", + "Once we actually connect the client, then client can then tell us the full link." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "if not LOCAL_DASK:\n", + " scheduler_file = os.path.join(os.environ[\"SCRATCH\"], \"scheduler.json\")\n", + " dask.config.config[\"distributed\"][\"dashboard\"][\"link\"] = \"{JUPYTERHUB_SERVICE_PREFIX}proxy/{host}:{port}/status\"\n", + " client = Client(scheduler_file=scheduler_file)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "client" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Define Catalog and Subsampling" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We specify the directory path that contains all of the individual Parquet files. If we want to read just one tract, we could pass the full path name of just the tract. If we wanted to specify a set of tracts to run, we could pass in a regex that matches those tracts. Or we could pass in a list that contained the file paths or directory paths to read from." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "desc_data_dir = \"/global/cfs/cdirs/lsst/shared/\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "data_release = \"dr6\"\n", + "\n", + "run_data_dir = f\"DC2-prod/Run2.2i/dpdd/Run2.2i-{data_release}/object_dpdd_only\"\n", + "data_path_dir = os.path.join(desc_data_dir, run_data_dir)\n", + "\n", + "all_tracts_data_path = data_path_dir\n", + "sample_tract_data_path = os.path.join(data_path_dir, \"object_dpdd_tract4854.parquet\")\n", + "\n", + "data_path = all_tracts_data_path\n", + "# data_path = sample_tract_data_path" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(data_path)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Load Data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "filters = ('u', 'g', 'r', 'i', 'z', 'y')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Parquet is a column-based storage format. If we specify the columns we want to load, then we avoid the memory overhead of having to store the other columns.\n", + "Note: In principle Dask should be able to figure out what columns we actually used and only used those. But it doesn't really know what to keep around." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Select good detections:\n", + "# 1. Marked as 'good' in catalog flags.\n", + "# 2. SNR in given band > threshold\n", + "# 3. In defined simulation range\n", + "snr_threshold = 5\n", + "snr_filter = 'i'\n", + "\n", + "# We want to do a SNR cut, but magerr is the thing already calculated\n", + "# So we'll redefine our SNR in terms of magerr\n", + "magerr_cut = (2.5 / np.log(10)) / snr_threshold" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now just load the columns we need to define our `good`, `star`, and `galaxy` indexes." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# columns = ['objectId']\n", + "columns = ['objectId', f'magerr_{snr_filter}', 'good', 'extendedness']" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# df = dd.read_parquet(data_path, columns=columns, engine='pyarrow')\n", + "df = dd.read_parquet(\n", + " data_path,\n", + " columns=columns,\n", + " engine=\"pyarrow\",\n", + " kwargs={\"dataset\": {\"use_legacy_dataset\": False}},\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Each file is its own tract which has its own block of ObjectIDs\n", + "We'd like to use ObjectIDs as the index for future subsetting so we need to set up the index to do that. But we also need to tell Dask that it won't need to do any re-shuffling across partitions." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Dask doesn't know anything about how to divide things up yet:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "df.npartitions" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "scrolled": true, + "tags": [] + }, + "outputs": [], + "source": [ + "df.divisions" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A `set_index` takes about a minute if the data are on disk[*]\n", + "\n", + "But, we can use the results of this `set_index` to run `set_index` for free on any future read of columns from this same set of Parquet files.\n", + "\n", + "[*] Re-runs of this Notebook can be faster because the data that we're asking for may still be in the cache of the OS or network file system." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note that the `set_index` by default removes the \"objectId\" column as an explicit column because it's now the index. We could choose to change this behavior it we want to be able to still say df[\"objectId\"], but it's fine for now. The only place we're really going to use the `objectId` is indeed as an index." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Note that because \"objectId\" gets removed when setting the index, you can't rerun the `set_index` cell.\n", + "df = df.set_index(\"objectId\", sorted=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Dask now knows the objectId index division points of each partition. Note that it doesn't actually track the length, but just knowing the divisions is sufficient to look up things by the index by sending them to the right worker." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "df.npartitions" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "scrolled": true, + "tags": [] + }, + "outputs": [], + "source": [ + "df.divisions" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Calculating the size of the Dask DataFrame is relatively fast because Dask knows to just ask each of the Parquet files how many rows does it have. It doesn't have to fully load it into memory." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "len_df = len(df)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(len_df)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "good = df[\"good\"] & (df[f\"magerr_{snr_filter}\"] < magerr_cut)\n", + "galaxy = good & (df[\"extendedness\"] > 0)\n", + "star = good & (df[\"extendedness\"] == 0)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Each of these is now a Dask Series, which has the same index as the original DataFrame, and a boolean column. Note that they haven't been computed yet." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "good" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "good.npartitions" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Visualize the Task Graph\n", + "\n", + "We need `graphviz` to visualize the Task Graph.\n", + "\n", + "We can then look at the structure of the computations. You can see the read steps at the bottom, and then a `getitem` for each column we're getting: \"good\", \"magerr_snrfilter\", and \"extendedness\" (in left-to-right order in the task graph).\n", + "\n", + "The work is easily divided across the 166 partitions." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "try:\n", + " import graphviz\n", + " star.visualize()\n", + "except ModuleNotFoundError:\n", + " print(\"graphviz not installed.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Persist the data in Dask Cluster Worker memory\n", + "\n", + "Dask actively purges data from memory when its no longer needed by the Dask Task Graph currently doing the computation.\n", + "\n", + "This `good` index will be used everytime we load a new set of columns below so let's persist it. If you're running the `dask.distributed` scheduler[*], then the Jupyter session won't block on executing this cell. It will return Dask `Future` objects that contains the underlying task graph. That task graph will then be executed asynchronously in the background and the results will be saved in distributed memory once they-are computed.\n", + "\n", + "[*] Running the `dask.distributed` scheduler is common, even on a single node. Read more at\n", + "https://docs.dask.org/en/latest/scheduling.html" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "good = good.persist()\n", + "star = star.persist()\n", + "galaxy = galaxy.persist()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The persistence question will come up again as we load the additional columns. Each plot below is its own separate computation. Dask has no way of knowing that it's going to use those data again in the next plot. We thus explicitly tell Dask to persist this data frame so that it's still already loaded when we plot the next thing.\n", + "\n", + "If you don't have the physical memory across your Dask installation (whether local or remote), then skip this persist step. Running each of the plots will require re-reading the data and be a bit slower than if we had memory to keep all of the data, but will work." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The DF length calculation is quick because `len` gets remapped in useful ways in a Dask DataFrame." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(f'Total: {len(df)}')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "But to get the number of good, star, and galaxy objects, we need to use `sum` instead of `len`. These Series have an entry for each row that is set to True or False. So their length isn't the thing we need. Instead it's the `sum` of the values.\n", + "\n", + "But `sum(df)` does not get handled like `len(df)`. It doesn't resolve and call a dunder function, it instead gets applied serially (switching between threads across workers) and take almost a minute.\n", + "\n", + "So we instead need to remember to call `star.sum().compute()` instead of `sum(compute)`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(\n", + " f\"Total: {len(df)}, Good: {good.sum().compute()}, Stars: {star.sum().compute()}, Galaxies: {galaxy.sum().compute()}\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can look at the memory used in the Dask Dashboard\n", + "https://distributed.dask.org/en/latest/worker.html?highlight=unmanaged#using-the-dashboard-to-monitor-memory-usage" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note that `good` hasn't actually been computed yet:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "good" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "del df" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The index should use about 1 byte * 147 million entries = 147 MB, but instead I get 6 GB when calling `memory_usage`, which says it reports the memory usage in bytes:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "good.memory_usage().compute()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Object Density in RA, Dec" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "1. Load 'objectId', 'RA', 'Dec'.\n", + "2. Set index using _already-known_ divisions from our `good` index.\n", + "3. Use `good` index to only show those objects." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "ra_dec_df = dd.read_parquet(\n", + " data_path,\n", + " columns=[\"objectId\", \"ra\", \"dec\"],\n", + " engine=\"pyarrow\",\n", + " kwargs={\"dataset\": {\"use_legacy_dataset\": False}},\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Using the already-calculated `divisions` value from `good` and specifying that things are already sorted makes this `set_index` operation free. *It is critically important to use `sorted` and `divisions`* to avoid expensive shuffle operations." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "ra_dec_df = ra_dec_df.set_index(\"objectId\", sorted=True, divisions=good.divisions)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "DC2 Run 2.x WFD and DDF regions\n", + "https://docs.google.com/document/d/18nNVImxGioQ3tcLFMRr67G_jpOzCIOdar9bjqChueQg/view\n", + "https://github.com/LSSTDESC/DC2_visitList/blob/master/DC2visitGen/notebooks/DC2_Run2_regionCoords_WFD.ipynb\n", + "\n", + "| Location | RA (degrees) | Dec (degrees) | RA (degrees) | Dec (degrees) |\n", + "|:----------------- |:------------ |:------------- |:------------ |:------------- |\n", + "| Region | WFD | WFD | DDF | DDF |\n", + "| Center | 61.856114 | -35.79 | 53.125 | -28.100 |\n", + "| North-East Corner | 71.462228 | -27.25 | 53.764 | -27.533 |\n", + "| North-West Corner | 52.250000 | -27.25 | 52.486 | -27.533 |\n", + "| South-West Corner | 49.917517 | -44.33 | 52.479 | -28.667 |\n", + "| South-East Corner | 73.794710 | -44.33 | 53.771 | -28.667 |\n", + "\n", + "(Note that the order of the rows above is different than in the DC2 papers. The order of the rows above goes around the perimeter in order.)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "dc2_run2x_wfd = [[71.462228, -27.25], [52.250000, -27.25], [49.917517, -44.33], [73.794710, -44.33]]\n", + "dc2_run2x_ddf = [[53.764, -27.533], [52.486, -27.533], [52.479, -28.667], [53.771, -28.667]]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "dc2_run2x_wfd_df = pd.DataFrame({'ra': [coord[0] for coord in dc2_run2x_wfd] + [dc2_run2x_wfd[0][0]],\n", + " 'dec': [coord[1] for coord in dc2_run2x_wfd] + [dc2_run2x_wfd[0][1]]})\n", + "dc2_run2x_ddf_df = pd.DataFrame({'ra': [coord[0] for coord in dc2_run2x_ddf] + [dc2_run2x_ddf[0][0]],\n", + " 'dec': [coord[1] for coord in dc2_run2x_ddf] + [dc2_run2x_ddf[0][1]]})" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def overlay_dc2_region(ra_dec, dc2_run2x_wfd_df=dc2_run2x_wfd_df, dc2_run2x_ddf_df=dc2_run2x_ddf_df):\n", + " # This region isn't quite a polygon. The sides should be curved.\n", + " wfd_region = hv.Path(dc2_run2x_wfd_df).opts(color='red')\n", + " ddf_region = hv.Path(dc2_run2x_ddf_df).opts(color='orange')\n", + " ra_dec = ra_dec * wfd_region * ddf_region\n", + "\n", + " max_delta_ra = dc2_run2x_wfd_df['ra'][3] - dc2_run2x_wfd_df['ra'][2]\n", + " delta_dec = dc2_run2x_wfd_df['dec'][1] - dc2_run2x_wfd_df['dec'][3]\n", + " grow_buffer = 0.05\n", + "\n", + " # Notice that these are specified in increasing RA left->right\n", + " # We rely on the invert_xaxis True above to flip this in the display\n", + " # It's important to get this right because these ranges are used for data selection\n", + " # and then the range is flipped in the display.\n", + " ra_dec.opts(xlim=(dc2_run2x_wfd_df['ra'][2] - max_delta_ra * grow_buffer,\n", + " dc2_run2x_wfd_df['ra'][3] + max_delta_ra * grow_buffer))\n", + " ra_dec.opts(ylim=(dc2_run2x_wfd_df['dec'][3] - delta_dec * grow_buffer,\n", + " dc2_run2x_wfd_df['dec'][1] + delta_dec * grow_buffer))\n", + "\n", + " return ra_dec" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def plot_ra_dec(df, dc2_run2x_wfd_df=dc2_run2x_wfd_df, dc2_run2x_ddf_df=dc2_run2x_ddf_df,\n", + " show_dc2_region=True, cmap=\"bmy\", bins=100, cmin=10):\n", + " \"\"\"Show rasterized RA, Dec object density.\n", + " \n", + " We're just doing this on a rectilinear grid\n", + " The distortion is noticeable from the lowest to highest Dec in the change in density due to the change in area.\"\"\"\n", + " points_ra_dec = hv.Points(df, kdims=[hv.Dimension('ra', soft_range=(dc2_run2x_wfd[2][0], dc2_run2x_wfd[3][0])),\n", + " hv.Dimension('dec', soft_range=(dc2_run2x_wfd[3][1], dc2_run2x_wfd[1][1]))])\n", + " # We have to define the colormap here now, because the opts aren't passed through the datashade->Points.\n", + " # See, e.g., https://github.com/holoviz/holoviews/issues/4125\n", + " ra_dec = datashade(points_ra_dec, cmap=process_cmap(cmap, provider=\"colorcet\"))\n", + " ra_dec = ra_dec.opts(invert_xaxis=True) # Flip to East left\n", + "\n", + " if show_dc2_region:\n", + " ra_dec = overlay_dc2_region(ra_dec, dc2_run2x_wfd_df=dc2_run2x_wfd_df, dc2_run2x_ddf_df=dc2_run2x_ddf_df)\n", + " \n", + " return ra_dec" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "ra_dec = plot_ra_dec(ra_dec_df[good])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "try:\n", + " import graphfiz\n", + " ra_dec_df.visualize()\n", + "except ModuleNotFoundError:\n", + " print(\"graphviz not installed.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This next command is the first one since the length counting above that will actually cause work to happen. Each plot generate triggers a `compute` on the Dask DataFrame." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "ra_dec.opts(width=600, height=500)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The overall object density distribution looks good.\n", + "\n", + "Notes:\n", + "* If you are viewing this through a direct JupyterLab connection (Jupyter Classic Notebook, or separately on your own machine or setup), the plot will re-raster as you zoom in and out. This functionality is not available within the JupyterHub environment. JupyterHub doesn't allow the JavaScript callbacks in the browser back to the server that are necessary to do the re-rastering.\n", + "* We explicitly excluded the tracts that overlap the DDF region (orange square upper-right corner).\n", + "* There are also a few patches that failed within the main region.\n", + "* There is an overall gradient N/S in object density, because we're plotting in rectilinear RA, Dec bins, which means that bins at the bottom in RA cover less area than those at the top. We'll fix this next with an equal-area projection.\n", + "\n", + "See the input visit coverage map here: \n", + "https://github.com/LSSTDESC/ImageProcessingPipelines/issues/97#issuecomment-498303504\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "[This DC2 notebook](../contributed/ra_dec_dc2_run2.2i_dask.ipynb) shows how to do an equal-area Mollweide projection." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "ra_dec_df.memory_usage().compute()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Calculate effective area covered\n", + "\n", + "To compare number densities of objects on the sky, we have to calculate the area covered by each catalog.\n", + "We'll use Healpix through HealPy to pixelate the region and then count of the number of pixels with significant numbers of objects." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def calculate_area(df, threshold=0.25, nside=1024, verbose=False):\n", + " \"\"\"Calculate the area covered by a catalog with 'ra', 'dec'\n", + " \n", + " Parameters:\n", + " --\n", + " cat: DataFrame, dict-like with 'ra', 'dec', keys\n", + " threshold: float\n", + " Fraction of median value required to count a pixel.\n", + " nside: int\n", + " Healpix NSIDE. NSIDE=1024 is ~12 sq arcmin/pixel, NSIDE=4096 is 0.74 sq. arcmin/pixel\n", + " Increasing nside will decrease calculated area as holes become better resolved \n", + " and relative Poisson fluctuations in number counts become more significant.\n", + " verbose: bool\n", + " Print details on nside, number of significant pixels, and area/pixel.\n", + " \n", + " Returns:\n", + " --\n", + " area: Astropy Quantity.\n", + " \"\"\"\n", + " import healpy as hp\n", + "\n", + " # MWV: The following line of code makes me sad, but \\n\",\n", + " # We need to make a matching DataFrame for nside to satisfy conservative\n", + " # Pandas 1.2.4 requirement that all ufuncs have arguments of the same type.\n", + " # `ang2pix` takes `nside`, `ra`, `dec` and so each of those need to be of the same type.\n", + " # That means we need to take our simple int nside and convert it to a Dask Series.\n", + " # We explicitly base it off the df['ra'] so that the partitions are automatically propagated\n", + " # And then set the value with 'nside' and cast to int.\n", + " nside_ds = (nside + 0 * df['ra']).astype(int)\n", + " indices = hp.ang2pix(nside_ds, df['ra'], df['dec'], lonlat=True)\n", + " idx, counts = np.unique(indices, return_counts=True)\n", + " \n", + " # Take the 25% of the median value of the non-zero counts/pixel\n", + " threshold_counts = threshold * np.median(counts)\n", + "\n", + " if verbose:\n", + " print(f'Median {np.median(counts)} objects/pixel')\n", + " print(f'Only count pixels with more than {threshold_counts} objects')\n", + "\n", + " significant_pixels, = np.where(counts > threshold_counts)\n", + " area_pixel = hp.nside2pixarea(nside, degrees=True) * u.deg**2\n", + "\n", + " if verbose:\n", + " print(f'Pixel size ~ {hp.nside2resol(nside, arcmin=True) * u.arcmin:0.2g}')\n", + " print(f'nside: {nside}, area/pixel: {area_pixel:0.4g}, num significant pixels: {len(significant_pixels)}')\n", + "\n", + " area = len(significant_pixels) * area_pixel\n", + "\n", + " if verbose:\n", + " print(f'Total area: {area:0.7g}')\n", + " \n", + " return area" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "area_dc2 = calculate_area(ra_dec_df[galaxy])\n", + "print(f'DC2 Run 2.2i area: {area_dc2:0.2f}')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "area_dc2" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "num_den_dc2 = galaxy.sum().compute() / area_dc2\n", + "\n", + "# Change default expression to 1/arcmin**2\n", + "num_den_dc2 = num_den_dc2.to(1/u.arcmin**2)\n", + "print(num_den_dc2)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "`del` will still leave references around if there are future computations that still might depend on `ra_dec_df`. I believe the HoloViews objects count as something that is dependent on `ra_dec_df`. In contrast, `client.cancel` just removes the data, whether or not anything depends on them. So we use `client.cancel` here. Doing this means that the data to interactively rebin RA, Dec density plot is no longer available. We can still zoom in Bokeh, but there's no longer to recompute the binning." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "client.cancel(ra_dec_df)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If you're watching the worker output in the terminal, you'll see a message like \n", + "```\n", + "distributed.scheduler - INFO - Client Client-2ca1dd27-cf69-11eb-ae6f-44a8422ced0a requests to cancel 0 keys\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Color-Color Diagrams and the Stellar Locus" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "columns = [\"objectId\"]\n", + "columns += [f'mag_{f}' for f in filters]\n", + "columns += [f'magerr_{f}' for f in filters]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "mag_df = dd.read_parquet(\n", + " data_path,\n", + " columns=columns,\n", + " engine=\"pyarrow\",\n", + " kwargs={\"dataset\": {\"use_legacy_dataset\": False}},\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "mag_df = mag_df.set_index(\"objectId\", sorted=True, divisions=good.divisions)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Convert mag floats to 32-bit -- that's plenty\n", + "convert_cols = {}\n", + "for f in filters:\n", + " convert_cols[f\"mag_{f}\"] = np.float32 \n", + " convert_cols[f\"magerr_{f}\"] = np.float32\n", + " \n", + "mag_df = mag_df.astype(convert_cols)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Persist to make the plots below zoomable and re-binnable.\n", + "mag_df = mag_df.persist()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Define color columns\n", + "# These should inherit the 32-bit float type\n", + "mag_df['u-g'] = mag_df['mag_u'] - mag_df['mag_g']\n", + "mag_df['g-r'] = mag_df['mag_g'] - mag_df['mag_r']\n", + "mag_df['r-i'] = mag_df['mag_r'] - mag_df['mag_i']\n", + "mag_df['i-z'] = mag_df['mag_i'] - mag_df['mag_z']\n", + "mag_df['z-y'] = mag_df['mag_z'] - mag_df['mag_y']" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "mag_df" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# We refer to a file over in `tutorials/assets' for the stellar locus\n", + "datafile_davenport = '../tutorials/assets/Davenport_2014_MNRAS_440_3430_table1.txt'\n", + "\n", + "def get_stellar_locus_davenport(color1='gmr', color2='rmi',\n", + " datafile=datafile_davenport):\n", + " color1 = color1.replace('-', 'm')\n", + " color2 = color2.replace('-', 'm')\n", + "\n", + " data = pd.read_table(datafile, sep='\\s+', header=1)\n", + " return data[color1], data[color2]\n", + "\n", + " \n", + "def plot_stellar_locus(color1='gmr', color2='rmi',\n", + " color='blue', line_dash='dashed', line_width=2.5,\n", + " ax=None):\n", + "\n", + " color1_m = color1.replace('-', 'm')\n", + " color2_m = color2.replace('-', 'm')\n", + "\n", + " model_color1, model_color2 = get_stellar_locus_davenport(color1_m, color2_m)\n", + " model_df = pd.DataFrame({color1: model_color1, color2: model_color2})\n", + " stellar_locus = hv.Path(model_df).opts(color='blue', line_dash=line_dash, line_width=line_width)\n", + " \n", + " return stellar_locus " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def plot_color_color(df, color1, color2, \n", + " range1=(-1, +2), range2=(-1, +2),\n", + " cmin=10, cmap='rainbow',\n", + " vmin=None, vmax=None):\n", + " \"\"\"Plot a color-color diagram. Overlay stellar locus\"\"\"\n", + " band1, band2 = color1[0], color1[-1]\n", + " band3, band4 = color2[0], color2[-1]\n", + "\n", + " clean = df[np.isfinite(df[color1]) & np.isfinite(df[color2])]\n", + " points_color1_color2 = hv.Points(\n", + " clean,\n", + " kdims=[\n", + " hv.Dimension(color1, range=range1),\n", + " hv.Dimension(color2, range=range2)]\n", + " )\n", + "\n", + " color1_color2 = datashade(points_color1_color2, cmap=process_cmap(cmap, provider='colorcet'))\n", + "\n", + " try:\n", + " stellar_locus = plot_stellar_locus(color1, color2)\n", + " color1_color2 = color1_color2 * stellar_locus\n", + " except KeyError as e:\n", + " print(f\"Couldn't plot Stellar Locus model for {color1}, {color2}\")\n", + " \n", + " return color1_color2" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def plot_four_color_color(df, vmin=0, vmax=50000):\n", + " layout = hv.Layout(\n", + " plot_color_color(df, 'g-r', 'u-g') + \\\n", + " plot_color_color(df, 'g-r', 'r-i') + \\\n", + " plot_color_color(df, 'g-r', 'i-z') + \\\n", + " plot_color_color(df, 'g-r', 'z-y'))\n", + " \n", + " layout = layout.cols(2)\n", + " \n", + " return layout" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "plot_four_color_color(mag_df[good])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can see the stellar locus, and the galaxies (particularly clear in the r-i vs. g-r plot). The Davenport model traces out a different path in color-color space than the DC2 stars -- this is likely just due to something simple like a filter conversion I (MWV) failed to do." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note that the above panels will zoom in `g-r` together because HoloViews knows that they share this data column. They don't zoom \"together\" in the y-axes because those columns are not shared between the plots.\n", + "\n", + "The plots each re-raster as you zoom in and out.\n", + "\n", + "There is no brushing (selection) and linking." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "plot_four_color_color(mag_df[star])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "When we look at just the unresolved objects (\"stars\"), the stellar locus comes out more clearly. The obvious galaxies are no longer visible, but there's still a significant background of unresolved objects.\n", + "\n", + "The discrete islands in the data for stellar color-color plot -- most visible in `r-i` vs. `g-r` at g-r ~= 1.2 mag -- are due to the finite set of stellar models used for simulating M dwarfs." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If we restrict to looking at the bright stars, things become more clear.\n", + "\n", + "The SNR requirement both does a significantly better job restricting to point sources, but also is effectively a magnitude cut as well. And there are more stars at bright magnitudes than fainter." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "bright_magerr_threshold = 0.05\n", + "bright_star = star & (mag_df[\"magerr_r\"] < bright_magerr_threshold)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "plot_four_color_color(mag_df[bright_star])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "------\n", + "Let's plot the galaxies on the same color-color plots\n", + "\n", + "Clearly one doesn't expect the galaxies to follow the stellar locus. But including the stellar locus lines makes it easy to guide the eye between the stars-only and the galaxies-only plots. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "plot_four_color_color(mag_df[galaxy])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The galaxies that appear to follow the stellar locus in r-i vs. g-r are the red sequence galaxies." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Questions for further study:\n", + " 1. Is there a better comparison sample for the stellar locus than the Davenport reference?\n", + " 2. Why is the stellar locus in the Davenport 0.1--0.2 mag redder for the reddest stars than the observed data. Are there different extinction assumptions (this should be a low-extinction region). Are there different bandpasses used?" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 1D Density Plots" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def plot_data_curve(data, bins, norm=None, line_color=None, line_dash=None):\n", + " \"\"\"\n", + " Parameters\n", + " ----------\n", + " data : np.array like\n", + " bins : array of bins\n", + " norm : Normalize frequencies by this factor. E.g., divide by area to get counts/area.\n", + " line_color : Valid cmap for hv.Curve(line_color=)\n", + " line_dash : Valid line dash style for hv.Curve(line_dash=)\n", + " \"\"\"\n", + " frequencies, edges = da.histogram(data, bins=bins)\n", + " if norm is not None:\n", + " frequencies /= norm\n", + "\n", + " centers = (edges[:-1] + edges[1:]) / 2\n", + " step_edges = da.dstack((edges[:-1], edges[1:])).flatten()\n", + " step_frequencies = da.dstack((frequencies, frequencies)).flatten()\n", + " \n", + " curve = hv.Curve((centers, frequencies)).opts(interpolation='steps-mid')\n", + "\n", + " if line_color is not None:\n", + " curve.opts(line_color=line_color)\n", + " if line_dash is not None:\n", + " curve.opts(line_dash=line_dash)\n", + "\n", + " return curve" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def plot_density_curve(df, data_col,\n", + " bins=None, area=None, **kwargs):\n", + " \"\"\"\n", + " area: in square arcminutes\n", + " \"\"\"\n", + " curve = plot_data_curve(df[data_col], bins=bins, norm=area, **kwargs)\n", + "\n", + " # Assume equally bin sizes\n", + " bin_size = bins[1] - bins[0]\n", + " if area is not None:\n", + " ylabel = f\"Objects/square arcmin/{bin_size:0.2} mag bin\"\n", + " else:\n", + " ylabel = \"Objects/bin\"\n", + " \n", + " curve.opts(xlabel=data_col, ylabel=ylabel)\n", + "\n", + " return curve \n", + "\n", + "\n", + "def plot_mag_densities(df, good, star, galaxy, filt,\n", + " area=None,\n", + " log=False, range=(16, 32), bins=None,\n", + " legend_position='top_left'):\n", + " if bins is None:\n", + " bins = np.linspace(*range, 160)\n", + " \n", + " data_col = f'mag_{filt}'\n", + " densities = {}\n", + " for idx, name, color in zip((good, star, galaxy), ('good', 'star', 'galaxy'), ('green', 'blue', 'red')):\n", + " curve = plot_density_curve(df[idx], data_col, bins=bins, area=area, line_color=color)\n", + " densities[name] = curve\n", + "\n", + " overlay = hv.NdOverlay(densities, kdims='Object Type')\n", + " overlay.group = filt\n", + " overlay.opts(show_legend=True, legend_position=legend_position)\n", + " if log:\n", + " overlay.opts(logy=True)\n", + " \n", + " return overlay\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "density_plots = [\n", + " plot_mag_densities(mag_df, good, star, galaxy, filt, area=area_dc2.to_value(u.arcmin**2))\n", + " for filt in filters\n", + "]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "for m in density_plots[1:]:\n", + " m.opts(show_legend=False)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "mag_densities = hv.Layout(density_plots)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "mag_densities.cols(3)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The sharp cut in i-band is because that was the reference band for most detections. The distributions in the other bands extend to 28th mag because many of the forced-photometry measurements are consistent with 0 and our S/N cut above was on i-band flux." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note:\n", + "* The above set of plots zoom together in plot range. This is not based in any of the data values, just the plots being displayed and zooming together in the numerical value of the x axis." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Magnitude Error vs. Magnitude" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The magnitude uncertainties come directly from the poisson estimates of the flux measurements. By construction they will follow smooth curves. We here confirm that they do." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def plot_mag_magerr(df, band, ax=None, range=(16, 28), magerr_limit=0.25, vmin=100,\n", + " cmap=\"rainbow\", snr_magerr_threshold=magerr_cut):\n", + " mag_col, magerr_col = f'mag_{band}', f'magerr_{band}'\n", + " points_mag_magerr = hv.Points(df, kdims=[hv.Dimension(mag_col, range=(14, 28)),\n", + " hv.Dimension(magerr_col, range=(0, snr_magerr_threshold))])\n", + " return datashade(points_mag_magerr, cmap=process_cmap(cmap, provider='colorcet'))\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "mag_magerr = hv.Layout([plot_mag_magerr(mag_df[good], filt) for filt in filters])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "mag_magerr.cols(3)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We're next going to focus on some other columns now. To keep memory usage lower, we will delete the `mag_df` and reload it with just the columns we need." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We use `client.cancel` instead of `del` to delete `mag_df`.\n", + "\n", + "Note this will make the `datashader` plots above no longer work for interactive re-binning." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "client.cancel(mag_df)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Blendedness, Extendedness, and mag_cModel" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "columns = [\"objectId\"]\n", + "columns += [\"mag_g\", \"mag_r\", \"mag_i\"]\n", + "columns += [\"blendedness\", \"extendedness\"]\n", + "columns += [f\"mag_{f}_cModel\" for f in filters]\n", + "columns += [f\"magerr_{f}_cModel\" for f in filters]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "extended_df = dd.read_parquet(\n", + " data_path,\n", + " columns=columns,\n", + " engine=\"pyarrow\",\n", + " kwargs={\"dataset\": {\"use_legacy_dataset\": False}},\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "extended_df = extended_df.set_index(\"objectId\", sorted=True, divisions=good.divisions)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Convert all of our columns to 32-bit -- that's plenty for everything in the Object Table (except RA, Dec)\n", + "convert_cols = {col: np.float32 for col in extended_df.columns}\n", + "extended_df = extended_df.astype(convert_cols)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We're carrying a few more columns around so the following is going to use up memory, but both (1) the first run performance and (2) the interactive performance are better if we do it now.\n", + "\n", + "1. To be specific, while a column-based store like Parquet is indeed great at reading just one column, it's still faster to read two columns at a time when going through the files than to read one column, and then read a second column in another pass.\n", + "2. The interactive usage makes repeated calls to fetch the data. This gets really slow." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "extended_df = extended_df.persist()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "extended_df" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We put this trivial color calculation after the persist:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Define color columns\n", + "# These should inherit the 32-bit float type\n", + "# extended_df['u-g'] = extended_df['mag_u'] - extended_df['mag_g']\n", + "extended_df['g-r'] = extended_df['mag_g'] - extended_df['mag_r']\n", + "# extended_df['r-i'] = extended_df['mag_r'] - extended_df['mag_i']\n", + "# extended_df['i-z'] = extended_df['mag_i'] - extended_df['mag_z']\n", + "# extended_df['z-y'] = extended_df['mag_z'] - extended_df['mag_y']" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Blendedness\n", + "\n", + "Blendedness is a measure of how much the identified flux from an object is affected by overlapping from other objects.\n", + "\n", + "See Bosch et al., 2018, Section 4.9.11." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Questions for futher study:\n", + "1. What do negative blendedness scores mean?\n", + "2. What happened to yield non-finite blendedness measurements?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "blendedness = datashade(\n", + " hv.Points(extended_df[good], kdims=[\"mag_i\", \"blendedness\"]),\n", + " cmap=process_cmap(\"rainbow\", provider=\"colorcet\"),\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "blendedness.opts(xlim=(15, 26.5), ylim=(-0.5, 1))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Extendedness\n", + " \n", + "Extendedness is essentially star/galaxy separation based purely on morphology in the main detected reference band (which is `i` for most Objects).\n", + "\n", + "Extendedness a binary property in the catalog, so it's either 0 or 1." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "extendedness = datashade(\n", + " hv.Points(extended_df[good], kdims=[\"mag_i\", \"extendedness\"]),\n", + " cmap=process_cmap(\"rainbow\", provider=\"colorcet\"),\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "extendedness.opts(ylim=(-0.1, +1.1)) * hv.Text(18, 0.9, \"Galaxies\") * hv.Text(18, 0.1, \"Stars\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Extendedness is based off of the difference between the point-source model and extended model brightness. Specifically objects with `mag_psf - mag_cmodel > 0.164` mag are labeled with `extendedness=1` (i.e., galaxies).\n", + "\n", + "As galaxies get smaller in angular size and lower in signal-to-noise ratio, it becomes harder to clearly distinguish stars from galaxies.\n", + "\n", + "See Bosch et al. 2018, Section 4.9.10 for details." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "extendedness_delta_mag_cut = 0.0164\n", + "psf_cModel_mag_cut = hv.VLine(\n", + " extendedness_delta_mag_cut,\n", + " label=rf\"{extendedness_delta_mag_cut:0.4f} $\\Delta$mag cut\",\n", + ")\n", + "psf_cModel_mag_cut = psf_cModel_mag_cut.opts(color=\"green\", line_dash=\"dashed\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def plot_delta_mag_cModel(df, filt, bins=None):\n", + " if bins is None:\n", + " bins = np.linspace(-0.1, 0.1, 201)\n", + " frequencies, edges = da.histogram(\n", + " df[f\"mag_{filt}\"] - df[f\"mag_{filt}_cModel\"], bins=bins\n", + " )\n", + " return hv.Curve((edges[:-1], frequencies)).opts(interpolation=\"steps-pre\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "filt = \"i\"\n", + "delta_mag_cModel_hists = {\n", + " \"good\": plot_delta_mag_cModel(extended_df[good], filt),\n", + " \"star\": plot_delta_mag_cModel(extended_df[star], filt),\n", + " \"galaxy\": plot_delta_mag_cModel(extended_df[galaxy], filt),\n", + "}" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "delta_mag_cModel = hv.NdOverlay(delta_mag_cModel_hists, kdims=\"Sample\") " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "delta_mag_cModel.opts(width=600, xlabel='mag_i[_psf] - mag_i_CModel', ylabel='Objects/bin') \\\n", + " * psf_cModel_mag_cut \\\n", + " * hv.Text(-0.05, 4000, \"Stars\") * hv.Text(0.05, 4000, \"Galaxies\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Objects that are noticeably brighter in mag_i_CModel than mag_i (which is PSF mag) have clearly more flux going out and are interpreted as galaxies. The classification of extendedness 1: \"galaxy\" vs. extendedness 0: \"star\" is based on this difference." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "extended_df[\"delta_mag_cModel_i\"] = extended_df[\"mag_i\"] - extended_df[\"mag_i_cModel\"]\n", + "clean = (\n", + " good \n", + " & (-2.5 < extended_df[\"g-r\"])\n", + " & (extended_df[\"g-r\"] < 4)\n", + " & da.isfinite(extended_df[\"delta_mag_cModel_i\"])\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "psf_cModel_mag_cut = hv.HLine(extendedness_delta_mag_cut,\n", + " label=rf\"{extendedness_delta_mag_cut:0.4f} $\\Delta$mag cut\")\n", + "psf_cModel_mag_cut = psf_cModel_mag_cut.opts(color='green', line_dash=\"dashed\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "points = hv.Points(extended_df[clean], kdims=['mag_i', 'delta_mag_cModel_i'])\n", + "points = points.opts(xlabel='mag_i[_psf] - mag_cModel_i')\n", + "\n", + "yhist = points.hist(dimension='delta_mag_cModel_i', adjoin=False)\n", + "xhist = points.hist(dimension='mag_i', adjoin=False)\n", + "\n", + "shaded_points = datashade(points, cmap=process_cmap(\"rainbow\", provider=\"colorcet\"))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "points_color = hv.Points(extended_df[clean], kdims=['g-r', 'delta_mag_cModel_i'])\n", + "points_color_xhist = points_color.hist(dimension='g-r', dynamic=True, adjoin=False)\n", + "\n", + "shaded_points_color = datashade(points_color, cmap=process_cmap(\"rainbow\", provider=\"colorcet\"))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "composite = (shaded_points_color << yhist << points_color_xhist) \\\n", + " + (shaded_points * psf_cModel_mag_cut << yhist << xhist)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "composite" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can zoom in a little to see how the fixed 0.0164 mag cut works at the low SNR limit. Specifically at mag 24, we're starting to run out of stars and most things are galaxies. But that's a population prior, it's not something visible using just morphology information.\n", + "\n", + "You can see the effect of lower SNR measurements as the horizontal line at $\\Delta$mag=0 puff up due to increased uncertainties.\n", + "\n", + "TODO: \n", + "1. I don't know how to construct an AdjointLayout without a \"right\" element. So there's an extra duplicate \"delta_mag_cModel_i\" histogram that's not really helpful or projected correctly." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now we drop this dataframe to free up memory on the Dask cluster for analyses of other columns" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "client.cancel(extended_df)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Shape Parameters\n", + "\n", + "Ixx, Iyy, Ixy" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "columns = [\"objectId\"]\n", + "\n", + "columns += [f\"Ixx_pixel_{f}\" for f in filters]\n", + "columns += [f\"Ixy_pixel_{f}\" for f in filters]\n", + "columns += [f\"Iyy_pixel_{f}\" for f in filters]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "moment_df = dd.read_parquet(\n", + " data_path,\n", + " columns=columns,\n", + " engine=\"pyarrow\",\n", + " kwargs={\"dataset\": {\"use_legacy_dataset\": False}},\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "moment_df = moment_df.set_index(\"objectId\", sorted=True, divisions=good.divisions)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Convert all of our columns to 32-bit -- that's plenty for everything in the Object Table (except RA, Dec)\n", + "convert_cols = {col: np.float32 for col in moment_df.columns}\n", + "moment_df = moment_df.astype(convert_cols)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "moment_df = moment_df.persist()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def plot_moments_for_filter(df, good, star, galaxy, filt,\n", + " names=['good', 'star', 'galaxy'],\n", + " colors=['blue', 'orange', 'green']):\n", + " curve_kwargs = {'color': colors, 'log': True,\n", + " 'range': (0, 50)}\n", + "\n", + " bins = np.logspace(-1, 1.5, 100)\n", + " moment_lines = {}\n", + " for prefix, ls in (('Ixx', 'solid'), ('Iyy', 'dashed'), ('Ixy', 'dotted')):\n", + " field = f'{prefix}_pixel_{filt}'\n", + " for idx, name, color in zip((good, star, galaxy), names, colors):\n", + " label = f'{prefix} {name}'\n", + " line = plot_data_curve(df.loc[idx, field], bins=bins, line_color=color, line_dash=ls)\n", + " moment_lines[label] = line\n", + "\n", + " moments_plot = hv.NdOverlay(moment_lines, kdims=\"Moments\")\n", + " moments_plot.opts(xlabel=f'{filt} Moments: Ixx, Iyy, Ixy [pixels^2]')\n", + " moments_plot.opts(ylabel='objects / bin')\n", + " \n", + " return moments_plot" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "moment_df" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "moment_plots = [plot_moments_for_filter(moment_df, good, star, galaxy, filt) for filt in filters]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "### Only include legend on the first plot\n", + "for m in moment_plots[1:]:\n", + " m.opts(show_legend=False)\n", + "\n", + "ymin = 100\n", + "for m in moment_plots:\n", + " m.opts(logy=True, ylim=(ymin, None))\n", + "\n", + "moments = hv.Layout(moment_plots)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "moments.cols(3)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "TODO:\n", + "1. Shift to a legend that's outside the grid of plots\n", + "\n", + "The stars (orange) are concentrated at low values of the source moments.\n", + "\n", + "Would be interesting to\n", + "1. Look by magnitude or SNR to understand the longer tail. Are these galaxies mis-classified as stars, or are these noise sources?\n", + "2. Distribution of ellipticity (see validate_drp to type this right)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Ellipticity\n", + "\n", + "### Define ellipticity.\n", + "\n", + "First we compute the ellipticities. This will create 18 more columns, so we didn't generate this until we needed it now. We will have to re-define `star` and `galaxy` if we want those arrays to have these new columns." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def ellipticity_e_e1_e2(df, Ixx, Ixy, Iyy):\n", + " \"\"\"Calculate ellipticity from second moments from a dataframe.\n", + "\n", + " Returns e, e1, and e2. Note that e1 and e2 are just the real and imaginary parts of e.\n", + "\n", + " Parameters\n", + " ----------\n", + " df : DataFrame\n", + " Ixx : column name\n", + " Ixy : column name\n", + " Iyy : column name\n", + "\n", + " Returns\n", + " -------\n", + " e, e1, e2 : (float, float, float) or (numpy.array, numpy.array, numpy.array)\n", + " Complex ellipticity, real component, imaginary component\n", + " \n", + " Copied from https://github.com/lsst/validate_drp/python/lsst/validate/drp/util.py\n", + " \"\"\"\n", + " e = (df[Ixx] - df[Iyy] + 2j*df[Ixy] ) / (df[Ixx] + df[Iyy] + 2*da.sqrt(df[Ixx]*df[Iyy] - df[Ixy]**2))\n", + " e1 = da.real(e)\n", + " e2 = da.imag(e)\n", + " return e, e1, e2" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def ellipticity(df, Ixx, Ixy, Iyy):\n", + " \"\"\"Calculate ellipticity from second moments from a dataframe.\n", + "\n", + " Just returns the complex e.\n", + "\n", + " Parameters\n", + " ----------\n", + " df : DataFrame\n", + " Ixx : column name\n", + " Ixy : column name\n", + " Iyy : column name\n", + "\n", + " Returns\n", + " -------\n", + " e, e1, e2 : (float, float, float) or (numpy.array, numpy.array, numpy.array)\n", + " Complex ellipticity, real component, imaginary component\n", + " \n", + " Copied from https://github.com/lsst/validate_drp/python/lsst/validate/drp/util.py\n", + " \"\"\"\n", + " e = (df[Ixx] - df[Iyy] + 2j*df[Ixy] ) / (df[Ixx] + df[Iyy] + 2*da.sqrt(df[Ixx]*df[Iyy] - df[Ixy]**2))\n", + " \n", + " return e" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def make_ellipticity_df(df):\n", + " for filt in filters:\n", + " Ixx, Ixy, Iyy = f\"Ixx_pixel_{filt}\", f\"Ixy_{filt}\", f\"Iyy_{filt}\"\n", + " e, e1, e2 = ellipticity(df, Ixx, Ixy, Iyy)\n", + "\n", + " # 1. Is this really the right way to do this?\n", + " # 2. Is this an efficient way to do this?\n", + " df[f\"e_{filt}\"] = e\n", + " df[f\"e1_{filt}\"] = e1\n", + " df[f\"e2_{filt}\"] = e2\n", + " \n", + " return df" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def calc_ellipticity_df(df):\n", + " for filt in filters:\n", + " Ixx, Ixy, Iyy = f\"Ixx_pixel_{filt}\", f\"Ixy_pixel_{filt}\", f\"Iyy_pixel_{filt}\"\n", + " df[f\"e_{filt}\"] = ellipticity(df, Ixx, Ixy, Iyy)\n", + "\n", + " # 1. Is this really the right way to do this?\n", + " # 2. Is this an efficient way to do this?\n", + "# e_df[f\"e_{filt}\"] = e\n", + "# e_df[f\"e1_{filt}\"] = e1\n", + "# e_df[f\"e2_{filt}\"] = e2\n", + " \n", + " return df" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "moment_df = calc_ellipticity_df(moment_df[good])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Persist because we're going to use these in a couple of different ways below.\n", + "This `persist` triggers the ellipticity calculation above to compute and the subsequent merge with `good`.\n", + "This takes about 10-15 seconds on 2 NERSC Haswell nodes." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "moment_df = moment_df.persist()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def plot_ellipticities_for_filter(df, good, star, galaxy, filt,\n", + " names=['good', 'star', 'galaxy'],\n", + " colors=['blue', 'orange', 'green']):\n", + " hist_kwargs = {'color': colors, 'log': True, 'range': (0, 50)}\n", + "\n", + " bins = np.linspace(0, 1, 51)\n", + " ellipticity_lines = {}\n", + " for idx, name, color in zip((good, star, galaxy), names, colors):\n", + " field = f'e_{filt}'\n", + "\n", + " label = f'e {name}'\n", + " line = plot_data_curve(df.loc[idx, field], bins=bins, line_color=color, line_dash='solid')\n", + " ellipticity_lines[label] = line\n", + " \n", + " label = f'e1 {name}'\n", + " line = plot_data_curve(da.real(df.loc[idx, field]), bins=bins, line_color=color, line_dash='dashed')\n", + " ellipticity_lines[label] = line\n", + "\n", + " label = f'e2 {name}'\n", + " line = plot_data_curve(da.imag(df.loc[idx, field]), bins=bins, line_color=color, line_dash='dotted')\n", + " ellipticity_lines[label] = line\n", + "\n", + " ellipticities_plot = hv.NdOverlay(ellipticity_lines, kdims=\"Ellipticities\")\n", + " ellipticities_plot.opts(xlabel=f'{filt} ellipticity: e, e1, e2 [pixels^2]')\n", + " ellipticities_plot.opts(ylabel='objects / bin')\n", + "\n", + " return ellipticities_plot " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "ellipticity_plots = [plot_ellipticities_for_filter(moment_df, good, star, galaxy, filt) for filt in filters]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "### Only include legend on the first plot\n", + "for m in ellipticity_plots[1:]:\n", + " m.opts(show_legend=False)\n", + "\n", + "ellipticities = hv.Layout(ellipticity_plots)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "ellipticities.opts(hv.opts.Curve(xlim=(0, 1), ylim=(100, None), logy=True)).cols(3)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "client.cancel(moment_df)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### FWHM of the PSF\n", + "At the location of the catalog objects.\n", + "\n", + "The Object Table stores the shape parameters of the PSF model as evaluated at the location of the object.\n", + "\n", + "This is not the same as, but is certainly related to, the distribution of effective seeing in the individual images that made up the coadd." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "columns = [\"objectId\"]\n", + "\n", + "columns += [f\"psf_fwhm_{f}\" for f in filters]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "psf_df = dd.read_parquet(\n", + " data_path,\n", + " columns=columns,\n", + " engine=\"pyarrow\",\n", + " kwargs={\"dataset\": {\"use_legacy_dataset\": False}},\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "psf_df = psf_df.set_index(\"objectId\", sorted=True, divisions=good.divisions)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Convert all of our columns to 32-bit -- that's plenty for everything in the Object Table (except RA, Dec)\n", + "convert_cols = {col: np.float32 for col in psf_df.columns}\n", + "psf_df = psf_df.astype(convert_cols)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "psf_df = psf_df.persist()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def plot_psf_fwhm(df, filt, bins=None, density=True):\n", + " psf_col = f\"psf_fwhm_{filt}\"\n", + " curve = plot_data_curve(df[psf_col], bins=bins, norm=len(df))\n", + " curve.opts(xlabel=psf_col)\n", + " return curve\n", + "\n", + "\n", + "def plot_psf_fwhm_for_filters(\n", + " df,\n", + " filters=filters,\n", + " bins=None,\n", + " density=True,\n", + " xlim=(0.5, 1.4),\n", + " alpha=0.5,\n", + " colors=(\"purple\", \"blue\", \"green\", \"orange\", \"red\", \"brown\"),\n", + "):\n", + " if bins is None:\n", + " bins = np.linspace(0, 1.5, 201)\n", + "\n", + " fwhm_histograms = {}\n", + " for filt, color in zip(filters, colors):\n", + " curve = plot_psf_fwhm(df, filt, bins=bins, density=density)\n", + " curve.opts(line_color=color)\n", + " fwhm_histograms[filt] = curve\n", + "\n", + " fwhm = hv.NdOverlay(fwhm_histograms, kdims=\"Filter\")\n", + " if density:\n", + " ylabel = \"Density [Normalize to sum=1]\"\n", + " else:\n", + " ylabel = \"Objects / bin\"\n", + " fwhm.opts(xlabel=\"Model PSF FWHM [arcsec]\")\n", + " fwhm.opts(ylabel=ylabel)\n", + " fwhm.opts(xlim=xlim)\n", + "\n", + " return fwhm" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "plot_psf_fwhm_for_filters(psf_df[good]).opts(width=600, xlim=(0.5, 1.4))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The y-band observations have a wider PSF FWHM because we didn't focus the simulated telescope properly in the simulated images. The telescope focus was set to the surface of the silicon. But the conversion happens deeper, at increasing depth with increasing wavelength. In addition, we did not correctly refract the photons in the silicon of the CCD, so the effective focal plane was even deeper in the silicon than it should have been." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "client.cancel(psf_df)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "desc-python-bleed", + "language": "python", + "name": "desc-python-bleed" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.10" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +}