diff --git a/README.md b/README.md index 67a4f27..a48870c 100644 --- a/README.md +++ b/README.md @@ -29,7 +29,7 @@ Dependency notes: 2. If you'd like to train the model on GPU, make sure you install TensorFlow with GPU setup in your environment. More information can be found [here](https://www.tensorflow.org/install/pip#:~:text=4.-,Install%20TensorFlow,-TensorFlow%20requires%20a). -3. Currently waiting on `TesorFlow` to support Python 3.13 before we can support it as well. +3. Currently waiting on `TensorFlow` to support Python 3.13 before we can support it as well. ### Install Required Dependencies Only For clean dependency management, use a virtual environment or a fresh Conda environment. @@ -53,12 +53,32 @@ To uninstall the package, run: pip uninstall das_anomaly ``` -## Instructions -The main steps for using the package are as follows: +## Autoencoder Model Architecture + +The package implements a convolutional autoencoder designed to compress and reconstruct power spectral density (PSD) inputs. + +- **Encoder:** A lightweight convolutional neural network reduces the input dimensionality, mapping it into a compact latent space. +- **Decoder:** A symmetric decoder reconstructs the data by upsampling the latent representation back to the original resolution. + +

+ Autoencoder architecture +

+ +--- + +## Usage Workflow + +The overall workflow for using the package is illustrated below: + +

+ Workflow flowchart +

+ +The main steps are: 1. Define constants and create a Spool of data: -Using the _config_user_ script in the das_anomaly directory, define the constants and directory paths for data, power spectral density (PSD) images, detected anomaly results, etc. You would complete adding the values as you go over the steps mentioned below. -Then, using DASCore, create an index file for the [spool](https://dascore.org/tutorial/spool.html) of data first time reading the DAS data directory: +Using the _config_user_ script in the das_anomaly directory, define the constants and directory paths for data, power spectral density (PSD) images, detected anomaly results, etc. You would complete adding the values as you go over the steps mentioned below. Then, using DASCore, create an index file for the [spool](https://dascore.org/tutorial/spool.html) of data first time reading the DAS data directory: + ### Example ```python import dascore as dc @@ -66,13 +86,14 @@ from das_anomaly.settings import SETTINGS data_path = SETTINGS.DATA_PATH -# Update will create an index of the contents for fast querying/access +# Update will create an index of the contents for fast querying/access. No need to apply update() in future. spool = dc.spool(directory_path).update() ``` Note: Creating the spool for the first time may take some time if your directory contains hundreds of gigabytes or terabytes of DAS data. However, DASCore creates an index file, allowing it to quickly query the directory on subsequent accesses. 2. Set a consistent upper bound for PSD amplitude values: -To ensure all PSD images share the same colorbar scale, determine an appropriate CLIP_VALUE_MAX in the _config_user_ script. This can be done using the `get_psd_max_clip` function, which computes the mean value of maximum amplitude from TIME_WINDOWs of the data which does not include drastic anomalies (therefore, a quick exploratory data analysis is needed here.) + +To ensure all PSD images share the same colorbar scale (in RGB), determine an appropriate CLIP_VALUE_MAX in the _config_user_ input file. This can be done using the `get_psd_max_clip` function, which computes the mean value of maximum amplitude from TIME_WINDOWs of the data which does not include obvious anomalies (therefore, a quick exploratory data analysis is needed here.) ### Example ```python from das_anomaly.psd import PSDConfig, PSDGenerator @@ -82,11 +103,12 @@ from das_anomaly.settings import SETTINGS bn_data_path = SETTINGS.BN_DATA_PATH cfg = PSDConfig(data_path=bn_data_path) gen = PSDGenerator(cfg) -percentile = 90 +percentile = 90 # data dependent clip_val = gen.run_get_psd_val(percentile=percentile) print(f"Mean {percentile}-percentile amplitude across all patches: {clip_val:.3e}") ``` 3. Generate PSD plots: + Use the `das_anomaly.psd` module and create PSD plots in RGB format and in plain mode (with no axes or colorbar). The `das_anomaly.psd.PSDGenerator reads DAS data, creates a spool using DASCore library, applies a detrend function to each patch of the chunked spool, and then average the energy over a desired time window and stack all channels together to create a spatial PSD with channels on the X-axis and frequency on the Y-axis. You can use MPI to distribute reading data and plotting PSDs over CPUs. ### Example ```python @@ -110,10 +132,12 @@ PSDGenerator(cfg).run() PSDGenerator(cfg).run_parallel() ``` 4. Select and copy known anomaly PSD plots: -From the generated PSD plots, identify and copy examples of known anomalies to the ANOMALY_IMAGES_PATH specified in the _config_user_ script. These anomalies can include events such as earthquakes from an existing catalog, seismic activity, instrument noise, anthropogenic disturbances, etc. Including these examples helps improve the accuracy of thresholding during the anomaly detection process. + +From the generated PSD plots, identify and copy examples of known anomalies to the ANOMALY_IMAGES_PATH specified in the _config_user_ input script. These anomalies can include events such as earthquakes from an existing catalog, instrument noise, anthropogenic disturbances, etc. Including these examples helps improve thresholding during the detection process. 5. Train: -The `das_anomaly.train` module helps with randomly selecting train and test PSD images and training the model (with CPU or GPU) on anomaly-free PSD images. If you need to change model's architecture, you'll need to modify the `encoder` and `decoder` functions in the [utils.py](das_anomaly/utils.py). + +The `das_anomaly.train` module helps with randomly selecting train and test PSD images and training the model (with CPU or GPU) on anomaly-free PSD images. ### Example ```python from das_anomaly.settings import SETTINGS @@ -127,13 +151,15 @@ ImageSplitter(cfg).run() cfg = TrainAEConfig() AutoencoderTrainer(cfg).run() ``` -Note: Since the `TrainSplitConfig()` function randomly selects PSD images from the generated plots, you must ensure the training and testing datasets do not include anomalies. If you have an excel sheet with time stamp of anomalies, use "exclude_known_events_from_training" in examples directory to exclude them. Or, manually inspect both the training and testing sets to ensure they do not contain apparent anomalies. Review their time- and frequency-domain representations, and remove any suspicious samples to maintain the quality of training. +Note: Since the `TrainSplitConfig()` function randomly selects PSD images from the generated plots, you must ensure the training and testing datasets do not include obvious anomalies. If you have an excel sheet with time stamp of anomalies (such as a catalog), use the "exclude_known_events_from_training" in examples directory to exclude them. Or, manually inspect both the training and testing sets to ensure they do not contain apparent anomalies. Review their time- and frequency-domain representations, and remove any suspicious samples to maintain the quality of training. + +6. Test and set thresholds: -6. Test and set a threshold: -Using the _validate_and_plot_density_ jupyter notebook in the examples directory, validate the trained model and find an appropriate density score as a threshold for anomaly detection. Then, make sure to modify the DENSITY_THRESHOLD parameter in the _config_user_ script. +Using the _validate_and_plot_density_ and _thresholding_f_score_ jupyter notebooks in the examples directory, validate the trained model and find appropriate MSE and density score as thresholds for anomaly detection. Make sure to modify the DENSITY_THRESHOLD and MSE_THRESHOLD parameters in the _config_user_ script. 7. Run the trained model: -The `das_anomaly.detect` module applies the trained model to the data to detect anomalies in the PSD images and writes their information. It also copies the detected anomaly to the RESULTS_PATH. MPI can be used to distribute PSDs over CPUs. Then, using the `das_anomaly.count` module, count the number of detected anomalies and display their details and file paths. + +The `das_anomaly.detect` module uses the trained model to detect anomalies in the PSD images and writes their information (e.g., time stamp). It also copies the detected anomaly to the RESULTS_PATH. MPI can be used to distribute PSDs over CPUs. Then, using the `das_anomaly.count` module, count the number of detected anomalies and display their details and file paths. ### Example ```python from das_anomaly.count.counter import CounterConfig, AnomalyCounter @@ -145,7 +171,7 @@ AnomalyDetector(cfg).run() # parallel processing with multiple processors using MPI: AnomalyDetector(cfg).run_parallel() -# count number of anomalies +# count number of detected anomalies cfg = CounterConfig(keyword="anomaly") anomalies = AnomalyCounter(cfg).run() print(anomalies) # prints info on number of anomalies and path to them diff --git a/docs/figures/AE-architecture.jpg b/docs/figures/AE-architecture.jpg new file mode 100644 index 0000000..46176f1 Binary files /dev/null and b/docs/figures/AE-architecture.jpg differ diff --git a/docs/figures/AE-flowchart.jpg b/docs/figures/AE-flowchart.jpg new file mode 100644 index 0000000..a463c8e Binary files /dev/null and b/docs/figures/AE-flowchart.jpg differ diff --git a/examples/bash_jobs/detect_mpi.sh b/examples/bash_jobs/detect_mpi.sh new file mode 100644 index 0000000..17a7cef --- /dev/null +++ b/examples/bash_jobs/detect_mpi.sh @@ -0,0 +1,18 @@ +#!/bin/bash +#SBATCH --ntasks=24 +#SBATCH -t 24:00:00 +#SBATCH -A YOUR_ACCOUNT +#SBATCH --mem-per-cpu=128G + +# Print start time +echo "Job started at: $(date)" + +# Load modules and environment +module load openmpi/gcc/64/4.1.5 +source activate dasanomaly + +# Recommended in a SLURM environment: use srun, not mpirun +mpirun -n $SLURM_NTASKS python -u detect_parallel.py + +# Print end time +echo "Job ended at: $(date)" diff --git a/examples/bash_jobs/detect_parallel.py b/examples/bash_jobs/detect_parallel.py new file mode 100644 index 0000000..25ab77e --- /dev/null +++ b/examples/bash_jobs/detect_parallel.py @@ -0,0 +1,4 @@ +from das_anomaly.detect import AnomalyDetector, DetectConfig + +cfg = DetectConfig() +AnomalyDetector(cfg).run_parallel() diff --git a/examples/bash_jobs/psd.sh b/examples/bash_jobs/psd.sh new file mode 100644 index 0000000..654f650 --- /dev/null +++ b/examples/bash_jobs/psd.sh @@ -0,0 +1,22 @@ +#!/bin/bash +#SBATCH -t 01:00:00 +#SBATCH -A YOUR_ACCOUNT +#SBATCH --mem-per-cpu=16G + +# Print start time +echo "Job started at: $(date)" + +# Load modules and environment +source activate dasanomaly + +# Run the script +python << EOF +from das_anomaly.psd import PSDConfig, PSDGenerator + +cfg = PSDConfig() +# serial processing with single processor: +PSDGenerator(cfg).run() +EOF + +# Print end time +echo "Job ended at: $(date)" diff --git a/examples/bash_jobs/psd_mpi.sh b/examples/bash_jobs/psd_mpi.sh new file mode 100644 index 0000000..70e5b29 --- /dev/null +++ b/examples/bash_jobs/psd_mpi.sh @@ -0,0 +1,18 @@ +#!/bin/bash +#SBATCH --ntasks=24 +#SBATCH -t 6:00:00 +#SBATCH -A YOUR_ACCOUNT +#SBATCH --mem-per-cpu=16G + +# Print start time +echo "Job started at: $(date)" + +# Load modules and environment +module load openmpi/gcc/64/4.1.5 +source activate dasanomaly + +# Recommended in a SLURM environment: use srun, not mpirun +mpirun -n $SLURM_NTASKS python -u psd_parallel.py + +# Print end time +echo "Job ended at: $(date)" diff --git a/examples/bash_jobs/psd_parallel.py b/examples/bash_jobs/psd_parallel.py new file mode 100644 index 0000000..09dbeee --- /dev/null +++ b/examples/bash_jobs/psd_parallel.py @@ -0,0 +1,5 @@ +from das_anomaly.psd import PSDConfig, PSDGenerator + +cfg = PSDConfig() +# parallel processing with multiple processors using MPI: +PSDGenerator(cfg).run_parallel() diff --git a/examples/bash_jobs/train.sh b/examples/bash_jobs/train.sh new file mode 100644 index 0000000..464a828 --- /dev/null +++ b/examples/bash_jobs/train.sh @@ -0,0 +1,23 @@ +#!/bin/bash +#SBATCH -p h200_normal_q +#SBATCH --gres=gpu:1 +#SBATCH --qos=tc_h200_normal_short +#SBATCH -t 1-00:00:00 +#SBATCH -A YOUR_ACCOUNT + +# Print start time +echo "Job started at: $(date)" + +# Load modules and environment +source activate dasanomaly + +# Run the script +python <<'PY' +from das_anomaly.train import TrainAEConfig, AutoencoderTrainer + +cfg = TrainAEConfig() +AutoencoderTrainer(cfg).run() +PY + +# Print end time +echo "Job ended at: $(date)" diff --git a/examples/plot_waterfall.ipynb b/examples/plot_waterfall.ipynb new file mode 100644 index 0000000..63cb66c --- /dev/null +++ b/examples/plot_waterfall.ipynb @@ -0,0 +1,162 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": null, + "id": "ce378089", + "metadata": {}, + "outputs": [], + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "from pathlib import Path\n", + "\n", + "import dascore as dc\n", + "\n", + "from das_anomaly.settings import SETTINGS" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fcf9eed5", + "metadata": {}, + "outputs": [], + "source": [ + "txt_path = SETTINGS.RESULTS_PATH\n", + "\n", + "waterfall_path = \"path/for/waterfall/plots\"\n", + "waterfall_dir = Path(waterfall_path)\n", + "waterfall_dir.mkdir(parents=True, exist_ok=True) # creates it if it doesn't exist\n", + "\n", + "data_path = SETTINGS.DATA_PATH\n", + "data_unit = SETTINGS.DATA_UNIT\n", + "\n", + "min_freq = SETTINGS.MIN_FREQ \n", + "max_freq = SETTINGS.MAX_FREQ \n", + "step_multiple = SETTINGS.STEP_MULTIPLE \n", + "start_channel = SETTINGS.START_CHANNEL \n", + "end_channel = SETTINGS.END_CHANNEL \n", + "time_window = SETTINGS.TIME_WINDOW\n", + "time_overlap = SETTINGS.TIME_OVERLAP " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9fb4b280", + "metadata": {}, + "outputs": [], + "source": [ + "# Read one path per line\n", + "s = pd.read_csv(txt_path, header=None, names=[\"path\"], dtype=str)\n", + "\n", + "# Extract timestamps\n", + "ts = s[\"path\"].str.extract(\n", + " r'(?P\\d{4}_\\d{2}_\\d{2}T\\d{2}_\\d{2}_\\d{2})__'\n", + " r'(?P\\d{4}_\\d{2}_\\d{2}T\\d{2}_\\d{2}_\\d{2})'\n", + ")\n", + "\n", + "# Build DataFrame with 1-based row numbers\n", + "df = pd.concat([s.index.to_series().add(1).rename(\"rows\"), ts], axis=1)\n", + "\n", + "print(df.head())" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "00ce78d2", + "metadata": {}, + "outputs": [], + "source": [ + "def _distance_slice(patch):\n", + " \"\"\"Convert channel range to distance range.\"\"\"\n", + " start = patch.coords[\"distance\"][start_channel]\n", + " end = patch.coords[\"distance\"][end_channel]\n", + " return (start, end)\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "368e1837", + "metadata": {}, + "outputs": [], + "source": [ + "sp = dc.spool(data_path)\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b53799e9", + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "select_time = True\n", + "\n", + "for row in df.itertuples(index=True): # index=False if you don't need the index\n", + " i = row.Index # row number (DataFrame index)\n", + " t1 = pd.to_datetime(row.time_min, format=\"%Y_%m_%dT%H_%M_%S\") \n", + " t2 = pd.to_datetime(row.time_max, format=\"%Y_%m_%dT%H_%M_%S\")\n", + " \n", + " if i>1:\n", + " break\n", + "\n", + " if select_time:\n", + " sub_sp_time = sp.select(time=(t1, t2))\n", + " sub_sp_time_distance = sub_sp_time.select(distance=(_distance_slice(sub_sp_time[0])))\n", + " sub_sp = sub_sp_time_distance\n", + " else:\n", + " sub_sp_distance = sp.select(distance=(_distance_slice(sp[0]))) \n", + " sub_sp = sub_sp_distance\n", + " # chunk into windowed sub-patches\n", + " sub_sp_chunked = sub_sp.sort(\"time\").chunk(\n", + " time=time_window, keep_partial=True,\n", + " )\n", + " if len(sub_sp_chunked) == 0:\n", + " raise ValueError(\"No patch of DAS data found within data path: %s\")\n", + " # iterate over patches and perform preprocessing\n", + " for pa_num, patch in enumerate(sub_sp_chunked):\n", + " if data_unit == \"velocity\":\n", + " proc_patch = patch.velocity_to_strain_rate_edgeless(\n", + " step_multiple=step_multiple\n", + " ).detrend(\"time\")\n", + " elif data_unit == \"strain_rate\":\n", + " proc_patch = patch.detrend(\"time\")\n", + " else:\n", + " raise ValueError(\n", + " \"Unsupported data_unit: {data_unit!r}. \"\n", + " \"Expected 'velocity' or 'strain_rate'.\"\n", + " )\n", + " ax = proc_patch.viz.waterfall(show=False, scale=(-2.5e-6,2.5e-6))\n", + " fig = ax.get_figure() # or: fig = ax.figure\n", + " name = Path(s[\"path\"].iloc[i]).stem\n", + " fig.savefig(waterfall_dir / f\"{name}_{pa_num}.png\", dpi=300, bbox_inches=\"tight\")" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "dasanomaly", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.11" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/examples/randomly_select_results.ipynb b/examples/randomly_select_results.ipynb new file mode 100644 index 0000000..0fcec64 --- /dev/null +++ b/examples/randomly_select_results.ipynb @@ -0,0 +1,201 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": null, + "id": "0d929088", + "metadata": {}, + "outputs": [], + "source": [ + "import random, re, shutil\n", + "import csv\n", + "import random, re, shutil\n", + "from pathlib import Path" + ] + }, + { + "cell_type": "markdown", + "id": "b0859d24", + "metadata": {}, + "source": [ + "# Randomly select anomalies and normal images separately to evaluate and pick an appropriate threshold for the detector " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "38d0f749", + "metadata": {}, + "outputs": [], + "source": [ + "for threshold in [-346, -200, -100, 10, 100, 250, 350, 500, 600, 700, 800, 900, 1000, 1200, 1400, 1600, 1800, 2000, 5000, 8769]:\n", + " ROOT = Path(f\"/scratch/ahmad9/caserm/detect_mse_and_density/model1600/512.256.4/results_thresh_{threshold}/evaluate/\")\n", + " TXT_DIR = f\"/scratch/ahmad9/caserm/detect_mse_and_density/model1600/512.256.4/results_thresh_{threshold}\" # directory containing many .txt files\n", + " OUT_DIR_ALL = f\"/scratch/ahmad9/caserm/detect_mse_and_density/model1600/512.256.4/results_thresh_{threshold}/evaluate/randomly_selected_all\"\n", + " OUT_DIR_ANOMALY = f\"/scratch/ahmad9/caserm/detect_mse_and_density/model1600/512.256.4/results_thresh_{threshold}/evaluate/randomly_selected_anomaly\"\n", + " OUT_DIR_NORMAL = f\"/scratch/ahmad9/caserm/detect_mse_and_density/model1600/512.256.4/results_thresh_{threshold}/evaluate/randomly_selected_normal\"\n", + " csv_path = Path(f\"/scratch/ahmad9/caserm/detect_mse_and_density/model1600/512.256.4/results_thresh_{threshold}/evaluate/sampled_lines.csv\")\n", + " csv_path.parent.mkdir(parents=True, exist_ok=True)\n", + " N = 500 # how many lines to sample from the merged list\n", + " SEED = 42 # random seed for reproducibility\n", + " FILTER_SUBSTR = None # e.g., \"rank 0\" or \"anomaly\"; use None to skip filtering\n", + "\n", + " # --- Implementation (run as-is) ---\n", + " # Example line:\n", + " # Rank 0 · line 4, /path/to/file.png: The image is normal\n", + " # Regex to pull the PNG path and (optionally) the label\n", + " PNG_AND_LABEL = re.compile(r',\\s*(/[^:]+?\\.png):\\s*The image is\\s+(normal|anomaly)\\s*$', re.IGNORECASE)\n", + " PNG_IN_LINE = re.compile(r',\\s*(/[^:]+?\\.png):')\n", + "\n", + " def unique_dest(outdir: Path, src: Path) -> Path:\n", + " \"\"\"Return a unique destination path in outdir for src (avoid collisions).\"\"\"\n", + " dst = outdir / src.name\n", + " if not dst.exists():\n", + " return dst\n", + " stem, suffix = src.stem, src.suffix\n", + " k = 1\n", + " while True:\n", + " cand = outdir / f\"{stem}_{k}{suffix}\"\n", + " if not cand.exists():\n", + " return cand\n", + " k += 1\n", + "\n", + " # 1) Collect & merge all .txt files\n", + " txt_dir = Path(TXT_DIR)\n", + " all_txts = sorted(txt_dir.glob(\"*.txt\"))\n", + " merged_lines = []\n", + " for tf in all_txts:\n", + " try:\n", + " merged_lines.extend(tf.read_text(encoding=\"utf-8\", errors=\"ignore\").splitlines())\n", + " except Exception as e:\n", + " print(f\"Skipping {tf}: {e}\")\n", + "\n", + " if FILTER_SUBSTR:\n", + " merged_lines = [ln for ln in merged_lines if FILTER_SUBSTR in ln]\n", + "\n", + " # Keep only lines that look like they contain a PNG path (for sampling)\n", + " candidate_lines = [ln for ln in merged_lines if PNG_IN_LINE.search(ln)]\n", + " if not candidate_lines:\n", + " raise RuntimeError(\"No candidate lines with .png paths were found.\")\n", + "\n", + " # 2) Reproducible sampling of lines\n", + " rng = random.Random(SEED)\n", + " n = min(N, len(candidate_lines))\n", + " sampled_lines = rng.sample(candidate_lines, n)\n", + "\n", + " # Prepare output dirs\n", + " out_anom = Path(OUT_DIR_ANOMALY); out_anom.mkdir(parents=True, exist_ok=True)\n", + " out_norm = Path(OUT_DIR_NORMAL); out_norm.mkdir(parents=True, exist_ok=True)\n", + " out_all = Path(OUT_DIR_ALL); out_all.mkdir(parents=True, exist_ok=True)\n", + "\n", + " # 3) Process sampled lines: copy to ALL, then by label\n", + " copied_all, copied_anom, copied_norm, missing = [], [], [], []\n", + " for ln in sampled_lines:\n", + " # Parse path + label (fallback to tail)\n", + " m = PNG_AND_LABEL.search(ln)\n", + " if m:\n", + " png_path = Path(m.group(1))\n", + " label = m.group(2).lower().strip()\n", + " else:\n", + " m2 = PNG_IN_LINE.search(ln)\n", + " if not m2:\n", + " continue\n", + " png_path = Path(m2.group(1))\n", + " tail = ln.strip().lower()\n", + " if tail.endswith(\"anomaly\"): label = \"anomaly\"\n", + " elif tail.endswith(\"normal\"): label = \"normal\"\n", + " else: label = \"\" # unknown label is fine for ALL\n", + "\n", + " if png_path.exists():\n", + " # Copy to ALL\n", + " dst_all = unique_dest(out_all, png_path)\n", + " shutil.copy2(png_path, dst_all)\n", + " copied_all.append((png_path, dst_all))\n", + "\n", + " # Copy by label\n", + " if label == \"anomaly\":\n", + " dst = unique_dest(out_anom, png_path)\n", + " shutil.copy2(png_path, dst)\n", + " copied_anom.append((png_path, dst))\n", + " elif label == \"normal\":\n", + " dst = unique_dest(out_norm, png_path)\n", + " shutil.copy2(png_path, dst)\n", + " copied_norm.append((png_path, dst))\n", + " else:\n", + " missing.append(str(png_path))\n", + "\n", + " # Manifests\n", + " (out_anom / \"copied_manifest_anomaly.csv\").write_text(\n", + " \"src,dst\\n\" + \"\\n\".join(f\"{s},{d}\" for s, d in copied_anom), encoding=\"utf-8\"\n", + " )\n", + " (out_norm / \"copied_manifest_normal.csv\").write_text(\n", + " \"src,dst\\n\" + \"\\n\".join(f\"{s},{d}\" for s, d in copied_norm), encoding=\"utf-8\"\n", + " )\n", + " # Save the exact sampled lines for traceability\n", + " sampled_text = \"\\n\".join(sampled_lines)\n", + " (ROOT / \"sampled_lines.txt\").write_text(sampled_text, encoding=\"utf-8\")\n", + "\n", + "\n", + " rows = []\n", + " for ln in sampled_lines:\n", + " m = PNG_AND_LABEL.search(ln)\n", + " if m:\n", + " png = m.group(1)\n", + " label = m.group(2).lower().strip()\n", + " else:\n", + " m2 = PNG_IN_LINE.search(ln)\n", + " png = m2.group(1) if m2 else \"\"\n", + " tail = ln.strip().lower()\n", + " label = \"anomaly\" if tail.endswith(\"anomaly\") else \"normal\" if tail.endswith(\"normal\") else \"\"\n", + "\n", + " png_name = Path(png).name if png else \"\"\n", + " rows.append((ln, png, label, png_name))\n", + "\n", + " # (Optional) natural sort so numbers in filenames sort as 1,2,10 not 1,10,2\n", + " def _nkey(s): # natural sort key\n", + " parts = re.split(r\"(\\d+)\", s or \"\")\n", + " return [int(p) if p.isdigit() else p.lower() for p in parts]\n", + "\n", + " rows_sorted = sorted(rows, key=lambda r: _nkey(r[3])) # sort by png_name\n", + "\n", + " with csv_path.open(\"w\", newline=\"\", encoding=\"utf-8\") as f:\n", + " w = csv.writer(f)\n", + " w.writerow([\"full_line\", \"png_path\", \"label\", \"png_name\"])\n", + " w.writerows(rows_sorted)\n", + "\n", + " print(f\"Wrote sampled-lines CSV → {csv_path} (sorted by png_name)\")\n", + "\n", + " if missing:\n", + " (out_anom / \"missing.txt\").write_text(\"\\n\".join(missing), encoding=\"utf-8\")\n", + "\n", + " print(f\"Merged {len(all_txts)} .txt files; {len(merged_lines)} lines total \"\n", + " f\"({len(candidate_lines)} with PNGs).\")\n", + " print(f\"Sampled {n} lines → copied {len(copied_anom)} anomalies, {len(copied_norm)} normals; \"\n", + " f\"missing {len(missing)}.\")\n", + " print(f\"Anomaly out dir: {out_anom}\")\n", + " print(f\"Normal out dir: {out_norm}\")\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "dasanomaly", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.11" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/examples/thresholding_f_score.ipynb b/examples/thresholding_f_score.ipynb new file mode 100644 index 0000000..e230939 --- /dev/null +++ b/examples/thresholding_f_score.ipynb @@ -0,0 +1,721 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": null, + "id": "270fa382", + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "from pathlib import Path\n", + "import numpy as np\n", + "import matplotlib.pyplot as plt\n", + "import matplotlib.patches as patches\n", + "import os\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "937cd1ef", + "metadata": {}, + "outputs": [], + "source": [ + "f1_scores_list = []\n", + "f1_scores_list_random = []\n", + "seismic_acc_list = []\n", + "seismic_to_all_anom_list = []\n", + "recall_list = []\n", + "recall_list_random = []\n", + "precision_list = []\n", + "precision_list_random = []\n", + "f1_scores_r_list = []\n", + "f1_scores_r_list_random = []\n", + "f1_scores_p_list = []\n", + "f1_scores_p_list_random = []\n", + "storate_reduction_list = []\n", + "beta_r = 2\n", + "beta_p = 0.5\n", + "# grab the \"label\" column of each CSV_PATH in the for loop and add it to the end of CSV_PATH_PNGS as \"label\"+str(threshold)\n", + "OUT_PATH_RANDOM = f\"/scratch/ahmad9/caserm/detect_mse_and_density/model1600/512.256.4/results_thresh/evaluate/good_random/sampled_lines_eval.csv\" # <- fixed f-string\n", + "results_random = pd.read_csv(OUT_PATH_RANDOM)\n", + "OUT_PATH_ANOMALOUS = f\"/scratch/ahmad9/caserm/detect_mse_and_density/model1600/512.256.4/anomalous/results_thresh/evaluate/sampled_lines_eval.csv\" # <- fixed f-string\n", + "results_anomalous = pd.read_csv(OUT_PATH_ANOMALOUS)\n", + "\n", + "# Ensure both DataFrames share the same set of columns (union), filling missing with NaN\n", + "all_cols = sorted(set(results_random.columns) | set(results_anomalous.columns)) \n", + "results_random = results_random.reindex(columns=all_cols)\n", + "for col in (\"label\", \"evaluate\", \"seismic\"):\n", + " results_random[col] = results_random[col].str.strip().str.lower()\n", + "pred_map = {\"y\": \"anomaly\", \"n\": \"normal\"}\n", + "results_random[\"pred\"] = results_random[\"evaluate\"].map(pred_map)\n", + "\n", + "results_anomalous = results_anomalous.reindex(columns=all_cols)\n", + "\n", + "results_all = pd.concat([results_random, results_anomalous], ignore_index=True)\n", + "\n", + "# Read and normalize\n", + "# df_ = pd.read_csv(OUT_PATH, dtype=str)\n", + "for col in (\"label\", \"evaluate\", \"seismic\"):\n", + " results_all[col] = results_all[col].str.strip().str.lower()\n", + "\n", + "\n", + "# Map evaluation -> predicted label\n", + "pred_map = {\"y\": \"anomaly\", \"n\": \"normal\"}\n", + "results_all[\"pred\"] = results_all[\"evaluate\"].map(pred_map)\n", + "\n", + "# Confusion counts\n", + "tp = int(((results_all[\"label\"] == \"anomaly\") & (results_all[\"pred\"] == \"anomaly\")).sum())\n", + "tp_random = int(((results_random[\"label\"] == \"anomaly\") & (results_random[\"pred\"] == \"anomaly\")).sum())\n", + "tn = int(((results_all[\"label\"] == \"normal\") & (results_all[\"pred\"] == \"normal\")).sum())\n", + "tn_random = int(((results_random[\"label\"] == \"normal\") & (results_random[\"pred\"] == \"normal\")).sum())\n", + "fp = int(((results_all[\"label\"] == \"anomaly\") & (results_all[\"pred\"] == \"normal\")).sum())\n", + "fp_random = int(((results_random[\"label\"] == \"anomaly\") & (results_random[\"pred\"] == \"normal\")).sum())\n", + "fn = int(((results_all[\"label\"] == \"normal\") & (results_all[\"pred\"] == \"anomaly\")).sum())\n", + "fn_random = int(((results_random[\"label\"] == \"normal\") & (results_random[\"pred\"] == \"anomaly\")).sum())\n", + "events = int(((results_all[\"label\"] == \"anomaly\") & (results_all[\"seismic\"] == \"y\")).sum()) \n", + "\n", + "print(f\"TP={tp}, TN={tn}, FP={fp}, FN={fn}, SEISMIC={events}\")\n", + "\n", + "precision = tp / (tp + fp)\n", + "precision_random = tp_random / (tp_random + fp_random)\n", + "recall = tp / (tp + fn)\n", + "recall_random = tp_random / (tp_random + fn_random)\n", + "\n", + "if tp==0:\n", + " f1_score = 0\n", + " f1_score_r = 0\n", + " f1_score_p = 0\n", + " seismic_accuracy = 0\n", + " seismic_to_all_anom = 0\n", + "elif tp_random==0:\n", + " storate_reduction = 0\n", + " f1_score_random = 0\n", + " f1_score_r_random = 0\n", + " f1_score_p_random = 0\n", + "else:\n", + " f1_score = 2 * precision * recall / (precision + recall)\n", + " f1_score_random = 2 * precision * recall / (precision + recall)\n", + " f1_score_r = (1+beta_r**2) * precision * recall / (((beta_r**2) * precision) + recall) \n", + " f1_score_r_random = (1+beta_r**2) * precision * recall / (((beta_r**2) * precision) + recall) \n", + " f1_score_p = (1+beta_p**2) * precision * recall / (((beta_p**2) * precision) + recall) \n", + " f1_score_p_random = (1+beta_p**2) * precision * recall / (((beta_p**2) * precision) + recall) \n", + " storate_reduction = results_random.shape[0] / tp_random \n", + " # seismic_accuracy = events / tp\n", + " seismic_accuracy = events / int((results_all[\"seismic\"] == \"y\").sum()) \n", + " seismic_to_all_anom = events / int((results_all[\"label\"] == \"anomaly\").sum()) \n", + "\n", + "print(f\"Precision = {precision}, Recall = {recall}\")\n", + "print(f\"F1 Score = {f1_score}, Seismic Accuracy = {seismic_accuracy}\")\n", + "f1_scores_list.append(f1_score) \n", + "f1_scores_list_random.append(f1_score_random) \n", + "seismic_acc_list.append(seismic_accuracy) \n", + "recall_list.append(recall)\n", + "precision_list.append(precision)\n", + "f1_scores_r_list.append(f1_score_r)\n", + "f1_scores_r_list_random.append(f1_score_r_random)\n", + "f1_scores_p_list.append(f1_score_p)\n", + "f1_scores_p_list_random.append(f1_score_p_random)\n", + "storate_reduction_list.append(storate_reduction)\n", + "seismic_to_all_anom_list.append(seismic_to_all_anom)\n", + "# (Optional) nice confusion table\n", + "conf = pd.crosstab(results_all[\"label\"], results_all[\"pred\"]).reindex(\n", + " index=[\"anomaly\", \"normal\"], columns=[\"anomaly\", \"normal\"]\n", + ").fillna(0).astype(int)\n", + "conf.index.name = \"True\"\n", + "conf.columns.name = \"Pred\"\n", + "display(conf)\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e84602ca", + "metadata": {}, + "outputs": [], + "source": [ + "f1_scores = np.asarray(f1_scores_list, dtype=float)\n", + "f1_scores_random = np.asarray(f1_scores_list_random, dtype=float)\n", + "seismic_accuracies = np.asarray(seismic_acc_list, dtype=float)\n", + "recalls = np.asarray(recall_list, dtype=float)\n", + "precisions = np.asarray(precision_list, dtype=float)\n", + "f1_scores_r = np.asarray(f1_scores_r_list, dtype=float)\n", + "f1_scores_r_random = np.asarray(f1_scores_r_list_random, dtype=float)\n", + "f1_scores_p = np.asarray(f1_scores_p_list, dtype=float)\n", + "f1_scores_p_random = np.asarray(f1_scores_p_list_random, dtype=float)\n", + "storate_reductions = np.asarray(storate_reduction_list, dtype=float)\n", + "seismic_to_all_anom = np.asarray(seismic_to_all_anom_list, dtype=float)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "670c68be", + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "density_anomaly_txt = \"/home/ahmad9/casermminefiber/event_detection_July_2025/high_freq_low_gl/das-anomaly/examples/validate_and_plot_density/scripts/outputs/anomalous_values_model1600_512_256_4.txt\"\n", + "density_normal_txt = \"/home/ahmad9/casermminefiber/event_detection_July_2025/high_freq_low_gl/das-anomaly/examples/validate_and_plot_density/scripts/outputs/normal_values_model1600_512_256_4.txt\"\n", + "\n", + "density_anomaly = np.array(np.loadtxt(density_anomaly_txt)) \n", + "density_normal = np.array(np.loadtxt(density_normal_txt)) \n", + "merged = np.concatenate([density_anomaly.ravel(), density_normal.ravel()]) \n", + "\n", + "density = np.array([8770, 500, -346])\n", + "# density = np.array(thresholds)\n", + "density_scaled = (density - merged.min()) / (merged.max() - merged.min())" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6f71ac62", + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "\n", + "fig, ax = plt.subplots(figsize=(10, 8))\n", + "\n", + "# exactly 5 filled markers (good contrast with edgecolor)\n", + "markers = ['o', 's', '^', 'D', 'v']\n", + "\n", + "# exactly 4 colors (from current Matplotlib theme)\n", + "colors = plt.rcParams['axes.prop_cycle'].by_key()['color'][:4]\n", + "\n", + "for i, (x, y, thr) in enumerate(zip(seismic_accuracies, f1_scores, np.round(density_scaled, 3))):\n", + " m = markers[i % 5] # cycles every point: 0..4, 0..4, ...\n", + " c = colors[(i // 5) % 4] # changes every 5 points: color 0,1,2,3\n", + "\n", + " ax.scatter(\n", + " x, y,\n", + " marker=m,\n", + " s=90,\n", + " facecolor=c,\n", + " edgecolor='black',\n", + " linewidth=0.6,\n", + " label=f\"{thr}\"\n", + " )\n", + "\n", + "ax.set_xlabel('Seismic accuracy (TP&Seismic/Events)', size=12)\n", + "ax.set_ylabel('F1 score', size=12)\n", + "ax.grid(True, alpha=0.3)\n", + "\n", + "# (optional) many labels can clutter; uncomment for a compact legend:\n", + "# handles, labels = ax.get_legend_handles_labels()\n", + "# ax.legend(handles[:10], labels[:10], title=\"Density score threshold (normalized)\", ncol=2)\n", + "\n", + "ax.legend(title=\"Density score threshold (normalized)\")\n", + "plt.show()\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0e9e1dbb", + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "fig, ax = plt.subplots(figsize=(10, 8))\n", + "\n", + "# exactly 5 filled markers (good contrast with edgecolor)\n", + "markers = ['o', 's', '^', 'D', 'v']\n", + "\n", + "# exactly 4 colors (from current Matplotlib theme)\n", + "colors = plt.rcParams['axes.prop_cycle'].by_key()['color'][:4]\n", + "\n", + "for i, (x, y, thr) in enumerate(zip(seismic_accuracies, f1_scores_p, np.round(density_scaled, 3))):\n", + " m = markers[i % 5] # cycles every point: 0..4, 0..4, ...\n", + " c = colors[(i // 5) % 4] # changes every 5 points: color 0,1,2,3\n", + "\n", + " ax.scatter(\n", + " x, y,\n", + " marker=m,\n", + " s=90,\n", + " facecolor=c,\n", + " edgecolor='black',\n", + " linewidth=0.6,\n", + " label=f\"{thr}\"\n", + " )\n", + "\n", + "ax.set_xlabel('Seismic accuracy (TP&Seismic/Events)', size=12)\n", + "ax.set_ylabel('F1 score_p', size=12)\n", + "ax.grid(True, alpha=0.3)\n", + "\n", + "# Add dashed rectangle for region 0.9 < x < 1, 0.9 < y < 1\n", + "rect = patches.Rectangle(\n", + " (0.9, 0.9), # bottom-left corner\n", + " 0.1, 0.1, # width and height\n", + " linewidth=1.5,\n", + " edgecolor='red',\n", + " facecolor='none',\n", + " linestyle='--'\n", + ")\n", + "ax.add_patch(rect)\n", + "\n", + "ax.legend(title=\"Density score threshold (normalized)\")\n", + "plt.show()\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c10d7d36", + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "\n", + "fig, ax = plt.subplots(figsize=(10, 8))\n", + "\n", + "# exactly 5 filled markers (good contrast with edgecolor)\n", + "markers = ['o', 's', '^', 'D', 'v']\n", + "\n", + "# exactly 4 colors (from current Matplotlib theme)\n", + "colors = plt.rcParams['axes.prop_cycle'].by_key()['color'][:4]\n", + "\n", + "for i, (x, y, thr) in enumerate(zip(seismic_accuracies, storate_reductions, np.round(density_scaled, 3))):\n", + " m = markers[i % 5] # cycles every point: 0..4, 0..4, ...\n", + " c = colors[(i // 5) % 4] # changes every 5 points: color 0,1,2,3\n", + "\n", + " ax.scatter(\n", + " x, y,\n", + " marker=m,\n", + " s=90,\n", + " facecolor=c,\n", + " edgecolor='black',\n", + " linewidth=0.6,\n", + " label=f\"{thr}\"\n", + " )\n", + "\n", + "ax.set_xlabel('Seismic accuracy (TP&Seismic/Events)', size=12)\n", + "ax.set_ylabel('Storage reduction factor', size=12)\n", + "ax.grid(True, alpha=0.3)\n", + "\n", + "# (optional) many labels can clutter; uncomment for a compact legend:\n", + "# handles, labels = ax.get_legend_handles_labels()\n", + "# ax.legend(handles[:10], labels[:10], title=\"Density score threshold (normalized)\", ncol=2)\n", + "\n", + "ax.legend(title=\"Density score threshold (normalized)\")\n", + "plt.show()\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "14fa9758", + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "\n", + "fig, ax = plt.subplots(figsize=(10, 8))\n", + "\n", + "# exactly 5 filled markers (good contrast with edgecolor)\n", + "markers = ['o', 's', '^', 'D', 'v']\n", + "\n", + "# exactly 4 colors (from current Matplotlib theme)\n", + "colors = plt.rcParams['axes.prop_cycle'].by_key()['color'][:4]\n", + "\n", + "for i, (x, y, thr) in enumerate(zip(seismic_accuracies, seismic_to_tps, np.round(density_scaled, 3))):\n", + " m = markers[i % 5] # cycles every point: 0..4, 0..4, ...\n", + " c = colors[(i // 5) % 4] # changes every 5 points: color 0,1,2,3\n", + "\n", + " ax.scatter(\n", + " x, y,\n", + " marker=m,\n", + " s=90,\n", + " facecolor=c,\n", + " edgecolor='black',\n", + " linewidth=0.6,\n", + " label=f\"{thr}\"\n", + " )\n", + "\n", + "ax.set_xlabel('Seismic accuracy (Detected_seismic_events/Events) - Less missing seismic events', size=12)\n", + "ax.set_ylabel('Detected_seismic_events/TP - Less false trigger for non-seismic events', size=12)\n", + "ax.grid(True, alpha=0.3)\n", + "\n", + "# (optional) many labels can clutter; uncomment for a compact legend:\n", + "# handles, labels = ax.get_legend_handles_labels()\n", + "# ax.legend(handles[:10], labels[:10], title=\"Density score threshold (normalized)\", ncol=2)\n", + "\n", + "ax.legend(title=\"Density score threshold (normalized)\")\n", + "plt.show()\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "10d91a5d", + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "\n", + "fig, ax = plt.subplots(figsize=(10, 8))\n", + "\n", + "# exactly 5 filled markers (good contrast with edgecolor)\n", + "markers = ['o', 's', '^', 'D', 'v']\n", + "\n", + "# exactly 4 colors (from current Matplotlib theme)\n", + "colors = plt.rcParams['axes.prop_cycle'].by_key()['color'][:4]\n", + "\n", + "for i, (x, y, thr) in enumerate(zip(seismic_accuracies, seismic_to_all_anom, np.round(density_scaled, 3))):\n", + " m = markers[i % 5] # cycles every point: 0..4, 0..4, ...\n", + " c = colors[(i // 5) % 4] # changes every 5 points: color 0,1,2,3\n", + "\n", + " ax.scatter(\n", + " x, y,\n", + " marker=m,\n", + " s=90,\n", + " facecolor=c,\n", + " edgecolor='black',\n", + " linewidth=0.6,\n", + " label=f\"{thr}\"\n", + " )\n", + "\n", + "ax.set_xlabel('Seismic accuracy (Detected_seismic_events/Events) - Less missing seismic events', size=12)\n", + "ax.set_ylabel('Detected_seismic_events/TP - Less false trigger for non-seismic events', size=12)\n", + "ax.grid(True, alpha=0.3)\n", + "\n", + "# (optional) many labels can clutter; uncomment for a compact legend:\n", + "# handles, labels = ax.get_legend_handles_labels()\n", + "# ax.legend(handles[:10], labels[:10], title=\"Density score threshold (normalized)\", ncol=2)\n", + "\n", + "ax.legend(title=\"Density score threshold (normalized)\")\n", + "plt.show()\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f362d575", + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "fig, ax = plt.subplots(figsize=(10, 8))\n", + "\n", + "# Ensure arrays are numpy arrays\n", + "x = np.asarray(seismic_accuracies) # (kept your variable name)\n", + "y = np.asarray(seismic_to_all_anom)\n", + "z = np.asarray(density_scaled) # normalized density score (0–1), or any range\n", + "\n", + "vmin, vmax = 0.45, 0.5 # pick the band where most points live\n", + "norm = plt.Normalize(vmin=vmin, vmax=vmax, clip=True)\n", + "\n", + "sc = ax.scatter(x, y, c=z, cmap='plasma', norm=norm,\n", + " s=90, marker='o', edgecolor='black', linewidth=0.6)\n", + "cbar = plt.colorbar(sc, ax=ax, extend='both') # arrows show clipping\n", + "cbar.set_label('Density score (normalized)', size=14)\n", + "\n", + "\n", + "ax.set_xlabel('Seismic event detection coverage', size=14)\n", + "ax.set_ylabel('Seismic event detection precision', size=14)\n", + "ax.grid(True, alpha=0.3)\n", + "\n", + "# (optional) many labels can clutter; uncomment for a compact legend:\n", + "# handles, labels = ax.get_legend_handles_labels()\n", + "# ax.legend(handles[:10], labels[:10], title=\"Density score threshold (normalized)\", ncol=2)\n", + "\n", + "# ax.legend(title=\"Density score threshold (normalized)\")\n", + "plt.show()\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9b984842", + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "\n", + "fig, ax = plt.subplots(figsize=(10, 8))\n", + "\n", + "# exactly 5 filled markers (good contrast with edgecolor)\n", + "markers = ['o', 's', '^', 'D', 'v']\n", + "\n", + "# exactly 4 colors (from current Matplotlib theme)\n", + "colors = plt.rcParams['axes.prop_cycle'].by_key()['color'][:4]\n", + "\n", + "for i, (x, y, thr) in enumerate(zip(storate_reductions, f1_scores_r, np.round(density_scaled, 3))):\n", + " m = markers[i % 5] # cycles every point: 0..4, 0..4, ...\n", + " c = colors[(i // 5) % 4] # changes every 5 points: color 0,1,2,3\n", + "\n", + " ax.scatter(\n", + " x, y,\n", + " marker=m,\n", + " s=90,\n", + " facecolor=c,\n", + " edgecolor='black',\n", + " linewidth=0.6,\n", + " label=f\"{thr}\"\n", + " )\n", + "\n", + "ax.set_xlabel('Storage reduction factor', size=12)\n", + "ax.set_ylabel('F1 score_r', size=12)\n", + "ax.grid(True, alpha=0.3)\n", + "\n", + "# (optional) many labels can clutter; uncomment for a compact legend:\n", + "# handles, labels = ax.get_legend_handles_labels()\n", + "# ax.legend(handles[:10], labels[:10], title=\"Density score threshold (normalized)\", ncol=2)\n", + "\n", + "ax.legend(title=\"Density score threshold (normalized)\")\n", + "plt.show()\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "20de83d8", + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "\n", + "fig, ax = plt.subplots(figsize=(10, 8))\n", + "\n", + "# exactly 5 filled markers (good contrast with edgecolor)\n", + "markers = ['o', 's', '^', 'D', 'v']\n", + "\n", + "# exactly 4 colors (from current Matplotlib theme)\n", + "colors = plt.rcParams['axes.prop_cycle'].by_key()['color'][:4]\n", + "\n", + "for i, (x, y, thr) in enumerate(zip(storate_reductions, f1_scores_r_random, np.round(density_scaled, 3))):\n", + " if x == 0:\n", + " continue\n", + " m = markers[i % 5] # cycles every point: 0..4, 0..4, ...\n", + " c = colors[(i // 5) % 4] # changes every 5 points: color 0,1,2,3\n", + "\n", + " ax.scatter(\n", + " x, y,\n", + " marker=m,\n", + " s=90,\n", + " facecolor=c,\n", + " edgecolor='black',\n", + " linewidth=0.6,\n", + " label=f\"{thr}\"\n", + " )\n", + "\n", + "ax.set_xlabel('Storage reduction factor', size=12)\n", + "ax.set_ylabel('F1 score_r', size=12)\n", + "ax.grid(True, alpha=0.3)\n", + "\n", + "# (optional) many labels can clutter; uncomment for a compact legend:\n", + "# handles, labels = ax.get_legend_handles_labels()\n", + "# ax.legend(handles[:10], labels[:10], title=\"Density score threshold (normalized)\", ncol=2)\n", + "\n", + "ax.legend(title=\"Density score threshold (normalized)\")\n", + "plt.show()\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "92a94708", + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "\n", + "fig, ax = plt.subplots(figsize=(10, 8))\n", + "\n", + "# exactly 5 filled markers (good contrast with edgecolor)\n", + "markers = ['o', 's', '^', 'D', 'v']\n", + "\n", + "# exactly 4 colors (from current Matplotlib theme)\n", + "colors = plt.rcParams['axes.prop_cycle'].by_key()['color'][:4]\n", + "\n", + "for i, (x, y, thr) in enumerate(zip(storate_reductions, f1_scores_p, np.round(density_scaled, 3))):\n", + " m = markers[i % 5] # cycles every point: 0..4, 0..4, ...\n", + " c = colors[(i // 5) % 4] # changes every 5 points: color 0,1,2,3\n", + "\n", + " ax.scatter(\n", + " x, y,\n", + " marker=m,\n", + " s=90,\n", + " facecolor=c,\n", + " edgecolor='black',\n", + " linewidth=0.6,\n", + " label=f\"{thr}\"\n", + " )\n", + "\n", + "ax.set_xlabel('Storage reduction factor', size=12)\n", + "ax.set_ylabel('F1 score_p', size=12)\n", + "ax.grid(True, alpha=0.3)\n", + "\n", + "# (optional) many labels can clutter; uncomment for a compact legend:\n", + "# handles, labels = ax.get_legend_handles_labels()\n", + "# ax.legend(handles[:10], labels[:10], title=\"Density score threshold (normalized)\", ncol=2)\n", + "\n", + "ax.legend(title=\"Density score threshold (normalized)\")\n", + "plt.show()\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6491c331", + "metadata": {}, + "outputs": [], + "source": [ + "import numpy as np\n", + "import matplotlib.pyplot as plt\n", + "from matplotlib.colors import Normalize\n", + "from matplotlib.colors import PowerNorm\n", + "\n", + "fig, ax = plt.subplots(figsize=(10, 8))\n", + "\n", + "# Ensure arrays are numpy arrays\n", + "x = np.asarray(storate_reductions) # (kept your variable name)\n", + "y = np.asarray(f1_scores_r)\n", + "z = np.asarray(density_scaled) # normalized density score (0–1), or any range\n", + "\n", + "vmin, vmax = 0.45, 0.5 # pick the band where most points live\n", + "norm = plt.Normalize(vmin=vmin, vmax=vmax, clip=True)\n", + "\n", + "sc = ax.scatter(x, y, c=z, cmap='plasma', norm=norm,\n", + " s=90, marker='o', edgecolor='black', linewidth=0.6)\n", + "cbar = plt.colorbar(sc, ax=ax, extend='both') # arrows show clipping\n", + "cbar.set_label('Density score (normalized)', size=14)\n", + "# Continuous colormap driven by density score\n", + "# norm = Normalize(vmin=np.nanmin(z), vmax=np.nanmax(z))\n", + "# sc = ax.scatter(\n", + "# x, y,\n", + "# c=z, cmap='viridis', norm=norm, # change 'viridis' if you prefer\n", + "# s=90, marker='o',\n", + "# edgecolor='black', linewidth=0.6\n", + "# )\n", + "\n", + "ax.set_xlabel('Estimated storage reduction factor', size=14)\n", + "ax.set_ylabel('F1 score (beta=2)', size=14)\n", + "ax.grid(True, alpha=0.3)\n", + "\n", + "# Colorbar instead of legend\n", + "# cbar = plt.colorbar(sc, ax=ax)\n", + "# cbar.set_label('Density score (normalized)')\n", + "\n", + "plt.tight_layout()\n", + "plt.show()\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bfc74642", + "metadata": {}, + "outputs": [], + "source": [ + "import numpy as np\n", + "import matplotlib.pyplot as plt\n", + "from matplotlib.colors import Normalize\n", + "from matplotlib.colors import PowerNorm\n", + "\n", + "fig, ax = plt.subplots(figsize=(10, 8))\n", + "\n", + "# Ensure arrays are numpy arrays\n", + "x = np.asarray(storate_reductions) # (kept your variable name)\n", + "y = np.asarray(f1_scores_p)\n", + "z = np.asarray(density_scaled) # normalized density score (0–1), or any range\n", + "\n", + "vmin, vmax = 0.45, 0.5 # pick the band where most points live\n", + "norm = plt.Normalize(vmin=vmin, vmax=vmax, clip=True)\n", + "\n", + "sc = ax.scatter(x, y, c=z, cmap='plasma', norm=norm,\n", + " s=90, marker='o', edgecolor='black', linewidth=0.6)\n", + "cbar = plt.colorbar(sc, ax=ax, extend='both') # arrows show clipping\n", + "cbar.set_label('Density score (normalized)', size=14)\n", + "# Continuous colormap driven by density score\n", + "# norm = Normalize(vmin=np.nanmin(z), vmax=np.nanmax(z))\n", + "# sc = ax.scatter(\n", + "# x, y,\n", + "# c=z, cmap='viridis', norm=norm, # change 'viridis' if you prefer\n", + "# s=90, marker='o',\n", + "# edgecolor='black', linewidth=0.6\n", + "# )\n", + "\n", + "ax.set_xlabel('Estimated storage reduction factor', size=14)\n", + "ax.set_ylabel('F1 score (beta=0.25)', size=14)\n", + "ax.grid(True, alpha=0.3)\n", + "\n", + "# Colorbar instead of legend\n", + "# cbar = plt.colorbar(sc, ax=ax)\n", + "# cbar.set_label('Density score (normalized)')\n", + "\n", + "plt.tight_layout()\n", + "plt.show()\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e9f6d748", + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "\n", + "fig, ax = plt.subplots(figsize=(10, 8))\n", + "\n", + "# exactly 5 filled markers (good contrast with edgecolor)\n", + "markers = ['o', 's', '^', 'D', 'v']\n", + "\n", + "# exactly 4 colors (from current Matplotlib theme)\n", + "colors = plt.rcParams['axes.prop_cycle'].by_key()['color'][:4]\n", + "\n", + "for i, (x, y, thr) in enumerate(zip(recalls, precisions, np.round(density_scaled, 3))):\n", + " m = markers[i % 5] # cycles every point: 0..4, 0..4, ...\n", + " c = colors[(i // 5) % 4] # changes every 5 points: color 0,1,2,3\n", + "\n", + " ax.scatter(\n", + " x, y,\n", + " marker=m,\n", + " s=90,\n", + " facecolor=c,\n", + " edgecolor='black',\n", + " linewidth=0.6,\n", + " label=f\"{thr}\"\n", + " )\n", + "\n", + "ax.set_xlabel('Recall', size=12)\n", + "ax.set_ylabel('Precision', size=12)\n", + "ax.grid(True, alpha=0.3)\n", + "\n", + "# (optional) many labels can clutter; uncomment for a compact legend:\n", + "# handles, labels = ax.get_legend_handles_labels()\n", + "# ax.legend(handles[:10], labels[:10], title=\"Density score threshold (normalized)\", ncol=2)\n", + "\n", + "ax.legend(title=\"Density score threshold (normalized)\")\n", + "plt.show()\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "dasanomaly", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.11" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +}