This folder contains the complete workflow for preparing oceanographic satellite and buoy data for machine learning model training. The project combines NOAA satellite SST data with buoy-measured water chemistry (pCO2) observations across 7 coastal monitoring locations.
- Goal: Generate cleaned, ML-ready training datasets using only measured (non-interpolated) buoy data from continuous measurement periods
- Data Sources:
  - Satellite SST: JPL MUR (0.042° resolution, ~4.6 km)
  - Satellite Chl-a: MODIS-Aqua satellite data
  - Buoys: NOAA water chemistry (7 locations, 2013-2025)
- Key Output: Training tables with matched satellite/buoy observations within a 4 km spatial grid
## Data Source Credits
- **Chlorophyll-a (chl-a) data:** MODIS-Aqua, NASA Goddard Space Flight Center, Ocean Ecology Laboratory, Ocean Biology Processing Group; (2022): MODIS-Aqua Ocean Color Data, NASA OB.DAAC. https://oceancolor.gsfc.nasa.gov/
- **Sea Surface Temperature (SST):** JPL MUR, NASA PO.DAAC. https://podaac.jpl.nasa.gov/Multi-scale_Ultra-high_Resolution_MUR-SST
- **Buoy data:** NOAA National Data Buoy Center. https://www.ndbc.noaa.gov/
## Project Structure

```
ML2026_Orrand/
├── README.md
├── requirements.txt
├── create_presentation.py
├── pCO2_ML_Presentation.pptx
├── data/
│   ├── DATA_ANALYSIS_WORKFLOW.md
│   ├── DATA_REGENERATION_GUIDE.md
│   ├── processed/
│   │   ├── buoy_continuous_data_periods.csv
│   │   ├── buoy_daily_agg.csv
│   │   ├── buoy_data_cleaned.csv
│   │   ├── combined_satellite_buoy.csv
│   │   ├── DATASET_PRESENTATION_SUMMARY.txt
│   │   ├── ml_data_clean_unscaled.csv
│   │   ├── satellite_sst_cleaned.csv
│   │   ├── satellite_sst_daily.csv
│   │   ├── sat_chla_pixels_test_70rows.csv
│   │   ├── sat_sst_pixels_test_70rows.csv
│   │   ├── training_data_700.csv
│   │   ├── training_data_700_ml_ready.csv
│   │   ├── train_anchors_sample_2000.csv
│   │   └── mur/
│   ├── raw/
│   │   ├── buoy_sources/
│   │   ├── chla/
│   │   ├── mur/
│   │   └── satellite_sources/
│   └── training/
│       ├── ml_data_minmax_scaled.csv
│       └── ml_data_standardized.csv
├── docs/
│   ├── data_exploration_report.md
│   └── NOAA_BUOY_DATA_README.md
├── final project deliverables/
│   ├── Mary Orrand multi panel figure ESS 469.pdf
│   ├── pco2_ml_training_summary.md
│   └── summary_figure_6panel.png
├── notebooks/
│   ├── 00_data_download/
│   │   └── RECOVER_DATA_FROM_NOAA.ipynb
│   ├── 01_data_exploration/
│   │   ├── API attempt SST ERDDAP Orrand.ipynb
│   │   ├── Daily data exploration.ipynb
│   │   ├── Data/
│   │   ├── explore_data.ipynb
│   │   └── NOAA buoy data.ipynb
│   ├── 02_data_preparation/
│   │   ├── data/
│   │   ├── ML_DataPrep_SST_pCO2.ipynb
│   │   └── shrink_raw_sat_data.ipynb
│   ├── 03_model_training/
│   │   ├── ML_Training_Continuous_Data.ipynb
│   │   ├── pco2_ml_training.ipynb
│   │   └── pco2_ml_training_summary.md
│   ├── 04_summary_figure.ipynb
│   └── data/
│       └── processed/
├── plots/
│   ├── 01_satellite_sst_overview.png
│   ├── 02_buoy_data_overview.png
│   ├── 03_ml_dataset_dashboard.png
│   ├── 04_geographic_correlation_analysis.png
│   ├── 05_pco2_data_availability_timeline.png
│   ├── ERDDAP SST.png
│   ├── pmel_carbonuptake.jpg
│   ├── presentation_records_per_location.png
│   ├── presentation_sst_vs_pco2_scatter.png
│   ├── summary_figure_6panel.png
│   ├── analysis/
│   ├── exploration/
│   ├── ml_results/
│   └── training_data_eda/
└── .gitignore
```
## Workflow

**Phase 1: Data Exploration** (`01_data_exploration/`)
- Load NOAA satellite and buoy data
- Inspect data availability and quality across locations
- Identify continuous measurement windows
- Generate exploratory plots
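
Identifying continuous measurement windows can be sketched as grouping sorted observation dates into runs with no gap larger than a chosen threshold. The snippet below is a minimal, standard-library illustration, not the notebook's actual code: the function name `continuous_windows`, the `max_gap_days` parameter, and the sample dates are all assumptions made for the example.

```python
from datetime import date

def continuous_windows(dates, max_gap_days=1):
    """Group dates into (start, end) runs whose consecutive dates are
    at most max_gap_days apart. The threshold is an assumed parameter."""
    windows = []
    for d in sorted(set(dates)):
        if windows and (d - windows[-1][1]).days <= max_gap_days:
            windows[-1][1] = d        # extend the current window
        else:
            windows.append([d, d])    # gap too large: start a new window
    return [(start, end) for start, end in windows]

# Toy dates (not real buoy data): a three-day run, a gap, then a two-day run.
sample_dates = [
    date(2020, 1, 1), date(2020, 1, 2), date(2020, 1, 3),
    date(2020, 1, 10), date(2020, 1, 11),
]
windows = continuous_windows(sample_dates)
for start, end in windows:
    print(start, "to", end, f"({(end - start).days + 1} days)")
```

Raising `max_gap_days` merges windows separated by short outages, so the threshold controls how strictly "continuous" is interpreted.
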

**Phase 2: Data Preparation** (`02_data_preparation/`)
- Clean and standardize data (convert -999 missing-value sentinels to nulls, normalize date formats)
- Create master files combining all locations
- Scale features (StandardScaler and MinMaxScaler options)
- Generate quality reports
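
The cleaning and scaling steps can be illustrated with a small standard-library sketch. It is not the notebook's implementation (which presumably relies on pandas and scikit-learn); the helper names, the two date formats, and the sample readings are assumptions. `standardize` uses the population standard deviation, which is what scikit-learn's `StandardScaler` computes, and `minmax_scale` maps values to the range [0, 1] as `MinMaxScaler` does by default.

```python
import math
from datetime import datetime
from statistics import fmean, pstdev

MISSING_SENTINEL = -999.0  # sentinel value named in the README

def clean_value(raw):
    """Convert a raw reading to float, mapping the sentinel to NaN."""
    value = float(raw)
    return math.nan if value == MISSING_SENTINEL else value

def parse_date(text, formats=("%Y-%m-%d", "%m/%d/%Y")):
    """Try each candidate format in turn (the formats are assumptions)."""
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {text!r}")

def standardize(values):
    """Rescale to zero mean and unit variance (population std).
    Assumes the values are not all identical."""
    mean, std = fmean(values), pstdev(values)
    return [(v - mean) / std for v in values]

def minmax_scale(values):
    """Rescale to [0, 1]. Assumes the values are not all identical."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Toy SST readings (not real data); the second one is a missing value.
raw_sst = ["12.0", "-999", "14.0", "16.0"]
cleaned = [clean_value(v) for v in raw_sst]
measured = [v for v in cleaned if not math.isnan(v)]  # drop missing readings
standardized = standardize(measured)
minmax = minmax_scale(measured)
print(measured, minmax)
```
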

**Phase 3: Training Data Creation** (`03_model_training/`)
- Filter to continuous data periods only (no interpolation)
- Create a 4 km spatial grid around each buoy location
- Match satellite observations to buoy dates/locations within grid
- Export ML-ready training tables with features and target variable
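
One way to picture the matching step is as a same-date filter combined with a great-circle distance check. The sketch below assumes a haversine distance and a simple radius test; the notebook may build its grid differently. The record layout, the function names, and all coordinates and values are made up for illustration.

```python
import math
from datetime import date

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def match_pixels(buoy, pixels, radius_km=4.0):
    """Return satellite pixels observed on the buoy's date within radius_km."""
    return [
        p for p in pixels
        if p["date"] == buoy["date"]
        and haversine_km(buoy["lat"], buoy["lon"], p["lat"], p["lon"]) <= radius_km
    ]

# Toy records (not real observations).
buoy = {"date": date(2020, 6, 1), "lat": 47.0, "lon": -125.0, "pco2": 380.0}
pixels = [
    {"date": date(2020, 6, 1), "lat": 47.02, "lon": -125.0, "sst": 13.1},  # about 2.2 km away
    {"date": date(2020, 6, 1), "lat": 47.10, "lon": -125.0, "sst": 13.4},  # about 11 km away
    {"date": date(2020, 6, 2), "lat": 47.00, "lon": -125.0, "sst": 13.0},  # different date
]
matched = match_pixels(buoy, pixels)
print(len(matched), "pixel(s) matched")
```

Only the first pixel passes both checks. A buoy reading with no matching pixel would be dropped rather than filled in, which is consistent with the no-interpolation goal.
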
## Quick Start

Install the dependencies with `pip install -r requirements.txt`, then run the notebooks in order:

- Exploration: start with `notebooks/01_data_exploration/explore_data.ipynb`
- Preparation: run `notebooks/02_data_preparation/ML_DataPrep_SST_pCO2.ipynb`
- Training data: execute `notebooks/03_model_training/ML_Training_Continuous_Data.ipynb`
## Configuration

In `ML_Training_Continuous_Data.ipynb`, edit the configuration section:

- `APPROACH`: Choose `'single'`, `'multi'`, or `'all'` locations
- `SELECTED_LOCATIONS`: List specific buoys to include
- `GRID_RADIUS_KM`: Currently set to 4 km (approximately matches the ~4.6 km satellite resolution)
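
As a rough illustration, a configuration cell using these three settings might look like the sketch below. Only the setting names come from this README. The location labels are placeholders rather than real buoy names, and `resolve_locations` is a hypothetical helper showing one plausible reading of how `APPROACH` and `SELECTED_LOCATIONS` interact.

```python
# Setting names are from the README; the values shown are placeholders.
APPROACH = "multi"                         # 'single', 'multi', or 'all'
SELECTED_LOCATIONS = ["buoy_a", "buoy_b"]  # placeholder labels, not real buoy names
GRID_RADIUS_KM = 4

def resolve_locations(approach, selected, available):
    """Hypothetical helper: return the locations to process.
    Assumes 'all' ignores the selection and 'single' needs exactly one entry."""
    if approach == "all":
        return list(available)
    if approach not in ("single", "multi"):
        raise ValueError(f"Unknown APPROACH: {approach!r}")
    unknown = [name for name in selected if name not in available]
    if unknown:
        raise ValueError(f"Unknown locations: {unknown}")
    if approach == "single" and len(selected) != 1:
        raise ValueError("'single' expects exactly one location")
    if not selected:
        raise ValueError("SELECTED_LOCATIONS must not be empty")
    return list(selected)

available_locations = ["buoy_a", "buoy_b", "buoy_c"]  # placeholder labels
locations = resolve_locations(APPROACH, SELECTED_LOCATIONS, available_locations)
print(locations, "with radius", GRID_RADIUS_KM, "km")
```

Validating the settings up front makes a typo in a location name fail immediately instead of silently producing an empty training table.
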
## Key Files

- `buoy_data_cleaned.csv`: All 7 buoys, all dates, cleaned values
  - Columns: `datetime`, `latitude`, `longitude`, `sst_celsius`, `pco2_sw_sat`, `xco2_sw_dry`, `location`
  - ~26,000 records
- `satellite_sst_cleaned.csv`: All 6 locations, all dates
  - Columns: `datetime`, `latitude`, `longitude`, `sst_celsius`, `location`
  - ~61,500 records
- `buoy_continuous_data_periods.csv`: Data availability windows per location
  - Identifies continuous measurement periods suitable for training
- `ml_training_continuous_data_YYYYMMDD.csv`: Final training table
  - One row per buoy measurement with matched satellite data
  - No NaN values, all measured (no interpolation)
  - Ready for ML model training
## Notes

- Data Quality: All training data uses only measured values; no interpolation or estimation
- Spatial Resolution: 4 km grid radius chosen to match ~4.6 km satellite resolution
- Continuous Periods: Training filtered to date windows where buoys had continuous measurements
- File Paths: Notebooks assume data structure shown above; update paths if reorganizing
## Author

- Mary Orrand

For data documentation details, see `docs/NOAA_BUOY_DATA_README.md`. For the full data processing workflow, see `data/DATA_ANALYSIS_WORKFLOW.md`.