Summary
Establish a reliable data access and storage pipeline for ERA5 reanalysis data and NOAA climate indices. This phase creates the foundation for all subsequent analysis.
Parent Issue: #1
Objectives
- Download and cache ERA5 sea level pressure and geopotential height data
- Load and parse NOAA climate indices (NAO, AO, ONI, PDO)
- Implement preprocessing (anomaly computation, regridding)
- Create comprehensive test suite
System Context
data/
├── raw/            # Downloaded NetCDF files
│   └── era5/
├── processed/      # Anomalies, regridded data
└── external/       # NOAA indices, EM-DAT
Files to Create/Modify
| File | Action | Description |
|------|--------|-------------|
| src/data/download.py | Create | ERA5Downloader class with CDS API |
| src/data/loaders.py | Modify | Add ERA5 loader, EM-DAT loader |
| src/data/preprocessing.py | Create | AnomalyCalculator, Regridder classes |
| tests/test_data.py | Modify | Add download and preprocessing tests |
Implementation Checklist
CDS API Setup
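ERA5 Downloader
- ERA5Downloader class
NOAA Index Loader (Partially Complete)
- NOAAIndexLoader class
Preprocessing
- AnomalyCalculator (remove climatological mean)
- Regridder for resolution standardization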
Testing
Code Snippets
ERA5 Download Request
# src/data/download.py
def _build_request(self, variable: str, year: int, month: int) -> dict:
    """Build CDS API request for ERA5 monthly means."""
    return {
        "product_type": "monthly_averaged_reanalysis",
        "variable": variable,
        "year": str(year),
        "month": f"{month:02d}",  # zero-padded, e.g. "01"
        "time": "00:00",
        "format": "netcdf",
    }
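One way the request builder could feed a cached download loop is sketched below. The per-month file naming, the `era5` subdirectory under the raw data root, and the helper names `cache_path` and `plan_downloads` are assumptions for illustration, not part of the planned module. The actual CDS call is left out.

```python
from pathlib import Path


def build_request(variable: str, year: int, month: int) -> dict:
    """Standalone copy of the request builder shown above."""
    return {
        "product_type": "monthly_averaged_reanalysis",
        "variable": variable,
        "year": str(year),
        "month": f"{month:02d}",
        "time": "00:00",
        "format": "netcdf",
    }


def cache_path(root: Path, variable: str, year: int, month: int) -> Path:
    """Hypothetical naming scheme: one NetCDF file per variable and month."""
    return root / "era5" / f"{variable}_{year}_{month:02d}.nc"


def plan_downloads(root: Path, variable: str, years: range) -> list[tuple[dict, Path]]:
    """Return (request, target) pairs for months that are not cached yet."""
    plan = []
    for year in years:
        for month in range(1, 13):
            target = cache_path(root, variable, year, month)
            if target.exists():
                continue  # already on disk, no request needed
            plan.append((build_request(variable, year, month), target))
    return plan
```

Each `(request, target)` pair could then be handed to the CDS client, for example through `cdsapi.Client().retrieve(dataset_name, request, str(target))`, where the dataset name depends on which ERA5 product is used.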
Anomaly Calculation
# src/data/preprocessing.py
import xarray as xr


def compute_anomalies(data: xr.DataArray) -> xr.DataArray:
    """Remove monthly climatology from data.

    Args:
        data: DataArray with time dimension

    Returns:
        Anomalies (deviations from monthly mean)
    """
    # Mean of each calendar month across all years
    climatology = data.groupby("time.month").mean("time")
    # Subtract the matching month's mean from every time step
    anomalies = data.groupby("time.month") - climatology
    return anomalies
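To make the arithmetic concrete, here is the same group-by-month subtraction on a toy series in plain Python, without xarray. The function name `monthly_anomalies`, the input layout (a list of `(month, value)` pairs), and the pressure values are made up for illustration.

```python
from collections import defaultdict


def monthly_anomalies(samples: list[tuple[int, float]]) -> list[float]:
    """Subtract each calendar month's mean from the values in that month."""
    by_month = defaultdict(list)
    for month, value in samples:
        by_month[month].append(value)
    # Climatology: one mean per calendar month
    climatology = {m: sum(v) / len(v) for m, v in by_month.items()}
    return [value - climatology[month] for month, value in samples]


# Two Januaries and two Julys of a made-up pressure series (hPa).
series = [(1, 1010.0), (7, 1020.0), (1, 1014.0), (7, 1016.0)]
print(monthly_anomalies(series))  # [-2.0, 2.0, 2.0, -2.0]
```

The anomalies within each calendar month sum to zero, which is a convenient property to assert in `tests/test_data.py`.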
Verification
# Test ERA5 download
python -m src.data.download --variable msl --year 2020 --month 1 --dry-run
# Verify NOAA loader
python -c "from src.data.loaders import NOAAIndexLoader; print(NOAAIndexLoader().load_index('NAO').head())"
# Run tests
pytest tests/test_data.py -v
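The first command assumes `src/data/download.py` exposes a command-line entry point with a `--dry-run` flag. A minimal sketch of how that could look with `argparse` follows. The function names and the printed message are hypothetical, and the real download is left as a stub.

```python
import argparse


def parse_args(argv=None) -> argparse.Namespace:
    """Parse the flags used in the verification command."""
    parser = argparse.ArgumentParser(description="Download ERA5 monthly means")
    parser.add_argument("--variable", required=True)
    parser.add_argument("--year", type=int, required=True)
    parser.add_argument("--month", type=int, required=True, choices=range(1, 13))
    parser.add_argument("--dry-run", action="store_true",
                        help="print the request without contacting the CDS API")
    return parser.parse_args(argv)


def main(argv=None) -> int:
    args = parse_args(argv)
    request = {
        "variable": args.variable,
        "year": str(args.year),
        "month": f"{args.month:02d}",
    }
    if args.dry_run:
        # Show what would be sent, then exit successfully
        print(f"[dry-run] would request: {request}")
        return 0
    raise NotImplementedError("real download omitted in this sketch")
```

Calling `main(["--variable", "msl", "--year", "2020", "--month", "1", "--dry-run"])` prints the request and returns 0 without any network access, which keeps the check usable in CI.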
Technical Challenges
| Challenge | Mitigation |
|-----------|------------|
| ERA5 downloads slow | Start with NCEP (smaller), use dask for lazy loading |
| CDS rate limits | Implement exponential backoff, queue requests |
| NetCDF memory issues | Use dask chunking from start |
| Network failures | Checkpointing, automatic retry |
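The backoff and retry mitigations could be sketched as a small wrapper like the one below. The name `retry_with_backoff`, its default values, and the choice of exceptions to retry on are assumptions; which exceptions the CDS client actually raises on rate limiting or network errors would need to be checked.

```python
import random
import time


def retry_with_backoff(func, max_attempts=5, base_delay=1.0, max_delay=60.0,
                       retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call func(), retrying on the given exceptions with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the last error
            # Delay doubles each attempt: base, 2*base, 4*base, ... capped at max_delay
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Up to 10% jitter so parallel workers do not retry in lockstep
            sleep(delay + random.uniform(0, delay * 0.1))
```

Passing `sleep` as a parameter lets tests replace it with a recorder, so the retry logic can be verified without waiting.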
Definition of Done
pytest tests/test_data.py