This repository contains my feature engineering and modeling workflow for the EY challenge focused on forecasting river water quality in South Africa.
My final score ($R^2$) is 0.4079.
The challenge goal was to predict three target variables that describe water quality:
- total alkalinity,
- electrical conductance (salinity proxy),
- dissolved reactive phosphorus.
The core training data spans 2011-2015 and contains river sampling records with latitude, longitude, sample date, and observed targets from roughly 200 locations in South Africa. The validation set includes coordinates and dates from different regions, so strong solutions must generalize spatially, not just memorize sites.
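To make the spatial generalization requirement concrete, here is a minimal sketch of a site-level holdout split together with a plain $R^2$ implementation. It is not taken from the repository: the field names `lat`, `lon`, and `alkalinity`, the rounding precision used to identify a site, and the 25% holdout fraction are all assumptions chosen for illustration.

```python
import random
from collections import defaultdict


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def site_holdout_split(rows, holdout_fraction=0.25, seed=0):
    """Split row indices so each sampling site falls entirely on one side."""
    by_site = defaultdict(list)
    for i, row in enumerate(rows):
        # Assumption: coordinates rounded to 4 decimals identify one site.
        by_site[(round(row["lat"], 4), round(row["lon"], 4))].append(i)

    sites = sorted(by_site)
    random.Random(seed).shuffle(sites)
    n_holdout = max(1, int(len(sites) * holdout_fraction))

    holdout = [i for s in sites[:n_holdout] for i in by_site[s]]
    train = [i for s in sites[n_holdout:] for i in by_site[s]]
    return train, holdout


# Synthetic example: 8 hypothetical sites with 3 samples each.
rows = [
    {"lat": -25.0 - s * 0.1, "lon": 28.0 + s * 0.1, "alkalinity": 100.0 + s + k}
    for s in range(8)
    for k in range(3)
]
train_idx, holdout_idx = site_holdout_split(rows)
```

Scoring a model on `holdout_idx` only measures performance at sites it never saw during training, which is closer to the validation setting than a random row-level split.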
Beyond accuracy, the challenge asks participants to identify which environmental factors drive water quality variation. Final ranking is based on average
Official page: 2026 Optimizing Clean Water Supply
I combined geospatial feature engineering with a practical modeling workflow to improve water quality prediction and spatial generalization across South African river regions.
At a high level, the pipeline extracts:
- Surface/spectral information from Landsat imagery (visible, NIR, SWIR, thermal, and derived water indices).
- Topographic context from Copernicus DEM (elevation, slope, aspect).
- Monthly hydro-climate context from TerraClimate (precipitation, evapotranspiration, drought/soil moisture, temperature and related variables).
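As an illustration of what "derived water indices" can look like, the sketch below computes two common normalized-difference indices from Landsat band values: NDWI (green vs. NIR) and MNDWI (green vs. SWIR1). Which indices the extraction scripts actually compute is not stated here, so treat the choice of indices, the function names, and the assumption that inputs are already scaled surface reflectance as hypothetical.

```python
def normalized_difference(a, b):
    """Return (a - b) / (a + b), or None when the denominator is zero."""
    denom = a + b
    return None if denom == 0 else (a - b) / denom


def water_indices(green, nir, swir1):
    """Compute NDWI and MNDWI from reflectance values of a single pixel.

    Assumption: inputs are surface reflectance on a common scale,
    not raw digital numbers.
    """
    return {
        "ndwi": normalized_difference(green, nir),     # (Green - NIR) / (Green + NIR)
        "mndwi": normalized_difference(green, swir1),  # (Green - SWIR1) / (Green + SWIR1)
    }


# Hypothetical reflectance values for a water-dominated pixel.
indices = water_indices(green=0.10, nir=0.05, swir1=0.02)
```

Both indices tend to be positive over open water and negative over vegetation or bare soil, which is why they are useful as compact summaries of the raw bands.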
It then adds temporal and derived features, organizes outputs into reproducible datasets
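One possible form of the temporal features is a cyclical encoding of the sample date, so that late December and early January end up close together in feature space. This is a sketch under assumptions: the function name, the output keys, and the use of day of year with a 365.25-day period are illustrative and not confirmed by the repository.

```python
import math
from datetime import date


def temporal_features(sample_date):
    """Derive calendar and cyclical seasonality features from a sample date."""
    day_of_year = sample_date.timetuple().tm_yday
    # Map the day of year onto a circle; 365.25 is an assumed period length.
    angle = 2.0 * math.pi * (day_of_year - 1) / 365.25
    return {
        "year": sample_date.year,
        "month": sample_date.month,
        "day_of_year": day_of_year,
        "doy_sin": math.sin(angle),
        "doy_cos": math.cos(angle),
    }


# Hypothetical sample dates on either side of a year boundary.
new_year = temporal_features(date(2013, 1, 1))
year_end = temporal_features(date(2013, 12, 31))
```

The `month` value can also serve as a join key for monthly TerraClimate variables, while the sine and cosine pair gives tree and linear models a continuous view of seasonality.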
EY_data_challenge/
├── README.md
├── requirements.txt
├── data/
│   ├── extraction/
│   │   ├── copernicusDEM_data_extraction.py
│   │   ├── landsat_data_extraction.py
│   │   └── terraclimate_data_extraction.py
│   ├── preprocessing/
│   │   ├── preprocess_landsat.py
│   │   └── preprocess_terraclimate.py
│   └── old/
│       └── EDA.ipynb
└── model/
    ├── model.ipynb
    └── tries.md