Retailers face a fundamental challenge every day: how many units of each product will customers actually buy?
Order too much and you tie up capital in excess inventory that may expire or go unsold. Order too little and you face stockouts, lost revenue, and frustrated customers. Getting this balance right requires accurate demand forecasting — but real-world demand is inherently uncertain.
This project tackles that challenge for 10 item-store combinations across a retail chain, forecasting daily demand for the period September 12–18, 2022 (7 days). The dataset spans January 2019 through September 11, 2022 and includes sales transactions, regular prices, and promotional prices.
Most forecasting models produce a single point estimate — e.g., "we expect to sell 15 units on Monday." While convenient, this approach hides critical information: how confident is the model? Is the true demand likely to be 12–18, or could it plausibly be anywhere from 5 to 40?
Probabilistic forecasting instead produces a full probability distribution over all possible demand values for each day. For example:
```
P(demand = 0)  = 0.02
P(demand = 10) = 0.08
P(demand = 15) = 0.18   ← most likely
P(demand = 20) = 0.12
...
```
This richer output enables smarter decisions:
| Use case | How a distribution helps |
|---|---|
| 📦 Inventory replenishment | Set safety stock at the 90th percentile of demand |
| 💰 Revenue optimisation | Maximise expected profit across the whole distribution |
| ⚠️ Risk management | Quantify the probability of a stockout or overstock event |
| 📈 Promotion planning | Understand the spread of outcomes under a promotional scenario |
The primary evaluation metric in this project is CRPS (Continuous Ranked Probability Score) — a proper scoring rule that rewards calibrated, sharp distributions. Unlike MAE or RMSE, CRPS penalises both overconfident and underconfident forecasts, making it the gold standard for evaluating probabilistic models.
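For a discrete demand PMF, CRPS reduces to a sum of squared differences between the forecast CDF and the step CDF of the observed value. A minimal sketch (illustrative, not this project's exact implementation):

```python
import numpy as np

def crps_discrete(pmf: np.ndarray, observed: int) -> float:
    """CRPS for a forecast PMF over demand values 0..K.

    CRPS = sum_k (F(k) - 1{observed <= k})^2, where F is the forecast CDF.
    """
    cdf = np.cumsum(pmf)                                       # forecast CDF F(k)
    obs_cdf = (np.arange(len(pmf)) >= observed).astype(float)  # step CDF of the outcome
    return float(np.sum((cdf - obs_cdf) ** 2))

# A sharp, well-centred forecast scores lower (better) than a flat one:
sharp = np.array([0.0, 0.1, 0.8, 0.1, 0.0])
flat = np.full(5, 0.2)
assert crps_discrete(sharp, 2) < crps_discrete(flat, 2)
```

Note how both overconfidence (mass piled on the wrong value) and underconfidence (mass spread too thin) increase the score, which is exactly the property a proper scoring rule needs.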
The raw data consists of three CSV files:
| File | Description |
|---|---|
| `sales.csv` | Daily unit sales per item-store combination (Jan 2019 – Sep 11, 2022) |
| `regular_price.csv` | The standard shelf price for each item-store pair over time |
| `promo_price.csv` | The promotional price when a promotion is active (NaN = no promotion) |
Key characteristics of the data:
- 🟢 5 dense combos (items 1, 2, 4): steady daily sales of 10–30 units
- 🔴 5 sparse combos (items 5, 6, 7): 91–99% zero-sales days — demand is rare but real
- 📅 Clear weekly seasonality (weekends peak for most combos)
- 🏷️ Promotions cause 2–5× demand spikes relative to regular sales
- 🚨 One anomalous spike in item1_store1 (~68 units, far above the typical range)
- Python 3.10+

```bash
pip install -r requirements.txt
```

Key packages: pandas, numpy, scikit-learn, statsmodels, lightgbm, xgboost>=2.0, scipy, PyYAML, joblib, matplotlib, seaborn, plotly, holidays, optuna>=3.0

```bash
python run_pipeline.py
```

Runs all steps in sequence: clean → features → train → predict.

```bash
python run_pipeline.py --step clean     # Data loading + cleaning
python run_pipeline.py --step features  # Feature engineering
python run_pipeline.py --step tune      # Hyperparameter tuning (Optuna, opt-in)
python run_pipeline.py --step train     # Walk-forward CV + final model training
python run_pipeline.py --step predict   # Forecast generation
```

Each step reads inputs from intermediate files in `data/`, so steps can be re-run independently after the pipeline has been run at least once end-to-end.

`--step tune` is not included in `--step all`. Run it explicitly before `--step train` to search for better hyperparameters. Results are saved to `models/best_params.yaml` and automatically loaded on the next `--step train`.
```
├── config.yaml                    # All parameters — edit this, not Python files
├── run_pipeline.py                # Entry point
├── requirements.txt
├── data/
│   ├── 01-raw/                    # Original CSVs (never modified)
│   │   ├── sales.csv
│   │   ├── regular_price.csv
│   │   └── promo_price.csv
│   ├── 02-preprocessed/
│   │   └── cleansed_dataset.csv
│   ├── 03-features/
│   │   └── features.csv           # Full feature matrix (all combos, all dates)
│   └── 04-results/
│       ├── forecast_poisson_glm.csv
│       ├── forecast_lgbm_poisson.csv
│       ├── forecast_xgb_quantile.csv
│       ├── cv_results.csv         # Per-fold CV metrics
│       └── cv_summary.csv         # Mean CV metrics per model
├── src/
│   ├── utils.py                   # Config loader, get_feature_cols, param helpers
│   ├── data_loading.py            # Load & merge raw CSVs
│   ├── data_cleaning.py           # STL anomaly detection + treatment
│   ├── feature_engineering.py     # Calendar, lag, rolling, price, event features
│   ├── model_training.py          # Train 3 probabilistic models + walk-forward CV
│   ├── model_evaluation.py        # CRPS, Log Score, MAE/RMSE, diagnostic plots
│   ├── hyperparameter_tuning.py   # Optuna-based hyperparameter search
│   └── prediction.py              # Forecast feature construction + distribution generation
├── models/                        # Saved trained models (.pkl) + best_params.yaml
├── outputs/
│   └── figures/                   # Diagnostic plots (feature importance, intervals)
├── notebooks/
│   ├── Kivos_EDA.ipynb
│   └── Kivos_Data_Cleansing.ipynb
└── reports/
```
Three CSV files are merged into a per-combo dictionary (one DataFrame per item-store pair),
with a complete daily grid for each combination. The EDA notebook (Kivos_EDA.ipynb)
covers:
- Sales distribution by item, store, and time period
- Promotion frequency and average discount depth
- Seasonal patterns and year-over-year trends
- Identification of sparse vs. dense demand combos
- Correlation between promotions and demand spikes
Key findings:
- 5 dense combos (items 1, 2, 4) with daily sales of 10–30 units
- 5 sparse combos (items 5, 6, 7) with 91–99% zero-sales days
- Clear weekly seasonality (weekends peak for most combos)
- Promotions cause demand spikes of 2–5× regular sales
- One anomalous spike in item1_store1 (~68 units, far above the typical range)
Approach: STL (Seasonal-Trend decomposition using LOESS) on the sales time series per combo. Residuals are scored with a z-score; values exceeding the threshold (z > 5.0) are flagged as anomalies and replaced with the STL-expected value (trend + seasonal component), clipped to ≥ 0.
What counts as an outlier: A sales day whose residual (after removing trend and seasonality) is more than 5 standard deviations from the mean, and which cannot be explained by an active promotion, public holiday, or Black Friday. Promotions and holidays are explicitly excluded from flagging since those spikes are genuine demand events.
Special handling for sparse combos (items 5, 6, 7): These combos have 91–99% zero-sales
days. Their rare non-zero sales are genuine demand events, not statistical outliers.
Anomaly treatment is skipped for them (configurable via cleaning.sparse_combo_start_index).
Missing sales values are interpolated linearly and rounded to integers. Prices are forward/back-filled; promo_price NaN → 0 (no promotion).
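A sketch of that gap-filling logic, with column names assumed:

```python
import pandas as pd

def fill_gaps(df: pd.DataFrame) -> pd.DataFrame:
    """Gap handling per the rules above (column names are illustrative)."""
    out = df.copy()
    out["sales"] = out["sales"].interpolate(method="linear").round().astype("Int64")
    out["regular_price"] = out["regular_price"].ffill().bfill()
    out["promo_price"] = out["promo_price"].fillna(0.0)  # NaN means no active promotion
    return out
```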
All features are forecast-safe: they can be computed for Sept 12–18, 2022 using only data available at that time.
| Feature group | Features | Notes |
|---|---|---|
| 📅 Calendar | day_of_week, month, week_of_year, day_of_month | Basic time structure |
| ⏳ Lag sales | sales_lag_7, sales_lag_14, sales_lag_28 | Lags 1–6 excluded (unknown for forecast days 2–7) |
| 📉 Rolling stats | sales_rolling_mean/std for windows 7, 14, 28 | Computed with shift(1) to prevent leakage |
| 🏷️ Price | is_promo, discount_pct, effective_price | From regular_price and promo_price |
| 🔖 Identity | item_id_encoded, store_id_encoded | LabelEncoder; treated as categoricals by tree models |
| 🎉 Events | is_holiday, is_black_friday | Greek public holidays + Orthodox Easter + Black Friday |
| 😷 COVID | is_covid | Binary flag for Mar–Jul 2020 lockdown period |
| 🕐 Temporal | days_since_last_sale, days_since_last_promo | Shift(1) prevents leakage |
| 🔮 Forward-looking | days_before_holiday | Calendar days until next holiday |
| 🏪 Cross-sectional | store_total_sales_lag_7 | Sum of lag-7 sales across all items in the same store |
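The leakage guard in the lag and rolling features (shifting before aggregating) can be sketched for a single combo's daily series; column names here are assumptions:

```python
import pandas as pd

def add_lag_rolling_features(combo: pd.DataFrame) -> pd.DataFrame:
    """Lag and rolling features for one item-store combo, sorted by date.

    shift(1) makes every rolling window end at *yesterday*, so the
    current day's (unknown) sales never leak into its own features.
    """
    out = combo.sort_values("date").copy()
    past = out["sales"].shift(1)
    for w in (7, 14, 28):
        out[f"sales_rolling_mean_{w}"] = past.rolling(w).mean()
        out[f"sales_rolling_std_{w}"] = past.rolling(w).std()
    for lag in (7, 14, 28):
        out[f"sales_lag_{lag}"] = out["sales"].shift(lag)
    return out
```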
Additional features that could improve forecasts (given access to more data):
- 🌤️ Weather data (temperature, rainfall affect foot traffic)
- 🏪 Competitor pricing / promotions
- 🚶 Store-level foot traffic or transaction count
- 🎵 Public events near stores (concerts, sports events)
- 📦 Product category / subcategory hierarchy
- 🛒 Shelf space or display prominence changes
Three probabilistic models were selected — each naturally produces a distribution over discrete demand values without post-hoc approximation:
| Model | Mechanism | Key property |
|---|---|---|
| `poisson_glm` | Statsmodels GLM with Poisson family → predicts λ → Poisson PMF | Interpretable baseline; linear in features |
| `lgbm_poisson` | LightGBM with `objective="poisson"` → predicts λ → Poisson PMF | Non-linear; captures interactions; fast |
| `xgb_quantile` | XGBoost `objective="reg:quantileerror"` (11 quantiles) → piecewise CDF interpolation → PMF | Distribution-free; captures heavy tails |
Item/store encodings are treated as unordered categoricals (native categorical support in LightGBM and XGBoost) rather than ordinal integers.
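The quantile-to-PMF step for `xgb_quantile` might look like this sketch, which assumes the predicted quantile values are non-decreasing (crossing quantiles would need rearrangement first):

```python
import numpy as np

def quantiles_to_pmf(q_levels, q_values, max_k: int) -> np.ndarray:
    """Piecewise-linear CDF through (quantile value, level) points, differenced to a PMF."""
    ks = np.arange(max_k + 1)
    cdf = np.interp(ks, q_values, q_levels, left=0.0, right=1.0)
    cdf[-1] = 1.0                                  # push remaining tail mass into max_k
    return np.diff(np.concatenate(([0.0], cdf)))   # PMF over 0..max_k
```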
Per-combo max_k (PMF truncation) is set to ceil(max_historical_sales × 1.2), ensuring
every historical demand value is representable in the distribution.
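For the two Poisson models, converting a predicted rate λ into such a truncated PMF could look like the sketch below; renormalising is one way to handle the truncated tail mass:

```python
import math

def poisson_pmf(lam: float, max_hist_sales: int) -> list[float]:
    """Poisson PMF over 0..max_k with max_k = ceil(max_hist_sales * 1.2), renormalised."""
    max_k = math.ceil(max_hist_sales * 1.2)
    pmf = [math.exp(-lam) * lam ** k / math.factorial(k) for k in range(max_k + 1)]
    total = sum(pmf)                     # slightly < 1 because of truncation
    return [p / total for p in pmf]
```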
A walk-forward expanding-window cross-validation (3 folds, 7-day validation window each) was used instead of a single holdout split. This gives a more robust and realistic estimate of performance on the forecast window.
Fold 1: train Jan 2019 – Aug 21, 2022 | val Aug 22–28, 2022
Fold 2: train Jan 2019 – Aug 28, 2022 | val Aug 29 – Sep 4, 2022
Fold 3: train Jan 2019 – Sep 4, 2022 | val Sep 5–11, 2022
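The fold boundaries above can be generated by stepping a 7-day validation window back from the last training date; a sketch:

```python
import pandas as pd

def walk_forward_folds(last_date: str, n_folds: int = 3, val_days: int = 7):
    """Expanding-window folds ending at last_date; returns (val_start, val_end) per fold."""
    end = pd.Timestamp(last_date)
    folds = []
    for i in range(n_folds - 1, -1, -1):            # earliest fold first
        val_end = end - pd.Timedelta(days=i * val_days)
        val_start = val_end - pd.Timedelta(days=val_days - 1)
        folds.append((val_start, val_end))          # train = everything before val_start
    return folds
```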
Final models are retrained on the full dataset (Jan 2019 – Sep 11, 2022) to maximise the information available for the forecast.
Hyperparameters were tuned with Optuna (50 trials per model, TPE sampler, minimising
mean CRPS across folds). Best parameters are saved to models/best_params.yaml.
| Metric | What it measures |
|---|---|
| CRPS (Continuous Ranked Probability Score) | Overall distribution accuracy — the primary metric. Lower is better. Proper scoring rule. |
| Log Score | Mean log-likelihood of the true value under the predicted distribution. Higher (less negative) is better. |
| MAE | Mean absolute error of the distribution mean as a point forecast |
| RMSE | Root mean squared error of the distribution mean |
| Model | CRPS | Log Score | MAE | RMSE | CRPS Skill |
|---|---|---|---|---|---|
| lgbm_poisson | 0.5870 | -1.1287 | 0.8656 | 1.4702 | 38.33% |
| xgb_quantile | 0.5875 | -1.1115 | 0.9925 | 1.5399 | 38.27% |
| poisson_glm | 0.6452 | -1.2429 | 0.9762 | 1.5636 | 32.21% |
| empirical_baseline | 0.9518 | -1.2896 | 1.6097 | 2.6157 | 0.00% |
CRPS Skill = (baseline_crps − model_crps) / baseline_crps × 100. All three models
beat the empirical baseline by ~32–38%, confirming they are learning meaningful patterns.
lgbm_poisson and xgb_quantile are essentially tied on CRPS (~38.3% skill). lgbm_poisson
edges ahead on CRPS, MAE, and RMSE; xgb_quantile has a slightly better log score. poisson_glm is the
weakest on all metrics, as expected for a linear model, but provides interpretable coefficients.
One CSV per model in data/04-results/, each with 70 rows (10 combos × 7 days):
```
date,item,store,prediction
20220912,item1,store1,"{0: 0.02, 1: 0.05, 2: 0.10, 3: 0.18, ...}"
...
```
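Since the `prediction` column stores a stringified dict, downstream code has to parse it before use. A sketch of loading a forecast file and deriving a mean and a 90th-percentile stocking level (derived column names are illustrative):

```python
import ast
import pandas as pd

def load_forecast(path: str) -> pd.DataFrame:
    """Parse the stringified PMF and derive point summaries from it."""
    df = pd.read_csv(path, dtype={"date": str})
    df["pmf"] = df["prediction"].apply(ast.literal_eval)       # "{0: 0.02, ...}" -> dict
    df["mean"] = df["pmf"].apply(lambda p: sum(k * v for k, v in p.items()))

    def percentile(p: dict, q: float) -> int:
        cum = 0.0
        for k in sorted(p):                                    # walk the CDF upward
            cum += p[k]
            if cum >= q:
                return k
        return max(p)

    df["p90"] = df["pmf"].apply(lambda p: percentile(p, 0.9))  # safety-stock level
    return df
```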
All parameters live in config.yaml. Key settings:
| Key | Default | Description |
|---|---|---|
| `cleaning.stl_period` | 28 | STL decomposition period (days) |
| `cleaning.anomaly_method` | zscore | Outlier detection method |
| `cleaning.anomaly_threshold` | 5.0 | Z-score cutoff |
| `cleaning.sparse_combo_start_index` | 5 | Combos at/above this index skip anomaly treatment |
| `features.lag_days` | [7, 14, 28] | Lag offsets in days |
| `features.rolling_windows` | [7, 14, 28] | Rolling stat window sizes |
| `features.categorical_features` | [item_id_encoded, store_id_encoded] | Treated as categoricals by tree models |
| `split.n_folds` | 3 | Number of CV folds |
| `split.val_days` | 7 | Validation window size per fold |
| `models.xgb.quantile_levels` | [0.05, …, 0.95] | Quantile levels for XGBoost |
| `tuning.n_trials` | 50 | Optuna trials per model |
| `forecast.start_date` / `forecast.end_date` | 2022-09-12 / 2022-09-18 | Forecast window |