Retailers face a fundamental challenge every day: how many units of each product will customers actually buy?
Order too much and you tie up capital in excess inventory that may expire or go unsold. Order too little and you face stockouts, lost revenue, and frustrated customers. Getting this balance right requires accurate demand forecasting — but real-world demand is inherently uncertain.
This project tackles that challenge for 10 item-store combinations across a retail chain, forecasting daily demand for the period September 12–18, 2022 (7 days). The dataset spans January 2019 through September 11, 2022 and includes sales transactions, regular prices, and promotional prices.
Most forecasting models produce a single point estimate — e.g., "we expect to sell 15 units on Monday." While convenient, this approach hides critical information: how confident is the model? Is the true demand likely to be 12–18, or could it plausibly be anywhere from 5 to 40?
Probabilistic forecasting instead produces a full probability distribution over all possible demand values for each day. For example:
```
P(demand = 0)  = 0.02
P(demand = 10) = 0.08
P(demand = 15) = 0.18   ← most likely
P(demand = 20) = 0.12
...
```
This richer output enables smarter decisions:
| Use case | How a distribution helps |
|---|---|
| 📦 Inventory replenishment | Set safety stock at the 90th percentile of demand |
| 💰 Revenue optimisation | Maximise expected profit across the whole distribution |
| ⚠️ Risk management | Quantify the probability of a stockout or overstock event |
| 📈 Promotion planning | Understand the spread of outcomes under a promotional scenario |
The primary evaluation metric in this project is CRPS (Continuous Ranked Probability Score) — a proper scoring rule that rewards calibrated, sharp distributions. Unlike MAE or RMSE, CRPS penalises both overconfident and underconfident forecasts, making it the gold standard for evaluating probabilistic models.
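For a discrete demand PMF, CRPS reduces to a sum of squared differences between the forecast CDF and the step CDF of the observed value. A minimal sketch (illustrative, not this project's exact implementation):

```python
import numpy as np

def crps_discrete(pmf: np.ndarray, observed: int) -> float:
    """CRPS for a forecast PMF over demand values 0..K.

    CRPS = sum_k (F(k) - 1{observed <= k})^2, where F is the forecast CDF.
    """
    cdf = np.cumsum(pmf)                                       # forecast CDF F(k)
    obs_cdf = (np.arange(len(pmf)) >= observed).astype(float)  # step CDF of the outcome
    return float(np.sum((cdf - obs_cdf) ** 2))

# A sharp, well-centred forecast scores lower (better) than a flat one:
sharp = np.array([0.0, 0.1, 0.8, 0.1, 0.0])
flat = np.full(5, 0.2)
assert crps_discrete(sharp, 2) < crps_discrete(flat, 2)
```

Note how both overconfidence (mass piled on the wrong value) and underconfidence (mass spread too thin) increase the score, which is exactly the property a proper scoring rule needs.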
The raw data consists of three CSV files:
| File | Description |
|---|---|
| `sales.csv` | Daily unit sales per item-store combination (Jan 2019 – Sep 11, 2022) |
| `regular_price.csv` | The standard shelf price for each item-store pair over time |
| `promo_price.csv` | The promotional price when a promotion is active (NaN = no promotion) |
Key characteristics of the data:
- 🟢 5 dense combos (items 1, 2, 4): steady daily sales of 10–30 units
- 🔴 5 sparse combos (items 5, 6, 7): 91–99% zero-sales days — demand is rare but real
- 📅 Clear weekly seasonality (weekends peak for most combos)
- 🏷️ Promotions cause 2–5× demand spikes relative to regular sales
- 🚨 One anomalous spike in item1_store1 (~68 units, far above the typical range)
- Python 3.10+

```bash
pip install -r requirements.txt
```

Key packages: pandas, numpy, scikit-learn, statsmodels, lightgbm, xgboost>=2.0, scipy, PyYAML, joblib, matplotlib, seaborn, plotly, holidays, optuna>=3.0

```bash
python run_pipeline.py
```

Runs all steps in sequence: clean → features → train → predict.

```bash
python run_pipeline.py --step clean     # Data loading + cleaning
python run_pipeline.py --step features  # Feature engineering
python run_pipeline.py --step tune      # Hyperparameter tuning (Optuna, opt-in)
python run_pipeline.py --step train     # Walk-forward CV + final model training
python run_pipeline.py --step predict   # Forecast generation
```

Each step reads inputs from intermediate files in `data/`, so steps can be re-run independently after the pipeline has been run at least once end-to-end.

`--step tune` is not included in `--step all`. Run it explicitly before `--step train` to search for better hyperparameters. Results are saved to `models/best_params.yaml` and automatically loaded on the next `--step train`.
```
├── config.yaml                    # All parameters — edit this, not Python files
├── run_pipeline.py                # Entry point
├── requirements.txt
├── data/
│   ├── 01-raw/                    # Original CSVs (never modified)
│   │   ├── sales.csv
│   │   ├── regular_price.csv
│   │   └── promo_price.csv
│   ├── 02-preprocessed/
│   │   └── cleansed_dataset.csv
│   ├── 03-features/
│   │   └── features.csv           # Full feature matrix (all combos, all dates)
│   └── 04-results/
│       ├── forecast_poisson_glm.csv
│       ├── forecast_lgbm_poisson.csv
│       ├── forecast_xgb_quantile.csv
│       ├── cv_results.csv         # Per-fold CV metrics
│       └── cv_summary.csv         # Mean CV metrics per model
├── src/
│   ├── utils.py                   # Config loader, get_feature_cols, param helpers
│   ├── data_loading.py            # Load & merge raw CSVs
│   ├── data_cleaning.py           # STL anomaly detection + treatment
│   ├── feature_engineering.py     # Calendar, lag, rolling, price, event features
│   ├── model_training.py          # Train 3 probabilistic models + walk-forward CV
│   ├── model_evaluation.py        # CRPS, Log Score, MAE/RMSE, diagnostic plots
│   ├── hyperparameter_tuning.py   # Optuna-based hyperparameter search
│   └── prediction.py              # Forecast feature construction + distribution generation
├── models/                        # Saved trained models (.pkl) + best_params.yaml
├── outputs/
│   └── figures/                   # Diagnostic plots (feature importance, intervals)
├── notebooks/
│   ├── Kivos_EDA.ipynb
│   └── Kivos_Data_Cleansing.ipynb
└── reports/
```
Three CSV files are merged into a per-combo dictionary (one DataFrame per item-store pair),
with a complete daily grid for each combination. The EDA notebook (Kivos_EDA.ipynb)
covers:
- Sales distribution by item, store, and time period
- Promotion frequency and average discount depth
- Seasonal patterns and year-over-year trends
- Identification of sparse vs. dense demand combos
- Correlation between promotions and demand spikes
Key findings:
- 5 dense combos (items 1, 2, 4) with daily sales of 10–30 units
- 5 sparse combos (items 5, 6, 7) with 91–99% zero-sales days
- Clear weekly seasonality (weekends peak for most combos)
- Promotions cause demand spikes of 2–5× regular sales
- One anomalous spike in item1_store1 (~68 units, far above the typical range)
Approach: STL (Seasonal-Trend decomposition using LOESS) on the sales time series per combo. Residuals are scored with a z-score; values exceeding the threshold (z > 5.0) are flagged as anomalies and replaced with the STL-expected value (trend + seasonal component), clipped to ≥ 0.
What counts as an outlier: A sales day whose residual (after removing trend and seasonality) is more than 5 standard deviations from the mean, and which cannot be explained by an active promotion, public holiday, or Black Friday. Promotions and holidays are explicitly excluded from flagging since those spikes are genuine demand events.
Special handling for sparse combos (items 5, 6, 7): These combos have 91–99% zero-sales
days. Their rare non-zero sales are genuine demand events, not statistical outliers.
Anomaly treatment is skipped for them (configurable via cleaning.sparse_combo_start_index).
Missing sales values are interpolated linearly and rounded to integers. Prices are forward/back-filled; promo_price NaN → 0 (no promotion).
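A sketch of that gap-filling logic, with column names assumed:

```python
import pandas as pd

def fill_gaps(df: pd.DataFrame) -> pd.DataFrame:
    """Gap handling per the rules above (column names are illustrative)."""
    out = df.copy()
    out["sales"] = out["sales"].interpolate(method="linear").round().astype("Int64")
    out["regular_price"] = out["regular_price"].ffill().bfill()
    out["promo_price"] = out["promo_price"].fillna(0.0)  # NaN means no active promotion
    return out
```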
All features are forecast-safe: they can be computed for Sept 12–18, 2022 using only data available at that time.
| Feature group | Features | Notes |
|---|---|---|
| 📅 Calendar | day_of_week, month, week_of_year, day_of_month | Basic time structure |
| ⏳ Lag sales | sales_lag_7, sales_lag_14, sales_lag_28 | Lags 1–6 excluded (unknown for forecast days 2–7) |
| 📉 Rolling stats | sales_rolling_mean/std for windows 7, 14, 28 | Computed with shift(1) to prevent leakage |
| 🏷️ Price | is_promo, discount_pct, effective_price | From regular_price and promo_price |
| 🔖 Identity | item_id_encoded, store_id_encoded | LabelEncoder; treated as categoricals by tree models |
| 🎉 Events | is_holiday, is_black_friday | Greek public holidays + Orthodox Easter + Black Friday |
| 😷 COVID | is_covid | Binary flag for Mar–Jul 2020 lockdown period |
| 🕐 Temporal | days_since_last_sale, days_since_last_promo | Shift(1) prevents leakage |
| 🔮 Forward-looking | days_before_holiday | Calendar days until next holiday |
| 🏪 Cross-sectional | store_total_sales_lag_7 | Sum of lag-7 sales across all items in the same store |
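The leakage guard in the lag and rolling features (shifting before aggregating) can be sketched for a single combo's daily series; column names here are assumptions:

```python
import pandas as pd

def add_lag_rolling_features(combo: pd.DataFrame) -> pd.DataFrame:
    """Lag and rolling features for one item-store combo, sorted by date.

    shift(1) makes every rolling window end at *yesterday*, so the
    current day's (unknown) sales never leak into its own features.
    """
    out = combo.sort_values("date").copy()
    past = out["sales"].shift(1)
    for w in (7, 14, 28):
        out[f"sales_rolling_mean_{w}"] = past.rolling(w).mean()
        out[f"sales_rolling_std_{w}"] = past.rolling(w).std()
    for lag in (7, 14, 28):
        out[f"sales_lag_{lag}"] = out["sales"].shift(lag)
    return out
```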
Additional features that could improve forecasts (given access to more data):
- 🌤️ Weather data (temperature, rainfall affect foot traffic)
- 🏪 Competitor pricing / promotions
- 🚶 Store-level foot traffic or transaction count
- 🎵 Public events near stores (concerts, sports events)
- 📦 Product category / subcategory hierarchy
- 🛒 Shelf space or display prominence changes
Three probabilistic models were selected — each naturally produces a distribution over discrete demand values without post-hoc approximation:
| Model | Mechanism | Key property |
|---|---|---|
| `poisson_glm` | Statsmodels GLM with Poisson family → predicts λ → Poisson PMF | Interpretable baseline; linear in features |
| `lgbm_poisson` | LightGBM with `objective="poisson"` → predicts λ → Poisson PMF | Non-linear; captures interactions; fast |
| `xgb_quantile` | XGBoost `objective="reg:quantileerror"` (11 quantiles) → piecewise CDF interpolation → PMF | Distribution-free; captures heavy tails |
Item/store encodings are treated as unordered categoricals (native categorical support in LightGBM and XGBoost) rather than ordinal integers.
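The quantile-to-PMF step for `xgb_quantile` might look like this sketch, which assumes the predicted quantile values are non-decreasing (crossing quantiles would need rearrangement first):

```python
import numpy as np

def quantiles_to_pmf(q_levels, q_values, max_k: int) -> np.ndarray:
    """Piecewise-linear CDF through (quantile value, level) points, differenced to a PMF."""
    ks = np.arange(max_k + 1)
    cdf = np.interp(ks, q_values, q_levels, left=0.0, right=1.0)
    cdf[-1] = 1.0                                  # push remaining tail mass into max_k
    return np.diff(np.concatenate(([0.0], cdf)))   # PMF over 0..max_k
```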
Per-combo max_k (PMF truncation) is set to ceil(max_historical_sales × 1.2), ensuring
every historical demand value is representable in the distribution.
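For the two Poisson models, converting a predicted rate λ into such a truncated PMF could look like the sketch below; renormalising is one way to handle the truncated tail mass:

```python
import math

def poisson_pmf(lam: float, max_hist_sales: int) -> list[float]:
    """Poisson PMF over 0..max_k with max_k = ceil(max_hist_sales * 1.2), renormalised."""
    max_k = math.ceil(max_hist_sales * 1.2)
    pmf = [math.exp(-lam) * lam ** k / math.factorial(k) for k in range(max_k + 1)]
    total = sum(pmf)                     # slightly < 1 because of truncation
    return [p / total for p in pmf]
```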
A walk-forward expanding-window cross-validation (3 folds, 7-day validation window each) was used instead of a single holdout split. This gives a more robust and realistic estimate of performance on the forecast window.
Fold 1: train Jan 2019 – Aug 21, 2022 | val Aug 22–28, 2022
Fold 2: train Jan 2019 – Aug 28, 2022 | val Aug 29 – Sep 4, 2022
Fold 3: train Jan 2019 – Sep 4, 2022 | val Sep 5–11, 2022
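The fold boundaries above can be generated by stepping a 7-day validation window back from the last training date; a sketch:

```python
import pandas as pd

def walk_forward_folds(last_date: str, n_folds: int = 3, val_days: int = 7):
    """Expanding-window folds ending at last_date; returns (val_start, val_end) per fold."""
    end = pd.Timestamp(last_date)
    folds = []
    for i in range(n_folds - 1, -1, -1):            # earliest fold first
        val_end = end - pd.Timedelta(days=i * val_days)
        val_start = val_end - pd.Timedelta(days=val_days - 1)
        folds.append((val_start, val_end))          # train = everything before val_start
    return folds
```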
Final models are retrained on the full dataset (Jan 2019 – Sep 11, 2022) to maximise the information available for the forecast.
Hyperparameters were tuned with Optuna (50 trials per model, TPE sampler, minimising
mean CRPS across folds). Best parameters are saved to models/best_params.yaml.
| Metric | What it measures |
|---|---|
| CRPS (Continuous Ranked Probability Score) | Overall distribution accuracy — the primary metric. Lower is better. Proper scoring rule. |
| Log Score | Mean log-likelihood of the true value under the predicted distribution. Higher (less negative) is better. |
| MAE | Mean absolute error of the distribution mean as a point forecast |
| RMSE | Root mean squared error of the distribution mean |
| Model | CRPS | Log Score | MAE | RMSE | CRPS Skill |
|---|---|---|---|---|---|
| lgbm_poisson | 0.5870 | -1.1287 | 0.8656 | 1.4702 | 38.33% |
| xgb_quantile | 0.5875 | -1.1115 | 0.9925 | 1.5399 | 38.27% |
| poisson_glm | 0.6452 | -1.2429 | 0.9762 | 1.5636 | 32.21% |
| empirical_baseline | 0.9518 | -1.2896 | 1.6097 | 2.6157 | 0.00% |
CRPS Skill = (baseline_crps − model_crps) / baseline_crps × 100. All three models
beat the empirical baseline by ~32–38%, confirming they are learning meaningful patterns.
lgbm_poisson and xgb_quantile are essentially tied on CRPS (~38.3% skill). lgbm_poisson
edges ahead on CRPS, MAE, and RMSE; xgb_quantile has a slightly better log score. poisson_glm is the
weakest on all metrics, as expected for a linear model, but provides interpretable coefficients.
One CSV per model in data/04-results/, each with 70 rows (10 combos × 7 days):
```
date,item,store,prediction
20220912,item1,store1,"{0: 0.02, 1: 0.05, 2: 0.10, 3: 0.18, ...}"
...
```
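Since the `prediction` column stores a stringified dict, downstream code has to parse it before use. A sketch of loading a forecast file and deriving a mean and a 90th-percentile stocking level (derived column names are illustrative):

```python
import ast
import pandas as pd

def load_forecast(path: str) -> pd.DataFrame:
    """Parse the stringified PMF and derive point summaries from it."""
    df = pd.read_csv(path, dtype={"date": str})
    df["pmf"] = df["prediction"].apply(ast.literal_eval)       # "{0: 0.02, ...}" -> dict
    df["mean"] = df["pmf"].apply(lambda p: sum(k * v for k, v in p.items()))

    def percentile(p: dict, q: float) -> int:
        cum = 0.0
        for k in sorted(p):                                    # walk the CDF upward
            cum += p[k]
            if cum >= q:
                return k
        return max(p)

    df["p90"] = df["pmf"].apply(lambda p: percentile(p, 0.9))  # safety-stock level
    return df
```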
All parameters live in config.yaml. Key settings:
| Key | Default | Description |
|---|---|---|
| `cleaning.stl_period` | 28 | STL decomposition period (days) |
| `cleaning.anomaly_method` | zscore | Outlier detection method |
| `cleaning.anomaly_threshold` | 5.0 | Z-score cutoff |
| `cleaning.sparse_combo_start_index` | 5 | Combos at/above this index skip anomaly treatment |
| `features.lag_days` | [7, 14, 28] | Lag offsets in days |
| `features.rolling_windows` | [7, 14, 28] | Rolling stat window sizes |
| `features.categorical_features` | [item_id_encoded, store_id_encoded] | Treated as categoricals by tree models |
| `split.n_folds` | 3 | Number of CV folds |
| `split.val_days` | 7 | Validation window size per fold |
| `models.xgb.quantile_levels` | [0.05, …, 0.95] | Quantile levels for XGBoost |
| `tuning.n_trials` | 50 | Optuna trials per model |
| `forecast.start_date` / `forecast.end_date` | 2022-09-12 / 2022-09-18 | Forecast window |