End-to-end retail analytics project on the Sample Superstore dataset (9,994 transactions, 2014–2017): EDA → statistical tests → K-Means segmentation → per-cluster regression (OLS + Ridge with K-Fold CV). Built with the PACE framework.
| Cluster | Label | Avg Profit | Profile |
|---|---|---|---|
| C0 | Occasional Clients | +$290 | Low volume, low value |
| C1 | VIP Clients ⭐ | +$5,912 | High volume, high profit |
| C2 | 🚨 At-Risk Clients | −$212 LOSS | Over-discounted, erratic |
| C3 | Loyal Clients | +$1,323 | Regular and profitable |
| Cluster | Model | R² test | vs Global OLS (R²=0.34) |
|---|---|---|---|
| C0 — Occasional | OLS | 0.57 | +0.23 |
| C1 — VIP | OLS | 0.99 | +0.65 |
| C2 — At-Risk | Ridge + K-Fold CV | 0.32 | −0.02 (but +1.27 vs the initial per-cluster OLS, R² = −0.96) |
| C3 — Loyal | OLS | 0.80 | +0.46 |
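Per-cluster modeling beats the global model on three of the four segments (C0, C1, C3). C2 ends roughly level with the global model (0.32 vs 0.34), but only after Ridge lifts it from the negative R² of its initial OLS fit.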
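The case for per-cluster models can be illustrated with a minimal sketch on synthetic data. The two segments, the single feature and the slopes below are invented for illustration and are not taken from the Superstore notebook; the only point is that segments with opposite profit dynamics cancel each other out in one global line.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 200

# Hypothetical segments: profit rises with the feature in one, falls in the other.
x = rng.uniform(0, 1, size=2 * n)
segment = np.repeat([0, 1], n)
slope = np.where(segment == 0, 100.0, -100.0)
y = slope * x + rng.normal(0, 5, size=2 * n)

X = x.reshape(-1, 1)
X_tr, X_te, y_tr, y_te, s_tr, s_te = train_test_split(
    X, y, segment, test_size=0.25, random_state=0
)

# One model for everyone.
global_model = LinearRegression().fit(X_tr, y_tr)
global_r2 = r2_score(y_te, global_model.predict(X_te))

# One model per segment, scored on that segment's held-out rows.
segment_r2 = {}
for s in (0, 1):
    model = LinearRegression().fit(X_tr[s_tr == s], y_tr[s_tr == s])
    segment_r2[s] = r2_score(y_te[s_te == s], model.predict(X_te[s_te == s]))

print(f"global R2: {global_r2:.2f}")
print({s: round(r2, 2) for s, r2 in segment_r2.items()})
```

In this toy setup the global fit explains almost nothing while each segment model fits well. Real clusters are rarely this clean, which is why the table above still needs a held-out R² per cluster.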
🎯 Cap Furniture discounts at 20% → quantified opportunity: ~$23,318 recoverable profit (317 loss orders × $73.56 average margin gap, validated by Chi² with Cramér's V).
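The arithmetic behind the headline figure can be checked in a few lines. The $73.56 gap coincides with the difference between the two group means reported for the Welch T-Test (+$66.90 without discount, −$6.66 with discount); reading it as that difference is an assumption of this sketch, not something the notebook output confirms here.

```python
# Figures quoted in this README.
profit_no_discount = 66.90     # mean profit per order, no discount
profit_with_discount = -6.66   # mean profit per order, with discount
excess_loss_orders = 317       # Furniture loss orders above the Chi² expectation

# Assumed definition of the "average margin gap".
margin_gap = round(profit_no_discount - profit_with_discount, 2)
recoverable = excess_loss_orders * margin_gap

print(f"{margin_gap:.2f} {recoverable:,.2f}")  # 73.56 23,318.52
```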
Three complementary tests confirm EDA findings on a rigorous basis:
| Test | Purpose | Result |
|---|---|---|
| T-Test (Welch) | Discount impact on profit | p ≈ 0 · No discount: +$66.90 · With discount: −$6.66 |
| ANOVA + Tukey HSD | Category effect on profit | Tech >> Furniture ≈ Office Supplies · gap = $70.05 |
| Chi² + Cramér's V | Category × loss rate | Furniture shows +317 loss orders vs expected |
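As a rough sketch of how the three tests fit together, the block below runs them on synthetic numbers. The group means, sample sizes and the contingency counts are made up for illustration; `scipy.stats.tukey_hsd` (SciPy 1.8+) is swapped in for the statsmodels Tukey HSD that the project lists, purely to keep the example to one library.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# 1) Welch t-test: profit with vs without discount (unequal variances).
no_disc = rng.normal(65, 100, 500)
disc = rng.normal(-5, 150, 500)
t_stat, p_welch = stats.ttest_ind(no_disc, disc, equal_var=False)

# 2) One-way ANOVA across categories, then Tukey HSD for pairwise gaps.
tech = rng.normal(80, 100, 300)
furniture = rng.normal(10, 100, 300)
office = rng.normal(12, 100, 300)
f_stat, p_anova = stats.f_oneway(tech, furniture, office)
tukey = stats.tukey_hsd(tech, furniture, office)  # pairwise p-values in tukey.pvalue

# 3) Chi² on a category x outcome table, with Cramér's V as effect size.
#    Rows: Furniture, Office Supplies, Technology; columns: profitable, loss.
table = np.array([[1500, 600],
                  [5000, 1000],
                  [1600, 250]])
chi2, p_chi2, dof, expected = stats.chi2_contingency(table)
n_obs = table.sum()
cramers_v = np.sqrt(chi2 / (n_obs * (min(table.shape) - 1)))

# Observed minus expected loss orders per category (positive = more losses than expected).
excess_loss = table[:, 1] - expected[:, 1]

print(f"Welch p={p_welch:.3g}, ANOVA p={p_anova:.3g}, Chi² p={p_chi2:.3g}, V={cramers_v:.3f}")
```

The `excess_loss` vector is the same kind of quantity as the "+317 loss orders vs expected" reported for Furniture: the observed count minus the count expected under independence.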
| Stage | Action |
|---|---|
| Plan | Define segmentation + profit modeling goals · dataset audit |
| Analyze | Descriptive stats · distributions · outlier flagging (not removal) · discount/profit scatter · 3 statistical tests |
| Construct | Customer aggregation · RobustScaler · Elbow + Silhouette for k selection · K-Means(k=4) · Per-cluster regression (OLS for C0/C1/C3, Ridge + K-Fold CV for C2) |
| Execute | Final comparison table · business recommendations · export for Power BI |
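The Construct stage can be sketched roughly as follows. The data is synthetic and the column names (`customer_id`, `sales`, `discount`, `profit`) and the four aggregated features are assumptions; the notebook's actual feature set is not listed in this README.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for order-level transactions.
orders = pd.DataFrame({
    "customer_id": rng.integers(0, 200, size=3000),
    "sales": rng.lognormal(4, 1, size=3000),
    "discount": rng.choice([0.0, 0.2, 0.4], size=3000),
})
orders["profit"] = orders["sales"] * (0.25 - orders["discount"])

# Customer aggregation: one row per customer.
customers = orders.groupby("customer_id").agg(
    n_orders=("sales", "size"),
    total_sales=("sales", "sum"),
    total_profit=("profit", "sum"),
    avg_discount=("discount", "mean"),
)

# RobustScaler centers on the median and scales by the IQR.
X = RobustScaler().fit_transform(customers)

# Elbow (inertia) and silhouette for a range of k.
scores = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    scores[k] = (km.inertia_, silhouette_score(X, km.labels_))

# Final model with the k chosen in the project.
customers["cluster"] = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
print(customers.groupby("cluster")["total_profit"].mean())
```

Inertia and silhouette only narrow down the candidates; as the decisions table notes, the final k=4 also rests on whether the resulting profiles are interpretable.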
| Decision | Choice | Reasoning |
|---|---|---|
| Outlier treatment | Flag, not remove | Large B2B sales ($8,159) are legitimate, not errors |
| K-Means scaling | RobustScaler (not Standard) | Centers on the median and scales by the IQR, so the flagged outliers do not dominate the distance computation |
| Ridge scaling | StandardScaler | L2 penalty requires equal feature scales |
| k selection | k=4 | Elbow + Silhouette + business logic (4 interpretable profiles) |
| Regression scope | Per cluster (Approach B) | VIP and At-Risk have opposite profit dynamics → one global model fails |
| C2 model | Ridge + K-Fold CV | Initial OLS had R²=−0.96 (catastrophic overfitting) — Ridge shrinks unstable coefficients |
| SMOTE | Rejected | SMOTE is for classification, not regression. Overfitting solved with Ridge, not more data |
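The C2 decision (Ridge instead of OLS) can be illustrated with a minimal sketch on synthetic data. The sample size, the number of features and their collinearity are invented to mimic a small, noisy cluster; tuning `alpha` with `GridSearchCV` over the same K-Fold splits is one plausible reading of "Ridge + K-Fold CV", not necessarily the notebook's exact setup.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Hypothetical small cluster: 40 customers, 25 highly correlated features.
n, p = 40, 25
base = rng.normal(size=(n, 1))
X = base + 0.1 * rng.normal(size=(n, p))
y = 50 * X[:, 0] + rng.normal(0, 30, size=n)

cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Plain OLS, scored out of fold.
ols = make_pipeline(StandardScaler(), LinearRegression())
ols_r2 = cross_val_score(ols, X, y, cv=cv, scoring="r2").mean()

# Ridge with alpha chosen by K-Fold CV; scaling lives inside the pipeline
# so each fold is standardized using its own training rows only.
search = GridSearchCV(
    make_pipeline(StandardScaler(), Ridge()),
    param_grid={"ridge__alpha": np.logspace(-2, 3, 12)},
    cv=cv,
    scoring="r2",
).fit(X, y)
ridge_r2 = search.best_score_

print(f"OLS CV R2: {ols_r2:.2f}")
print(f"Ridge CV R2: {ridge_r2:.2f} (alpha={search.best_params_['ridge__alpha']:.3g})")
```

With nearly as many features as training rows, OLS coefficients become unstable and out-of-fold R² can drop below zero; the L2 penalty shrinks them toward zero and trades a little bias for much lower variance.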
See challenges_and_solutions.md for the full log of analytical decisions.
- superstore_analysis.ipynb: Complete PACE analysis
- superstore_analysis.py: Jupytext version (VS Code friendly)
- challenges_and_solutions.md: Analytical decisions log
Python · pandas · numpy · scipy (T-Test, ANOVA, Chi²) · statsmodels (OLS, Tukey HSD) · scikit-learn (K-Means, Ridge, K-Fold CV) · matplotlib · seaborn
git clone https://github.com/Khalifa160/superstore-segmentation.git
cd superstore-segmentation
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -r requirements.txt
# Download "Sample - Superstore.csv" (Kaggle: vivek468/superstore-dataset-final)
# and place it next to the notebook
jupyter notebook superstore_analysis.ipynb

- Cap Furniture discounts at 20%: quantified lever of ~$23,318 recoverable profit
- Prioritize Technology category — $70 higher profit per order vs Furniture (Tukey HSD validated)
- Segment marketing by cluster:
- VIP (C1): Exclusive loyalty programs, account managers
- Loyal (C3): Cross-sell and referral campaigns
- Occasional (C0): Reactivation promotions
- 🚨 At-Risk (C2): Emergency discount policy review — STOP over-discounting
- Monitor At-Risk customers monthly — stricter approval for discounts > 30%
Segmentation drives product and pricing strategy, not individual discrimination. Cluster labels describe business behavior, not customer worth. At-Risk clients should receive targeted support to become profitable — never worse service.
- Cluster 2 remains the hardest to model linearly. Next step: tree-based regressor (Random Forest, XGBoost) to capture non-linearities
- No temporal features used (seasonality). Adding month/quarter could improve all models
- Dataset is 2014–2017 — production deployment requires fresh data and online retraining
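The temporal-feature idea from the list above could start from something like the sketch below. The three rows are made up, and the column names `Order Date` and `Profit` are assumed to match the CSV; adjust them to the actual file.

```python
import pandas as pd

# Tiny made-up sample standing in for the order table.
orders = pd.DataFrame({
    "Order Date": ["2014-01-03", "2015-11-27", "2017-12-30"],
    "Profit": [12.5, -40.0, 88.1],
})

# Parse the date, then derive month and quarter as seasonality features.
orders["Order Date"] = pd.to_datetime(orders["Order Date"])
orders["order_month"] = orders["Order Date"].dt.month
orders["order_quarter"] = orders["Order Date"].dt.quarter

# One-hot encode the quarter so a linear model does not treat it as ordinal.
quarter_dummies = pd.get_dummies(orders["order_quarter"], prefix="Q")
features = pd.concat([orders[["order_month"]], quarter_dummies], axis=1)
print(features)
```

For per-customer models these order-level features would still need to be aggregated, for example as the share of a customer's orders placed in each quarter.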
El khlife Messoud — Data Analyst, Nouakchott, Mauritania