🛒 Superstore Sales — Customer Segmentation & Profit Modeling

End-to-end retail analytics project on the Sample Superstore dataset (9,994 transactions, 2014–2017): EDA → statistical tests → K-Means segmentation → per-cluster regression (OLS + Ridge with K-Fold CV). Built with the PACE framework (Plan, Analyze, Construct, Execute).

📊 Key results

Customer segmentation (K-Means, k=4)

| Cluster | Label | Avg Profit | Profile |
|---|---|---|---|
| C0 | Occasional Clients | +$290 | Low volume, low value |
| C1 | VIP Clients ⭐ | +$5,912 | High volume, high profit |
| C2 | 🚨 At-Risk Clients | −$212 (loss) | Over-discounted, erratic |
| C3 | Loyal Clients | +$1,323 | Regular and profitable |
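The segmentation step (RobustScaler, then Elbow + Silhouette to pick k, then K-Means) could look roughly like the sketch below. It runs on synthetic points, not the Superstore customers, and the two-feature layout is an assumption made only so the example is self-contained.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(42)

# Synthetic customer-level features (NOT the Superstore data): four
# well-separated groups in a hypothetical (total_sales, total_profit) space.
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=1.0, size=(50, 2)) for c in centers])

# RobustScaler centers each feature on its median and divides by its IQR.
X_scaled = RobustScaler().fit_transform(X)

# Record both selection criteria for each candidate k.
scores = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    scores[k] = {
        "inertia": km.inertia_,                                # elbow criterion
        "silhouette": silhouette_score(X_scaled, km.labels_),  # cluster separation
    }

# Here k is picked by silhouette alone; the README also weighs business logic.
best_k = max(scores, key=lambda k: scores[k]["silhouette"])
labels = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(X_scaled)

for k, s in scores.items():
    print(k, round(s["inertia"], 1), round(s["silhouette"], 3))
print("selected k:", best_k)
```

On real customer data the silhouette peak is rarely this clean, which is why the final choice of k also rests on interpretability.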

Profit model per cluster

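| Cluster | Model | R² (test) | Δ vs Global OLS (R² = 0.34) |
|---|---|---|---|
| C0 (Occasional) | OLS | 0.57 | +0.23 |
| C1 (VIP) | OLS | 0.99 | +0.65 |
| C2 (At-Risk) | Ridge + K-Fold CV | 0.32 | −0.02 (but +1.27 vs its initial per-cluster OLS, R² = −0.96) |
| C3 (Loyal) | OLS | 0.80 | +0.46 |

Per-cluster modeling beats the global model on three of the four segments. C2 lands just below the global R² (0.32 vs 0.34), but Ridge rescues it from the −0.96 of its initial per-cluster OLS.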
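To illustrate why a single global regression can underperform when segments behave differently, here is a minimal sketch on synthetic data. The feature name `sales`, the two-segment setup, and the slopes are all assumptions for illustration; this is not the notebook's code.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 400

# Synthetic data (NOT the Superstore data): two segments in which a
# hypothetical feature `sales` drives profit in opposite directions.
cluster = rng.integers(0, 2, size=n)
sales = rng.uniform(0, 100, size=n)
slope = np.where(cluster == 0, 0.3, -0.3)
profit = slope * sales + rng.normal(scale=2.0, size=n)

X = sales.reshape(-1, 1)
X_tr, X_te, y_tr, y_te, c_tr, c_te = train_test_split(
    X, profit, cluster, test_size=0.25, random_state=0
)

# One model for everyone: the opposite slopes largely cancel out.
global_model = LinearRegression().fit(X_tr, y_tr)
r2_global = r2_score(y_te, global_model.predict(X_te))

# One model per segment, each scored on its own held-out rows.
r2_by_cluster = {}
for c in np.unique(cluster):
    model = LinearRegression().fit(X_tr[c_tr == c], y_tr[c_tr == c])
    r2_by_cluster[int(c)] = r2_score(y_te[c_te == c], model.predict(X_te[c_te == c]))

print("global R²:", round(r2_global, 3))
print("per-cluster R²:", {c: round(v, 3) for c, v in r2_by_cluster.items()})
```

Note that per-cluster and global R² values are computed on different test subsets, so the comparison is indicative rather than strictly like-for-like.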

💰 Business impact

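🎯 Cap Furniture discounts at 20% → quantified opportunity: ~$23,318 recoverable profit (317 excess loss orders × $73.56 average margin gap). The 317 is the Furniture excess over the Chi² expected count, with Cramér's V as the effect size; the $73.56 gap matches the difference between mean profit without discount (+$66.90) and with discount (−$6.66).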
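The estimate can be re-derived from figures quoted in this README. The sketch assumes the $73.56 margin gap is the difference between the two mean profits reported for the Welch T-Test, which the arithmetic is consistent with.

```python
# Figures quoted in this README.
mean_profit_no_discount = 66.90    # USD per order (Welch T-Test row)
mean_profit_with_discount = -6.66  # USD per order (Welch T-Test row)
excess_loss_orders = 317           # Furniture loss orders above the Chi² expectation

# Assumption: the "average margin gap" is the difference of the two means.
margin_gap = mean_profit_no_discount - mean_profit_with_discount
recoverable_profit = excess_loss_orders * margin_gap

print(f"margin gap: ${margin_gap:.2f}")                   # margin gap: $73.56
print(f"recoverable profit: ${recoverable_profit:,.2f}")  # recoverable profit: $23,318.52
```

This is an upper-bound style estimate: it assumes every excess loss order would move to the average no-discount margin once the cap applies.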

🧪 Statistical validation

Three complementary tests back the EDA findings statistically:

| Test | Purpose | Result |
|---|---|---|
| T-Test (Welch) | Discount impact on profit | p ≈ 0 · No discount: +$66.90 · With discount: −$6.66 |
| ANOVA + Tukey HSD | Category effect on profit | Tech >> Furniture ≈ Office Supplies · gap = $70.05 |
| Chi² + Cramér's V | Category × loss rate | Furniture has 317 more loss orders than expected under independence |
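The tests above map onto standard `scipy.stats` calls. The sketch below runs them on synthetic samples and an invented contingency table, so none of the numbers it prints correspond to the results in the table; Tukey HSD is left out to keep it short, and the `cramers_v` helper is a hypothetical name.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic per-order profits (NOT the Superstore data).
profit_no_discount = rng.normal(loc=60.0, scale=40.0, size=500)
profit_discounted = rng.normal(loc=-5.0, scale=80.0, size=500)

# Welch's t-test: equal_var=False drops the equal-variance assumption.
t_stat, p_welch = stats.ttest_ind(
    profit_no_discount, profit_discounted, equal_var=False
)

# One-way ANOVA across three hypothetical category samples.
tech = rng.normal(80.0, 50.0, 300)
furniture = rng.normal(10.0, 50.0, 300)
office = rng.normal(12.0, 50.0, 300)
f_stat, p_anova = stats.f_oneway(tech, furniture, office)

# Chi-square test of independence on a category x (loss, no loss) table.
# Counts are invented for illustration.
contingency = np.array([
    [700, 1400],   # Furniture
    [900, 5100],   # Office Supplies
    [270, 1580],   # Technology
])
chi2, p_chi2, dof, expected = stats.chi2_contingency(contingency)


def cramers_v(chi2_stat, table):
    """Effect size for a chi-square test, between 0 and 1."""
    n = table.sum()
    k = min(table.shape) - 1
    return float(np.sqrt(chi2_stat / (n * k)))


v = cramers_v(chi2, contingency)

# Observed minus expected loss counts: the kind of "excess" quoted above.
excess_loss = contingency[:, 0] - expected[:, 0]

print("Welch p:", p_welch, "| ANOVA p:", p_anova, "| Chi² p:", p_chi2)
print("Cramér's V:", round(v, 3), "| excess loss orders:", excess_loss.round(1))
```

The excess counts always sum to zero across categories, so a surplus of loss orders in one category is matched by a deficit elsewhere.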

🛠️ Methodology (PACE)

| Stage | Action |
|---|---|
| Plan | Define segmentation + profit modeling goals · dataset audit |
| Analyze | Descriptive stats · distributions · outlier flagging (not removal) · discount/profit scatter · 3 statistical tests |
| Construct | Customer aggregation · RobustScaler · Elbow + Silhouette for k selection · K-Means (k=4) · Per-cluster regression (OLS for C0/C1/C3, Ridge + K-Fold CV for C2) |
| Execute | Final comparison table · business recommendations · export for Power BI |
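Flagging outliers instead of dropping them (the Analyze stage) can be as simple as adding a boolean column. The README does not state which rule was used, so the 1.5 × IQR fence below is an assumption, as are the `Sales` column name and the toy values (8159.0 mirrors the large B2B sale mentioned in this README).

```python
import pandas as pd

# Toy order values, not taken from the dataset.
orders = pd.DataFrame(
    {"Sales": [12.5, 48.0, 75.2, 110.0, 230.4, 19.9, 64.3, 8159.0]}
)

# Tukey fences: anything beyond 1.5 * IQR from the quartiles is flagged.
q1, q3 = orders["Sales"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep every row; downstream steps can decide how to treat flagged ones.
orders["is_outlier"] = (orders["Sales"] < lower) | (orders["Sales"] > upper)

print(orders)
```

Keeping the rows means legitimate large orders still contribute to customer totals, while the flag remains available for sensitivity checks.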

🧠 Key analytical decisions

| Decision | Choice | Reasoning |
|---|---|---|
| Outlier treatment | Flag, not remove | Large B2B sales (e.g. $8,159) are legitimate orders, not errors |
| K-Means scaling | RobustScaler (not StandardScaler) | Centers on the median and scales by the IQR, so extreme customers do not dominate the distances |
| Ridge scaling | StandardScaler | The L2 penalty depends on feature scale, so features must be comparable for coefficients to be shrunk evenly |
| k selection | k=4 | Elbow + Silhouette + business logic (4 interpretable profiles) |
| Regression scope | Per cluster (Approach B) | VIP and At-Risk have opposite profit dynamics, so a single global model fits poorly |
| C2 model | Ridge + K-Fold CV | Initial OLS had R²=−0.96 (catastrophic overfitting); Ridge shrinks unstable coefficients |
| SMOTE | Rejected | SMOTE oversamples minority classes for classification, not regression; overfitting was handled with Ridge rather than synthetic data |
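The C2 decision (Ridge on standardized features, evaluated with K-Fold CV) can be sketched as below. The data is synthetic and deliberately small and collinear so that plain OLS overfits; the alpha grid, fold counts, and the use of `RidgeCV` for alpha selection are assumptions, not the notebook's settings.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, RidgeCV
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Synthetic small, noisy, collinear segment (NOT the real C2 data):
# 40 rows, 25 near-duplicate features.
n, p = 40, 25
base = rng.normal(size=(n, 1))
X = base + 0.1 * rng.normal(size=(n, p))
y = 2.0 * base[:, 0] + rng.normal(scale=1.0, size=n)

cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Scaling lives inside the pipeline so each fold is scaled on its own
# training rows only (no leakage from the held-out fold).
ols = make_pipeline(StandardScaler(), LinearRegression())
ridge = make_pipeline(
    StandardScaler(),
    RidgeCV(alphas=np.logspace(-2, 3, 20), cv=5),  # inner CV picks alpha
)

r2_ols = cross_val_score(ols, X, y, cv=cv, scoring="r2").mean()
r2_ridge = cross_val_score(ridge, X, y, cv=cv, scoring="r2").mean()

print("OLS   mean CV R²:", round(r2_ols, 3))
print("Ridge mean CV R²:", round(r2_ridge, 3))
```

With almost as many features as training rows, OLS coefficients become unstable, which is the same failure mode the R²=−0.96 result points to.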

See challenges_and_solutions.md for the full log of analytical decisions.

📁 Files

superstore_analysis.ipynb — Complete PACE analysis
superstore_analysis.py — Jupytext version (VS Code friendly)
challenges_and_solutions.md — Analytical decisions log

🧰 Tech stack

Python · pandas · numpy · scipy (T-Test, ANOVA, Chi²) · statsmodels (OLS, Tukey HSD) · scikit-learn (K-Means, Ridge, K-Fold CV) · matplotlib · seaborn

▶️ Run locally

git clone https://github.com/Khalifa160/superstore-segmentation.git
cd superstore-segmentation

python -m venv .venv
source .venv/bin/activate        # Windows: .venv\Scripts\activate
pip install -r requirements.txt

# Download "Sample - Superstore.csv" (Kaggle: vivek468/superstore-dataset-final)
# and place it next to the notebook
jupyter notebook superstore_analysis.ipynb

🚀 Business recommendations

- Cap Furniture discounts at 20%. Quantified lever: ~$23,318 recoverable profit
- Prioritize the Technology category: $70 higher profit per order vs Furniture (Tukey HSD validated)
- Segment marketing by cluster:
  - VIP (C1): Exclusive loyalty programs, account managers
  - Loyal (C3): Cross-sell and referral campaigns
  - Occasional (C0): Reactivation promotions
  - 🚨 At-Risk (C2): Emergency discount policy review; STOP over-discounting
- Monitor At-Risk customers monthly, with stricter approval for discounts > 30%

⚖️ Ethical considerations

Segmentation drives product and pricing strategy, not individual discrimination. Cluster labels describe business behavior, not customer worth. At-Risk clients should receive targeted support to become profitable — never worse service.

⚠️ Model limitations & next steps

- Cluster 2 remains the hardest to model linearly. Next step: a tree-based regressor (Random Forest, XGBoost) to capture non-linearities.
- No temporal features used (seasonality). Adding month/quarter could improve all models.
- Dataset is 2014–2017. Production deployment requires fresh data and online retraining.
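As a starting point for the temporal features mentioned above, month and quarter can be derived from the order date with pandas. The sketch assumes an `Order Date` column in month/day/year format; the rows and the `order_month` / `order_quarter` column names are made up for illustration.

```python
import pandas as pd

# Toy rows, not taken from the dataset.
orders = pd.DataFrame(
    {
        "Order Date": ["11/8/2016", "6/12/2016", "10/11/2015", "1/3/2014"],
        "Profit": [10.0, 20.0, -30.0, 5.0],
    }
)

# Parse once, then derive calendar features for the regressions.
order_date = pd.to_datetime(orders["Order Date"], format="%m/%d/%Y")
orders["order_month"] = order_date.dt.month
orders["order_quarter"] = order_date.dt.quarter

print(orders)
```

For linear models these would typically be one-hot encoded rather than used as raw integers, since month 12 and month 1 are adjacent in the calendar but far apart numerically.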

👤 Author

El khlife Messoud — Data Analyst, Nouakchott, Mauritania
