🛒 Superstore Sales — Customer Segmentation & Profit Modeling

End-to-end retail analytics project on the Sample Superstore dataset (9,994 transactions, 2014–2017): EDA → statistical tests → K-Means segmentation → per-cluster regression (OLS + Ridge with K-Fold CV). Built with the PACE framework.

📊 Key results

Customer segmentation (K-Means, k=4)

| Cluster | Label | Avg Profit | Profile |
|---|---|---|---|
| C0 | Occasional Clients | +$290 | Low volume, low value |
| C1 | VIP Clients | +$5,912 | High volume, high profit |
| C2 | 🚨 At-Risk Clients | −$212 (loss) | Over-discounted, erratic |
| C3 | Loyal Clients | +$1,323 | Regular and profitable |

Profit model per cluster

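| Cluster | Model | R² (test) | vs Global OLS (R²=0.34) |
|---|---|---|---|
| C0 (Occasional) | OLS | 0.57 | +0.23 |
| C1 (VIP) | OLS | 0.99 | +0.65 |
| C2 (At-Risk) | Ridge + K-Fold CV | 0.32 | −0.02 (but +1.27 vs the initial C2 OLS, rescued from −0.96) |
| C3 (Loyal) | OLS | 0.80 | +0.46 |

Per-cluster modeling beats the global model on three of the four segments (C0, C1, C3). C2 lands just below the global R² (0.32 vs 0.34), but Ridge + K-Fold CV recovers it from an initial OLS R² of −0.96.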
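To illustrate why one global line can score poorly when segments behave differently, here is a minimal sketch on synthetic data (not the Superstore data). The two segments, the single `sales` feature, and the slopes are assumptions made for the example; the notebook's actual features may differ.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 400

# Two hypothetical segments with opposite profit dynamics (synthetic data).
cluster = rng.integers(0, 2, size=n)
sales = rng.uniform(100, 1000, size=n)
slope = np.where(cluster == 0, 0.3, -0.2)  # profit rises in segment 0, falls in segment 1
profit = slope * sales + rng.normal(0, 20, size=n)

X = sales.reshape(-1, 1)
X_tr, X_te, y_tr, y_te, c_tr, c_te = train_test_split(
    X, profit, cluster, test_size=0.25, random_state=0
)

# One global OLS fitted on all customers.
global_model = LinearRegression().fit(X_tr, y_tr)
global_r2 = r2_score(y_te, global_model.predict(X_te))

# One OLS per segment, each scored on its own held-out rows.
per_cluster_r2 = {}
for c in (0, 1):
    tr, te = c_tr == c, c_te == c
    model = LinearRegression().fit(X_tr[tr], y_tr[tr])
    per_cluster_r2[c] = r2_score(y_te[te], model.predict(X_te[te]))

print(f"global R²: {global_r2:.2f}")
for c, r2 in per_cluster_r2.items():
    print(f"segment {c} R²: {r2:.2f}")
```

The global fit averages two opposing slopes into a nearly flat line, while each per-segment fit recovers its own slope.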

💰 Business impact

🎯 Cap Furniture discounts at 20% → quantified opportunity: ~$23,318 recoverable profit (317 loss orders × $73.56 average margin gap, validated by Chi² with Cramér's V).
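The headline figure is a simple product. The $73.56 gap also equals the difference between the two group means reported in the Welch T-Test row of this README (+$66.90 and −$6.66). A minimal sketch of that arithmetic, with illustrative variable names:

```python
# Figures quoted in this README; variable names are illustrative only.
mean_profit_no_discount = 66.90    # Welch T-Test row: no discount
mean_profit_with_discount = -6.66  # Welch T-Test row: with discount
excess_loss_orders = 317           # Chi² row: Furniture loss orders above expected

# Average margin gap between non-discounted and discounted orders.
margin_gap = round(mean_profit_no_discount - mean_profit_with_discount, 2)

# Opportunity if every excess loss order recovered the full gap.
recoverable_profit = excess_loss_orders * margin_gap

print(f"margin gap: ${margin_gap:.2f}")
print(f"recoverable profit: ${recoverable_profit:,.2f}")
```

Because it assumes every excess loss order recovers the full gap, the result is better read as an order-of-magnitude opportunity than as a forecast.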

🧪 Statistical validation

Three complementary tests confirm the EDA findings:

| Test | Purpose | Result |
|---|---|---|
| T-Test (Welch) | Discount impact on profit | p ≈ 0 · No discount: +$66.90 · With discount: −$6.66 |
| ANOVA + Tukey HSD | Category effect on profit | Tech >> Furniture ≈ Office Supplies · gap = $70.05 |
| Chi² + Cramér's V | Category × loss rate | Furniture shows +317 loss orders vs expected |
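As a rough sketch of how two of these tests can be run with scipy, the snippet below uses synthetic numbers (not the Superstore data). The helper `cramers_v` and the contingency table are hypothetical; Cramér's V is computed from the Chi² statistic as sqrt(χ² / (n · (min(rows, cols) − 1))).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic per-order profits for two groups (illustrative values only).
no_discount = rng.normal(loc=60.0, scale=40.0, size=500)
with_discount = rng.normal(loc=-5.0, scale=80.0, size=500)

# Welch's t-test: equal_var=False drops the equal-variance assumption.
t_stat, p_value = stats.ttest_ind(no_discount, with_discount, equal_var=False)


def cramers_v(table):
    """Return (Cramér's V, Chi² p-value) for a contingency table."""
    table = np.asarray(table)
    chi2, p, dof, expected = stats.chi2_contingency(table, correction=False)
    n = table.sum()
    k = min(table.shape) - 1  # smaller dimension minus one
    return float(np.sqrt(chi2 / (n * k))), float(p)


# Hypothetical category x (loss, no loss) counts.
category_by_loss = np.array([
    [700, 1400],   # Furniture
    [1100, 4900],  # Office Supplies
    [270, 1570],   # Technology
])
v, p_chi2 = cramers_v(category_by_loss)

print(f"Welch t = {t_stat:.2f}, p = {p_value:.3g}")
print(f"Cramér's V = {v:.3f}, Chi² p = {p_chi2:.3g}")
```

Reporting Cramér's V next to the Chi² p-value matters on a dataset of this size, since almost any association becomes statistically significant with thousands of rows; V indicates how strong it is.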

🛠️ Methodology (PACE)

| Stage | Action |
|---|---|
| Plan | Define segmentation + profit modeling goals · dataset audit |
| Analyze | Descriptive stats · distributions · outlier flagging (not removal) · discount/profit scatter · 3 statistical tests |
| Construct | Customer aggregation · RobustScaler · Elbow + Silhouette for k selection · K-Means(k=4) · Per-cluster regression (OLS for C0/C1/C3, Ridge + K-Fold CV for C2) |
| Execute | Final comparison table · business recommendations · export for Power BI |
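The Construct stage can be sketched roughly as follows. This is a minimal example on synthetic orders, not the notebook's code: the aggregated features (`n_orders`, `total_sales`, `total_profit`, `avg_discount`) and the range of k values are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(42)

# Synthetic order lines standing in for the real CSV.
orders = pd.DataFrame({
    "Customer ID": rng.integers(0, 200, size=3000),
    "Sales": rng.gamma(2.0, 100.0, size=3000),
    "Profit": rng.normal(20.0, 60.0, size=3000),
    "Discount": rng.choice([0.0, 0.1, 0.2, 0.4], size=3000),
})

# Customer aggregation: one row per customer.
customers = orders.groupby("Customer ID").agg(
    n_orders=("Sales", "size"),
    total_sales=("Sales", "sum"),
    total_profit=("Profit", "sum"),
    avg_discount=("Discount", "mean"),
)

# RobustScaler centers on the median and scales by the IQR.
X = RobustScaler().fit_transform(customers)

# Elbow (inertia) and silhouette score for each candidate k.
scores = {}
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    scores[k] = (km.inertia_, silhouette_score(X, km.labels_))

for k, (inertia, sil) in scores.items():
    print(f"k={k}: inertia={inertia:.1f}, silhouette={sil:.3f}")

# Final fit with the k retained in this README.
customers["segment"] = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
```

On random synthetic data the scores will not point to a clear k; the loop only shows where the elbow and silhouette values come from.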

🧠 Key analytical decisions

| Decision | Choice | Reasoning |
|---|---|---|
| Outlier treatment | Flag, not remove | Large B2B sales ($8,159) are legitimate, not errors |
| K-Means scaling | RobustScaler (not StandardScaler) | Centers on the median and scales by the IQR, so extreme customers do not dominate the scaling |
| Ridge scaling | StandardScaler | The L2 penalty is scale-sensitive, so features need comparable scales |
| k selection | k=4 | Elbow + Silhouette + business logic (4 interpretable profiles) |
| Regression scope | Per cluster (Approach B) | VIP and At-Risk have opposite profit dynamics → one global model fails |
| C2 model | Ridge + K-Fold CV | Initial OLS had R²=−0.96 (catastrophic overfitting); Ridge shrinks unstable coefficients |
| SMOTE | Rejected | SMOTE targets class imbalance in classification, not regression; overfitting was addressed with Ridge rather than synthetic samples |

See challenges_and_solutions.md for the full log of analytical decisions.
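The C2 decision (Ridge with StandardScaler, evaluated by K-Fold CV) can be illustrated with a small sketch. The data below is synthetic and deliberately hard for OLS (few rows, many nearly collinear features, one feature on a much larger scale); `alpha=10.0` and the 5 folds are assumptions, not the notebook's tuned values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
n, p = 40, 25  # few customers, many features

# Nearly collinear synthetic features driven by one hidden factor.
base = rng.normal(size=(n, 1))
X = base + 0.1 * rng.normal(size=(n, p))
X[:, 0] *= 1000.0  # one feature on a much larger scale
y = 50.0 * base.ravel() + rng.normal(0.0, 25.0, size=n)

cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Scaling lives inside the pipeline, so each fold fits the scaler
# on its own training rows only (no leakage into the test fold).
ols = make_pipeline(StandardScaler(), LinearRegression())
ridge = make_pipeline(StandardScaler(), Ridge(alpha=10.0))

ols_r2 = cross_val_score(ols, X, y, cv=cv, scoring="r2").mean()
ridge_r2 = cross_val_score(ridge, X, y, cv=cv, scoring="r2").mean()

print(f"OLS   mean CV R²: {ols_r2:.2f}")
print(f"Ridge mean CV R²: {ridge_r2:.2f}")
```

With almost as many coefficients as training rows, OLS fits noise and its held-out R² collapses. The L2 penalty shrinks the unstable directions, which is the mechanism the table attributes to the C2 rescue.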

📁 Files

🧰 Tech stack

Python · pandas · numpy · scipy (T-Test, ANOVA, Chi²) · statsmodels (OLS, Tukey HSD) · scikit-learn (K-Means, Ridge, K-Fold CV) · matplotlib · seaborn

▶️ Run locally

git clone https://github.com/Khalifa160/superstore-segmentation.git
cd superstore-segmentation

python -m venv .venv
source .venv/bin/activate        # Windows: .venv\Scripts\activate
pip install -r requirements.txt

# Download "Sample - Superstore.csv" (Kaggle: vivek468/superstore-dataset-final)
# and place it next to the notebook
jupyter notebook superstore_analysis.ipynb

🚀 Business recommendations

  1. Cap Furniture discounts at 20% — quantified lever: ~$23,318 recoverable profit
  2. Prioritize Technology category — $70 higher profit per order vs Furniture (Tukey HSD validated)
  3. Segment marketing by cluster:
    • VIP (C1): Exclusive loyalty programs, account managers
    • Loyal (C3): Cross-sell and referral campaigns
    • Occasional (C0): Reactivation promotions
    • 🚨 At-Risk (C2): Emergency discount policy review — STOP over-discounting
  4. Monitor At-Risk customers monthly — stricter approval for discounts > 30%

⚖️ Ethical considerations

Segmentation drives product and pricing strategy, not individual discrimination. Cluster labels describe business behavior, not customer worth. At-Risk clients should receive targeted support to become profitable — never worse service.

⚠️ Model limitations & next steps

  • Cluster 2 remains the hardest to model linearly. Next step: tree-based regressor (Random Forest, XGBoost) to capture non-linearities
  • No temporal features used (seasonality). Adding month/quarter could improve all models
  • Dataset is 2014–2017 — production deployment requires fresh data and online retraining
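For the temporal-features item, one possible starting point is to derive month and quarter from the order date with pandas. The rows below are made up, and the column name `Order Date` and the ISO date format are assumptions about how the data would be loaded.

```python
import pandas as pd

# Made-up rows; the real file's date format may differ.
orders = pd.DataFrame({
    "Order Date": ["2016-11-08", "2016-06-12", "2014-01-03"],
    "Profit": [41.9, 6.9, -3.5],
})

orders["Order Date"] = pd.to_datetime(orders["Order Date"])
orders["order_month"] = orders["Order Date"].dt.month
orders["order_quarter"] = orders["Order Date"].dt.quarter

# Quick seasonality check: average profit per quarter.
profit_by_quarter = orders.groupby("order_quarter")["Profit"].mean()
print(orders[["Order Date", "order_month", "order_quarter"]])
print(profit_by_quarter)
```

These columns could then be aggregated per customer (for example, share of orders per quarter) before being added to the per-cluster regressions.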

👤 Author

El khlife Messoud — Data Analyst, Nouakchott, Mauritania
