HR Analytics: The $12.25M Hidden Cost of Employee Turnover
A data-driven analysis quantifying employee attrition and proving the ROI of retention programs. Stack: Python (pandas, scipy, scikit-learn) · SQL · Power BI
How much does employee turnover really cost a company — and can a retention program pay for itself?
This project analyzes the IBM HR Analytics dataset (1,470 employees, 35 features) to quantify the financial impact of attrition and identify the operational levers that drive it.
| Metric | Value | Insight |
|---|---|---|
| Annual Cost of Turnover | $12.25M | Hidden in HR, recruiting, and productivity budgets |
| Cost per Departure | $52k | ~90% of annual salary (industry benchmark) |
| Attrition Rate | 16.1% | Above the 10-15% market benchmark |
| High-Risk Employees | 17 | Identified through composite risk scoring |
| Retention Program ROI | 819% | $200k investment → $1.84M annual savings |
| Payback Period | 1.3 months | Faster than most marketing investments |
Employees working overtime are 3x more likely to quit. Despite being only 28% of the workforce, they account for 54% of all departures.
Recommended action: Strict overtime policy with quarterly review.
Out of 3 departments, two concentrate almost all the financial damage:
- R&D: $5.9M annual turnover cost
- Sales: $5.87M annual turnover cost
- HR: ~$0.5M
Recommended action: Focus 100% of retention programs on R&D and Sales first.
29.82% of new hires leave within their first 2 years — nearly 4x higher than employees with 10+ years (8.13%).
Recommended action: Mandatory mentoring program for the first 24 months.
These findings broke common assumptions and are worth highlighting:
Employees who rate their work-life balance as "4 — excellent" have a 17.65% attrition rate — higher than those who rate it 2 or 3 (~16%).
Hypothesis: too much balance may signal under-engagement.
- Low hike (<12%): 19.52% attrition (expected)
- Top hike (20%+): 16.37% attrition (still high!)
Hypothesis: large raises may be "retention counter-offers" that come too late, or signal increased pressure.
The HR department has an attrition rate of 19.05% — the very team responsible for retention is failing to retain itself.
Raw CSV (Kaggle)
↓
Notebook 1: Data Cleaning + EDA (pandas, seaborn)
↓
Notebook 2: Statistical Validation (scipy)
↓
Notebook 3: Cost Analysis + Risk Scoring (pandas, numpy)
↓
Processed CSVs (fact + dim tables)
↓
Power BI Dashboard (5 pages, 15+ DAX measures)
- Mann-Whitney U: salary differences between leavers and stayers
- Chi-square + Cramér's V: categorical relationships (Overtime × Attrition)
- Multivariate ranking: identified top 5 predictors of turnover
Based on SHRM and Gallup research, the turnover cost was estimated as 90% of annual salary, broken down into:
- Recruitment (20%)
- Onboarding (15%)
- Productivity loss (40%)
- Intangible cost (15%)
Each active employee received a 0-10 risk score based on weighted factors:
- Overtime (+2 points)
- Distance >20km (+2 points)
- Low satisfaction (+2 points)
- Low salary (+1 point)
- Tenure <3 years (+1 point)
- Age <30 (+1 point)
Resulting classification:
- HIGH RISK: score ≥ 6 → 17 employees identified
- MEDIUM RISK: score 3-5
- LOW RISK: score 0-2
hr-analytics/
├── README.md
├── .gitignore
├── data/
│ ├── raw/ (not pushed — see .gitignore)
│ └── processed/
│ ├── hr_employees_processed.csv
│ ├── hr_at_risk.csv
│ ├── financial_summary.csv
│ ├── statistical_ranking.csv
│ ├── numeric_tests.csv
│ └── categorical_tests.csv
├── notebooks/
│ ├── 01_data_cleaning_eda.ipynb
│ ├── 02_statistical_analysis.ipynb
│ └── 03_cost_analysis.ipynb
├── sql/
│ └── hr_analysis.sql
├── dashboard/
│ └── hr_dashboard.pbix
└── docs/
├── 01_executive_summary.png
├── 02_demographics.png
├── 03_compensation.png
├── 04_satisfaction.png
├── 05_cost_action.png
└── hr_dashboard.pdf
- Python 3.9+
- Power BI Desktop (free)
- Jupyter Notebook
git clone https://github.com/igorrocha-tech/hr-analytics.git
cd hr-analytics
# Install Python dependencies
pip install pandas numpy matplotlib seaborn scipy scikit-learn jupyter
# Download the dataset from Kaggle (link in /data/raw/README.md)
# Place WA_Fn-UseC_-HR-Employee-Attrition.csv in data/raw/
# Run the notebooks in order
jupyter notebook01_data_cleaning_eda.ipynb→ generateshr_employees_processed.csv02_statistical_analysis.ipynb→ generates statistical results03_cost_analysis.ipynb→ generateshr_at_risk.csv+financial_summary.csv- Open
dashboard/hr_dashboard.pbixin Power BI Desktop
The hero number ($12.25M) reframes turnover from "HR problem" to "CFO problem."
Reveals the critical 0-2 year window, Single status as risk factor, and HR's own retention issue.
The Salary × Tenure scatter shows the "danger zone": low salary + short tenure concentrates attrition.
Home of the 3x overtime multiplier insight — the single most actionable finding.
Three retention scenarios with quantified savings, top 17 high-risk employees, and 5-point action plan.
-
Overtime Policy (impact: $$$$) Implement strict overtime limits with quarterly review. Expected: 5-10% turnover reduction.
-
High-Risk 1:1s (impact: $$$) Schedule conversations with the 17 high-risk employees within 30 days.
-
R&D + Sales Focus (impact: $$$) These departments concentrate 96% of total turnover cost. Prioritize all retention programs here.
-
Mentoring (0-2 years) (impact: $$) 30% of new hires leave in the first 2 years. Implement onboarding mentorship.
-
Engagement Survey (impact: $) Quarterly pulse surveys to catch issues early.
| Category | Skills |
|---|---|
| Python | pandas, numpy, matplotlib, seaborn, scipy.stats, scikit-learn |
| Statistical Analysis | Hypothesis testing (t-test, chi-square, Mann-Whitney), correlation analysis, Cramér's V |
| Data Engineering | Data cleaning, feature engineering, dimensional modeling, CSV transformations |
| Power BI | 15+ DAX measures, conditional formatting, treemap, what-if scenarios, calculated columns |
| Business Analysis | Cost modeling, ROI calculation, risk scoring, scenario simulation |
| Storytelling | Executive dashboard design, actionable recommendations, counter-intuitive insights |
Igor Rocha Nunes — Data Analyst
- 💼 LinkedIn: igor-rocha-nunes
- 🐙 GitHub: igorrocha-tech
- 📊 Other portfolio project: Olist E-commerce Analysis
Open to remote freelance and contract analytics projects.
This project uses the publicly available IBM HR Analytics Dataset from Kaggle (synthetic data). Analysis, code, and dashboard are licensed under MIT — feel free to learn from and adapt.
This is a portfolio project using synthetic data from IBM. Cost figures, ROI calculations, and risk scores are based on industry benchmarks (SHRM, Gallup) applied to the dataset values. The analysis methodology is real and reproducible; the specific numbers should be interpreted as illustrative of the approach, not as findings from a real company.




