383 changes: 383 additions & 0 deletions Week-05-Business-Stats-Analytics/Exercise-DONT-EDIT-MAKE-COPY.ipynb
@@ -0,0 +1,383 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Week 5 Exercise: Business Statistics Analytics\n",
"## IBM HR Analytics Employee Attrition Dataset\n",
"\n",
"**Objective:** Apply statistical methods from Week 5 to analyze employee data and answer stakeholder-style business questions.\n",
"\n",
"**Dataset:** `data/WA_Fn-UseC_-HR-Employee-Attrition.csv` \n",
"**Source:** IBM HR Analytics Employee Attrition (fictional dataset created by IBM data scientists) \n",
"**Size:** 1,470 employees, 35 features \n",
"**Business Context:** You're an HR data analyst tasked with understanding workforce patterns and providing actionable insights to leadership.\n",
"\n",
"---\n",
"\n",
"## Dataset Overview\n",
"\n",
"This dataset contains employee information including:\n",
"- **Demographics:** Age, Gender, MaritalStatus, EducationField\n",
"- **Job Information:** Department, JobRole, JobLevel, MonthlyIncome\n",
"- **Satisfaction Metrics:** JobSatisfaction, EnvironmentSatisfaction, WorkLifeBalance\n",
"- **Attrition:** Whether the employee left the company (Yes/No)\n",
"- **Work Patterns:** OverTime, BusinessTravel, DistanceFromHome\n",
"\n",
"### Key Variable Encodings:\n",
"- **Education:** 1='Below College', 2='College', 3='Bachelor', 4='Master', 5='Doctor'\n",
"- **JobSatisfaction:** 1='Low', 2='Medium', 3='High', 4='Very High'\n",
"- **WorkLifeBalance:** 1='Bad', 2='Good', 3='Better', 4='Best'\n",
"- **PerformanceRating:** 1='Low', 2='Good', 3='Excellent', 4='Outstanding'\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup & Guidelines\n",
"\n",
"**Required Libraries:** pandas, numpy, scipy.stats, statsmodels, matplotlib, seaborn\n",
"\n",
"**Statistical Guidelines:**\n",
"- Use **95% confidence intervals** unless stated otherwise\n",
"- For group comparisons, always **report sample sizes** (n)\n",
"- If any group has **n < 30**, mention it and proceed with caution\n",
"- For proportion tests, show **counts and proportions** clearly\n",
"- Always include **1-3 sentence interpretations** (statistical + business context)\n",
"- State your **assumptions** and **test choices**\n",
"\n",
"**Important Notes:**\n",
"- This dataset is clean with no missing values\n",
"- Some variables are encoded as numbers but represent categories (see above)\n",
"- Sample sizes are generally adequate (n ≥ 30) for most analyses\n",
"- Focus on **practical significance** in addition to statistical significance\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Setup Cell - Run this first\n",
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"from scipy import stats\n",
"import statsmodels.api as sm\n",
"import warnings\n",
"warnings.filterwarnings('ignore')\n",
"\n",
"# Load the dataset\n",
"df = pd.read_csv('data/WA_Fn-UseC_-HR-Employee-Attrition.csv')\n",
"\n",
"# Basic dataset info\n",
"print(f\"Dataset loaded: {df.shape[0]} employees, {df.shape[1]} features\")\n",
"print(f\"Missing values: {df.isnull().sum().sum()}\")\n",
"print(\"\\nFirst few rows:\")\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Q1: Workforce Overview\n",
"\n",
"**Business Question:** What's our current workforce composition and attrition situation?\n",
"\n",
"Report total headcount, attrition count, and overall attrition rate. Break down attrition by Department with a simple bar chart.\n",
"\n",
"**Skills:** Descriptive statistics, basic data exploration, visualization.\n"
]
},
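{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Illustrative sketch (not the solution):** one way to get a per-group rate is to convert a Yes/No column to booleans and aggregate by group. The `toy` DataFrame and its values are made up for illustration; the same pattern should carry over to `df`, assuming `Attrition` holds the strings `Yes` and `No`.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Toy data (hypothetical), standing in for the real df\n",
"toy = pd.DataFrame({\n",
"    'Department': ['Sales', 'Sales', 'R&D', 'R&D', 'R&D', 'HR'],\n",
"    'Attrition': ['Yes', 'No', 'No', 'No', 'Yes', 'No'],\n",
"})\n",
"\n",
"# The mean of a boolean column is the proportion of True values\n",
"rates = (\n",
"    toy.assign(left=toy['Attrition'].eq('Yes'))\n",
"       .groupby('Department')['left']\n",
"       .agg(n='count', leavers='sum', rate='mean')\n",
"       .sort_values('rate', ascending=False)\n",
")\n",
"print(rates)\n",
"```\n",
"\n",
"Reporting `n` next to each rate keeps small groups visible, in line with the statistical guidelines above.\n"
]
},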
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# TODO: Calculate total headcount and attrition statistics\n",
"# TODO: Break down attrition by Department\n",
"# TODO: Create a bar chart showing attrition rate by department\n",
"# TODO: Provide 1-2 sentence business interpretation\n",
"\n",
"# Your code here:\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Q2: Top Roles by Compensation Contribution\n",
"\n",
"**Business Question:** Which job roles contribute most to our total compensation spend?\n",
"\n",
"Rank `JobRole` by total `MonthlyIncome` (sum across all employees). Show the top 5 roles and their share of total compensation.\n",
"\n",
"**Skills:** Groupby operations, ranking, percentage calculations.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# TODO: Group by JobRole and calculate total MonthlyIncome for each role\n",
"# TODO: Rank roles by total compensation and show top 5\n",
"# TODO: Calculate percentage share of total compensation\n",
"# TODO: Create visualization of top 5 roles\n",
"# TODO: Provide business insight\n",
"\n",
"# Your code here:\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Q3: Education Field Compensation Analysis\n",
"\n",
"**Business Question:** Which education background commands the highest average salary?\n",
"\n",
"Find which `EducationField` has the highest average `MonthlyIncome`. Report sample size (n), mean, and standard deviation for each field. Then compute a 95% bootstrap confidence interval for mean `MonthlyIncome` in each of the top two education fields.\n",
"\n",
"**Skills:** Groupby analysis, confidence intervals (bootstrap method).\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# TODO: Group by EducationField and calculate mean, std, count for MonthlyIncome\n",
"# TODO: Identify the top 2 education fields by average income\n",
"# TODO: Create bootstrap confidence intervals for the top 2 fields\n",
"# TODO: Report n, mean, and 95% CI for each field\n",
"# TODO: Provide business interpretation\n",
"\n",
"# Bootstrap CI helper function (you can use this):\n",
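"def bootstrap_ci_mean(series, n_boot=5000, ci=0.95, seed=42):\n",
"    \"\"\"Percentile bootstrap confidence interval for the mean of a pandas Series.\"\"\"\n",
"    rng = np.random.default_rng(seed)  # local generator; NumPy's global random state is left untouched\n",
"    values = series.to_numpy()\n",
"    # Resample with replacement n_boot times and record each resample's mean\n",
"    boots = [rng.choice(values, size=len(values), replace=True).mean()\n",
"             for _ in range(n_boot)]\n",
"    lo = np.quantile(boots, (1 - ci) / 2)\n",
"    hi = np.quantile(boots, 1 - (1 - ci) / 2)\n",
"    return lo, hi\n",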
"\n",
"# Your code here:\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Q4: Sales Department Compensation Confidence Interval\n",
"\n",
"**Business Question:** What's our confidence interval for average Sales department compensation?\n",
"\n",
"Compute a 95% bootstrap confidence interval for mean `MonthlyIncome` in the Sales department. Report sample size (n), mean, confidence interval, and provide a 1-2 sentence business interpretation.\n",
"\n",
"**Skills:** Bootstrap confidence intervals, business interpretation of statistical results.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# TODO: Filter data for Sales department\n",
"# TODO: Calculate bootstrap 95% CI for mean MonthlyIncome in Sales\n",
"# TODO: Report sample size, mean, and confidence interval\n",
"# TODO: Provide 1-2 sentence business interpretation\n",
"\n",
"# Your code here:\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Q5: Gender Pay Gap Analysis (t-test)\n",
"\n",
"**Business Question:** Is there a statistically significant difference in compensation between male and female employees?\n",
"\n",
"Compare mean `MonthlyIncome` between Male vs Female employees using Welch's t-test. Report sample sizes (n), means, t-statistic, and p-value. Interpret the results practically, not just statistically.\n",
"\n",
"**Skills:** Two-sample t-test (Welch's), practical vs statistical significance.\n"
]
},
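{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Illustrative sketch (not the solution):** the cell below shows the shape of a Welch's t-test plus a simple effect size, which is one way to separate practical from statistical significance. The two groups are synthetic draws with made-up parameters, not the HR data, and the effect size shown (mean difference divided by the root mean of the two sample variances) is one of several Cohen's d variants, chosen here as an assumption.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from scipy import stats\n",
"\n",
"rng = np.random.default_rng(0)\n",
"# Synthetic incomes (hypothetical parameters), not the HR data\n",
"group_a = rng.normal(6500, 4700, size=300)\n",
"group_b = rng.normal(6700, 4700, size=200)\n",
"\n",
"# equal_var=False selects Welch's t-test (no equal-variance assumption)\n",
"t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)\n",
"\n",
"# Practical significance: raw mean difference and a standardized effect size\n",
"diff = group_a.mean() - group_b.mean()\n",
"sd_avg = np.sqrt((group_a.var(ddof=1) + group_b.var(ddof=1)) / 2)\n",
"print(f'n_a={group_a.size}, n_b={group_b.size}')\n",
"print(f't={t_stat:.3f}, p={p_value:.4f}, diff={diff:.0f}, d={diff / sd_avg:.3f}')\n",
"```\n",
"\n",
"A small p-value with a tiny standardized difference would be statistically but not practically significant; a large difference with a wide spread can go the other way.\n"
]
},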
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# TODO: Split MonthlyIncome data by Gender (Male vs Female)\n",
"# TODO: Perform Welch's t-test using stats.ttest_ind(equal_var=False)\n",
"# TODO: Report sample sizes, means, t-statistic, and p-value\n",
"# TODO: Interpret results both statistically and practically\n",
"\n",
"# Your code here:\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Q6: Overtime vs Attrition Association (Chi-square)\n",
"\n",
"**Business Question:** Are employees who work overtime more likely to leave the company?\n",
"\n",
"Test whether `Attrition` (Yes/No) is associated with `OverTime` (Yes/No) using a chi-square test. Create a contingency table, calculate attrition rates for each group, then report χ², p-value, and business takeaway.\n",
"\n",
"**Skills:** Chi-square test of independence, contingency tables, interpreting associations.\n"
]
},
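{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Illustrative sketch (not the solution):** the steps named above (contingency table, per-group rates, chi-square test) can be laid out as follows. The counts in `toy` are invented for illustration only.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"from scipy import stats\n",
"\n",
"# Toy data (hypothetical counts): 40 overtime workers, 60 without overtime\n",
"toy = pd.DataFrame({\n",
"    'OverTime': ['Yes'] * 40 + ['No'] * 60,\n",
"    'Attrition': ['Yes'] * 12 + ['No'] * 28 + ['Yes'] * 6 + ['No'] * 54,\n",
"})\n",
"\n",
"table = pd.crosstab(toy['OverTime'], toy['Attrition'])\n",
"# Row-normalized table: attrition rate within each OverTime group\n",
"rates = pd.crosstab(toy['OverTime'], toy['Attrition'], normalize='index')\n",
"\n",
"# For a 2x2 table SciPy applies Yates' continuity correction by default\n",
"chi2, p, dof, expected = stats.chi2_contingency(table)\n",
"print(table)\n",
"print(rates)\n",
"print(f'chi2={chi2:.2f}, p={p:.4f}, dof={dof}')\n",
"```\n",
"\n",
"The `expected` array is worth a glance: the chi-square approximation is usually considered reliable when expected counts are not too small (a common rule of thumb is at least 5 per cell).\n"
]
},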
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# TODO: Create contingency table with pd.crosstab(OverTime, Attrition)\n",
"# TODO: Calculate attrition rates for each OverTime group\n",
"# TODO: Perform chi-square test using stats.chi2_contingency() (note: it applies Yates' continuity correction to 2x2 tables by default)\n",
"# TODO: Report χ², p-value, and business interpretation\n",
"\n",
"# Your code here:\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Q7: Business Travel vs Attrition (Chi-square)\n",
"\n",
"**Business Question:** Is the frequency of business travel associated with employee attrition?\n",
"\n",
"Test whether `BusinessTravel` category is associated with `Attrition`. Create a contingency table, perform chi-square test, and report χ², p-value, and interpretation.\n",
"\n",
"**Skills:** Multi-category chi-square test, interpreting business travel impact on retention.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# TODO: Create contingency table with pd.crosstab(BusinessTravel, Attrition)\n",
"# TODO: Perform chi-square test using stats.chi2_contingency()\n",
"# TODO: Report χ², p-value, and interpretation\n",
"# TODO: Discuss impact of travel frequency on retention\n",
"\n",
"# Your code here:\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Q8: Distance & Tenure Effects (t-test + Correlation)\n",
"\n",
"**Business Question:** Do employees who live farther away or have different tenure patterns show different attrition behavior?\n",
"\n",
"**Part A:** Compare `DistanceFromHome` between employees who left (Attrition=Yes) vs stayed (Attrition=No) using Welch's t-test.\n",
"\n",
"**Part B:** Compute Pearson correlation between `Age` and `YearsAtCompany`. Create a scatter plot with a fitted trend line.\n",
"\n",
"**Skills:** T-test for continuous variables, correlation analysis, scatter plots with trend lines.\n"
]
},
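{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Illustrative sketch (not the solution):** for Part B, a Pearson correlation and a straight-line fit can be combined in one plot. The `x` and `y` arrays below are synthetic and their relationship is invented; only the calls are meant to transfer.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"from scipy import stats\n",
"\n",
"rng = np.random.default_rng(2)\n",
"# Synthetic data (hypothetical relationship), not the HR data\n",
"x = rng.uniform(18, 60, size=150)\n",
"y = 0.3 * (x - 18) + rng.normal(0, 4, size=150)\n",
"\n",
"r, p = stats.pearsonr(x, y)\n",
"# np.polyfit returns coefficients from the highest power down\n",
"slope, intercept = np.polyfit(x, y, deg=1)\n",
"\n",
"xs = np.linspace(x.min(), x.max(), 100)\n",
"plt.scatter(x, y, alpha=0.4)\n",
"plt.plot(xs, slope * xs + intercept, color='red')\n",
"plt.title(f'r = {r:.2f}, p = {p:.3g}')\n",
"plt.show()\n",
"```\n",
"\n",
"Pearson's r measures linear association only, so the scatter plot is the check that a straight line is a sensible summary.\n"
]
},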
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# TODO Part A: Compare DistanceFromHome between Attrition groups using Welch's t-test\n",
"# TODO: Report means, t-statistic, p-value for distance comparison\n",
"\n",
"# TODO Part B: Calculate Pearson correlation between Age and YearsAtCompany\n",
"# TODO: Create scatter plot with fitted trend line using np.polyfit()\n",
"# TODO: Report correlation coefficient and p-value\n",
"\n",
"# Your code here:\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Q9: Income Drivers (OLS Regression)\n",
"\n",
"**Business Question:** What factors most strongly predict employee compensation?\n",
"\n",
"Fit an OLS regression model: `MonthlyIncome ~ JobLevel + YearsAtCompany + OverTime + StockOptionLevel + C(Department)`\n",
"\n",
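"Create dummy variables as needed (drop one level per categorical variable, e.g. with `drop_first=True`, so the dummies are not collinear with the constant), add a constant with `sm.add_constant()`, and fit the model. Report coefficients, 95% confidence intervals, R², and 2-3 business-readable insights about which factors drive compensation.\n",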
"\n",
"**Skills:** Multiple linear regression, dummy variables, interpreting coefficients in business context.\n"
]
},
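{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Illustrative sketch (not the solution):** the dummy-variable and constant steps described above might look like this on a reduced, synthetic model. The data-generating numbers are invented, and only two predictors are used to keep the sketch short. Passing `dtype=float` to `pd.get_dummies` is a precaution, assuming a recent pandas where dummies default to booleans, so that the design matrix stays numeric for statsmodels.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"import statsmodels.api as sm\n",
"\n",
"rng = np.random.default_rng(1)\n",
"n = 200\n",
"# Synthetic data (hypothetical coefficients), not the HR data\n",
"toy = pd.DataFrame({\n",
"    'JobLevel': rng.integers(1, 6, size=n),\n",
"    'Department': rng.choice(['Sales', 'R&D', 'HR'], size=n),\n",
"})\n",
"toy['MonthlyIncome'] = 1000 + 4000 * toy['JobLevel'] + rng.normal(0, 1000, size=n)\n",
"\n",
"# drop_first=True leaves one baseline category per categorical variable\n",
"X = pd.get_dummies(toy[['JobLevel', 'Department']], columns=['Department'],\n",
"                   drop_first=True, dtype=float)\n",
"X = sm.add_constant(X)\n",
"\n",
"model = sm.OLS(toy['MonthlyIncome'], X).fit()\n",
"print(model.params)\n",
"print(model.conf_int(alpha=0.05))  # 95% confidence intervals\n",
"print(f'R^2 = {model.rsquared:.3f}')\n",
"```\n",
"\n",
"Each dummy coefficient is read relative to the dropped baseline category, holding the other predictors fixed.\n"
]
},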
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# TODO: Create dummy variables for categorical predictors (OverTime, Department), dropping one baseline level per variable\n",
"# TODO: Set up regression: MonthlyIncome ~ JobLevel + YearsAtCompany + OverTime + StockOptionLevel + Department\n",
"# TODO: Add constant with sm.add_constant() and fit OLS model\n",
"# TODO: Report coefficients, 95% CIs, R², and business insights\n",
"\n",
"# Your code here:\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Q10: Business Recommendation\n",
"\n",
"**Business Question:** Based on your analysis, what is one actionable strategy leadership should consider?\n",
"\n",
"Based on your findings from Q1-Q9, provide one concrete business recommendation (e.g., targeted retention focus for a high-risk subgroup, overtime policy review, compensation adjustments).\n",
"\n",
"Your recommendation should be:\n",
"- Grounded in your statistical findings\n",
"- Specific and actionable\n",
"- Include potential impact/next steps\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.9"
}
},
"nbformat": 4,
"nbformat_minor": 2
}