CUNYTechPrep · ahmadbasyouni10 · Sep 14, 2025 · Sep 22, 2025
diff --git a/Week-05-Business-Stats-Analytics/Exercise-DONT-EDIT-MAKE-COPY.ipynb b/Week-05-Business-Stats-Analytics/Exercise-DONT-EDIT-MAKE-COPY.ipynb
@@ -0,0 +1,383 @@
+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "# Week 5 Exercise: Business Statistics Analytics\n",
+        "## IBM HR Analytics Employee Attrition Dataset\n",
+        "\n",
+        "**Objective:** Apply statistical methods from Week 5 to analyze employee data and answer stakeholder-style business questions.\n",
+        "\n",
+        "**Dataset:** `data/WA_Fn-UseC_-HR-Employee-Attrition.csv`  \n",
+        "**Source:** IBM HR Analytics Employee Attrition (fictional dataset created by IBM data scientists)  \n",
+        "**Size:** 1,470 employees, 35 features  \n",
+        "**Business Context:** You're an HR data analyst tasked with understanding workforce patterns and providing actionable insights to leadership.\n",
+        "\n",
+        "---\n",
+        "\n",
+        "## Dataset Overview\n",
+        "\n",
+        "This dataset contains employee information including:\n",
+        "- **Demographics:** Age, Gender, MaritalStatus, EducationField\n",
+        "- **Job Information:** Department, JobRole, JobLevel, MonthlyIncome\n",
+        "- **Satisfaction Metrics:** JobSatisfaction, EnvironmentSatisfaction, WorkLifeBalance\n",
+        "- **Attrition:** Whether the employee left the company (Yes/No)\n",
+        "- **Work Patterns:** OverTime, BusinessTravel, DistanceFromHome\n",
+        "\n",
+        "### Key Variable Encodings:\n",
+        "- **Education:** 1='Below College', 2='College', 3='Bachelor', 4='Master', 5='Doctor'\n",
+        "- **JobSatisfaction:** 1='Low', 2='Medium', 3='High', 4='Very High'\n",
+        "- **WorkLifeBalance:** 1='Bad', 2='Good', 3='Better', 4='Best'\n",
+        "- **PerformanceRating:** 1='Low', 2='Good', 3='Excellent', 4='Outstanding'\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Setup & Guidelines\n",
+        "\n",
+        "**Required Libraries:** pandas, numpy, scipy, statsmodels, matplotlib, seaborn\n",
+        "\n",
+        "**Statistical Guidelines:**\n",
+        "- Use **95% confidence intervals** unless stated otherwise\n",
+        "- For group comparisons, always **report sample sizes** (n)\n",
+        "- If any group has **n < 30**, mention it and proceed with caution\n",
+        "- For proportion tests, show **counts and proportions** clearly\n",
+        "- Always include **1-3 sentence interpretations** (statistical + business context)\n",
+        "- State your **assumptions** and **test choices**\n",
+        "\n",
+        "**Important Notes:**\n",
+        "- This dataset is clean with no missing values\n",
+        "- Some variables are encoded as numbers but represent categories (see above)\n",
+        "- Sample sizes are generally adequate (n ≥ 30) for most analyses\n",
+        "- Focus on **practical significance** in addition to statistical significance\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "# Setup Cell - Run this first\n",
+        "import pandas as pd\n",
+        "import numpy as np\n",
+        "import matplotlib.pyplot as plt\n",
+        "import seaborn as sns\n",
+        "from scipy import stats\n",
+        "import statsmodels.api as sm\n",
+        "import warnings\n",
+        "warnings.filterwarnings('ignore')\n",
+        "\n",
+        "# Load the dataset\n",
+        "df = pd.read_csv('data/WA_Fn-UseC_-HR-Employee-Attrition.csv')\n",
+        "\n",
+        "# Basic dataset info\n",
+        "print(f\"Dataset loaded: {df.shape[0]} employees, {df.shape[1]} features\")\n",
+        "print(f\"Missing values: {df.isnull().sum().sum()}\")\n",
+        "print(\"\\nFirst few rows:\")\n",
+        "df.head()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Q1: Workforce Overview\n",
+        "\n",
+        "**Business Question:** What's our current workforce composition and attrition situation?\n",
+        "\n",
+        "Report total headcount, attrition count, and overall attrition rate. Break down attrition by Department with a simple bar chart.\n",
+        "\n",
+        "**Skills:** Descriptive statistics, basic data exploration, visualization.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "# TODO: Calculate total headcount and attrition statistics\n",
+        "# TODO: Break down attrition by Department\n",
+        "# TODO: Create a bar chart showing attrition rate by department\n",
+        "# TODO: Provide 1-2 sentence business interpretation\n",
+        "\n",
+        "# Your code here:\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Q2: Top Roles by Compensation Contribution\n",
+        "\n",
+        "**Business Question:** Which job roles contribute most to our total compensation spend?\n",
+        "\n",
+        "Rank `JobRole` by total `MonthlyIncome` (sum across all employees). Show the top 5 roles and their share of total compensation.\n",
+        "\n",
+        "**Skills:** Groupby operations, ranking, percentage calculations.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "# TODO: Group by JobRole and calculate total MonthlyIncome for each role\n",
+        "# TODO: Rank roles by total compensation and show top 5\n",
+        "# TODO: Calculate percentage share of total compensation\n",
+        "# TODO: Create visualization of top 5 roles\n",
+        "# TODO: Provide business insight\n",
+        "\n",
+        "# Your code here:\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Q3: Education Field Compensation Analysis\n",
+        "\n",
+        "**Business Question:** Which education background commands the highest average salary?\n",
+        "\n",
+        "Find which `EducationField` has the highest average `MonthlyIncome`. Report sample size (n), mean, and standard deviation for each field. Then compute a 95% bootstrap confidence interval for mean `MonthlyIncome` in each of the top two education fields.\n",
+        "\n",
+        "**Skills:** Groupby analysis, confidence intervals (bootstrap method).\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "# TODO: Group by EducationField and calculate mean, std, count for MonthlyIncome\n",
+        "# TODO: Identify the top 2 education fields by average income\n",
+        "# TODO: Create bootstrap confidence intervals for the top 2 fields\n",
+        "# TODO: Report n, mean, and 95% CI for each field\n",
+        "# TODO: Provide business interpretation\n",
+        "\n",
+        "# Bootstrap CI helper function (you can use this):\n",
+        "def bootstrap_ci_mean(series, n_boot=5000, ci=0.95, seed=42):\n",
+        "    \"\"\"Percentile bootstrap confidence interval for the mean of a numeric Series.\"\"\"\n",
+        "    rng = np.random.default_rng(seed)  # local generator; global NumPy RNG state is left untouched\n",
+        "    boots = [rng.choice(series.to_numpy(), size=len(series), replace=True).mean()\n",
+        "             for _ in range(n_boot)]\n",
+        "    lo = np.quantile(boots, (1 - ci) / 2)\n",
+        "    hi = np.quantile(boots, 1 - (1 - ci) / 2)\n",
+        "    return lo, hi\n",
+        "\n",
+        "# Your code here:\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Q4: Sales Department Compensation Confidence Interval\n",
+        "\n",
+        "**Business Question:** What range of values is plausible for average compensation in the Sales department?\n",
+        "\n",
+        "Compute a 95% bootstrap confidence interval for mean `MonthlyIncome` in the Sales department. Report sample size (n), mean, confidence interval, and provide a 1-2 sentence business interpretation.\n",
+        "\n",
+        "**Skills:** Bootstrap confidence intervals, business interpretation of statistical results.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "# TODO: Filter data for Sales department\n",
+        "# TODO: Calculate bootstrap 95% CI for mean MonthlyIncome in Sales\n",
+        "# TODO: Report sample size, mean, and confidence interval\n",
+        "# TODO: Provide 1-2 sentence business interpretation\n",
+        "\n",
+        "# Your code here:\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Q5: Gender Pay Gap Analysis (t-test)\n",
+        "\n",
+        "**Business Question:** Is there a statistically significant difference in compensation between male and female employees?\n",
+        "\n",
+        "Compare mean `MonthlyIncome` between Male vs Female employees using Welch's t-test. Report sample sizes (n), means, t-statistic, and p-value. Interpret the results practically, not just statistically.\n",
+        "\n",
+        "**Skills:** Two-sample t-test (Welch's), practical vs statistical significance.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "# TODO: Split MonthlyIncome data by Gender (Male vs Female)\n",
+        "# TODO: Perform Welch's t-test using stats.ttest_ind(equal_var=False)\n",
+        "# TODO: Report sample sizes, means, t-statistic, and p-value\n",
+        "# TODO: Interpret results both statistically and practically\n",
+        "\n",
+        "# Your code here:\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Q6: Overtime vs Attrition Association (Chi-square)\n",
+        "\n",
+        "**Business Question:** Are employees who work overtime more likely to leave the company?\n",
+        "\n",
+        "Test whether `Attrition` (Yes/No) is associated with `OverTime` (Yes/No) using a chi-square test. Create a contingency table, calculate attrition rates for each group, then report χ², the p-value, and a business takeaway.\n",
+        "\n",
+        "**Skills:** Chi-square test of independence, contingency tables, interpreting associations.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "# TODO: Create contingency table with pd.crosstab(df['OverTime'], df['Attrition'])\n",
+        "# TODO: Calculate attrition rates for each OverTime group\n",
+        "# TODO: Perform chi-square test using stats.chi2_contingency() (note: Yates' continuity correction is applied to 2x2 tables by default)\n",
+        "# TODO: Report χ², p-value, and business interpretation\n",
+        "\n",
+        "# Your code here:\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Q7: Business Travel vs Attrition (Chi-square)\n",
+        "\n",
+        "**Business Question:** Does the frequency of business travel affect employee retention?\n",
+        "\n",
+        "Test whether `BusinessTravel` category is associated with `Attrition`. Create a contingency table, perform a chi-square test, and report χ², the p-value, and an interpretation.\n",
+        "\n",
+        "**Skills:** Multi-category chi-square test, interpreting business travel impact on retention.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "# TODO: Create contingency table with pd.crosstab(df['BusinessTravel'], df['Attrition'])\n",
+        "# TODO: Perform chi-square test using stats.chi2_contingency()\n",
+        "# TODO: Report χ², p-value, and interpretation\n",
+        "# TODO: Discuss impact of travel frequency on retention\n",
+        "\n",
+        "# Your code here:\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Q8: Distance & Tenure Effects (t-test + Correlation)\n",
+        "\n",
+        "**Business Question:** Do employees who live farther away or have different tenure patterns show different attrition behavior?\n",
+        "\n",
+        "**Part A:** Compare `DistanceFromHome` between employees who left (Attrition=Yes) vs stayed (Attrition=No) using Welch's t-test.\n",
+        "\n",
+        "**Part B:** Compute Pearson correlation between `Age` and `YearsAtCompany`. Create a scatter plot with a fitted trend line.\n",
+        "\n",
+        "**Skills:** T-test for continuous variables, correlation analysis, scatter plots with trend lines.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "# TODO Part A: Compare DistanceFromHome between Attrition groups using Welch's t-test\n",
+        "# TODO: Report means, t-statistic, p-value for distance comparison\n",
+        "\n",
+        "# TODO Part B: Calculate Pearson correlation between Age and YearsAtCompany\n",
+        "# TODO: Create scatter plot with fitted trend line using np.polyfit()\n",
+        "# TODO: Report correlation coefficient and p-value\n",
+        "\n",
+        "# Your code here:\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Q9: Income Drivers (OLS Regression)\n",
+        "\n",
+        "**Business Question:** What factors most strongly predict employee compensation?\n",
+        "\n",
+        "Fit an OLS regression model: `MonthlyIncome ~ JobLevel + YearsAtCompany + OverTime + StockOptionLevel + C(Department)`\n",
+        "\n",
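+        "Create dummy variables for the categorical predictors (`OverTime`, `Department`), dropping one reference level per variable so the dummies are not collinear with the constant. Add the constant with `sm.add_constant()` and fit the model. Report coefficients, 95% confidence intervals, R², and 2-3 business-readable insights about which factors drive compensation.\n",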
+        "\n",
+        "**Skills:** Multiple linear regression, dummy variables, interpreting coefficients in business context.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "# TODO: Create dummy variables for categorical predictors (OverTime, Department), e.g. pd.get_dummies(..., drop_first=True, dtype=float)\n",
+        "# TODO: Set up regression: MonthlyIncome ~ JobLevel + YearsAtCompany + OverTime + StockOptionLevel + Department\n",
+        "# TODO: Add constant with sm.add_constant() and fit OLS model\n",
+        "# TODO: Report coefficients, 95% CIs, R², and business insights\n",
+        "\n",
+        "# Your code here:\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Q10: Business Recommendation\n",
+        "\n",
+        "**Business Question:** Based on your analysis, what is one actionable strategy leadership should consider?\n",
+        "\n",
+        "Based on your findings from Q1-Q9, provide one concrete business recommendation (e.g., targeted retention focus for a high-risk subgroup, overtime policy review, compensation adjustments).\n",
+        "\n",
+        "Your recommendation should be:\n",
+        "- Grounded in your statistical findings\n",
+        "- Specific and actionable\n",
+        "- Clear about potential impact and next steps\n"
+      ]
+    }
+  ],
+  "metadata": {
+    "kernelspec": {
+      "display_name": ".venv",
+      "language": "python",
+      "name": "python3"
+    },
+    "language_info": {
+      "codemirror_mode": {
+        "name": "ipython",
+        "version": 3
+      },
+      "file_extension": ".py",
+      "mimetype": "text/x-python",
+      "name": "python",
+      "nbconvert_exporter": "python",
+      "pygments_lexer": "ipython3",
+      "version": "3.11.9"
+    }
+  },
+  "nbformat": 4,
+  "nbformat_minor": 2
+}