dlepre01/BRFSS--Heart-attack-prediction

BRFSS – Heart Attack Risk Prediction

Overview

This project develops a machine learning framework to predict the risk of heart attack using data from the 2011 Behavioral Risk Factor Surveillance System (BRFSS).

The dataset includes over 500,000 survey responses and 450 variables describing health conditions, behaviors and demographics.

The objective is to identify individuals at risk of heart attack and evaluate predictive models under strong class imbalance conditions.


Dataset

Source: CDC – BRFSS 2011

  • 506,467 records
  • 450 variables
  • Binary target variable: CVDINFR4 (ever diagnosed with heart attack)

The dataset contains:

  • Questionnaire variables
  • Computed health indicators
  • Demographic variables
  • Clinical history variables

After preprocessing and feature selection, 56 independent variables were retained.

The minority class represents approximately 6% of the dataset.


Data Preprocessing

Data Cleaning

  • Removal of structurally dependent variables
  • Conversion of symbolic numeric encodings
  • Filtering records with missing target variable
  • Stratified train/test split (80/20, seed = 1618)
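
The target filtering and stratified split above can be sketched with scikit-learn. This is a minimal illustration on synthetic data, not the project's actual workflow: the array names and the synthetic features are assumptions, and only the 80/20 ratio, the stratification and the seed 1618 come from the list above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 10_000
X_all = rng.normal(size=(n, 5))                    # synthetic stand-in features
y_raw = (rng.random(n) < 0.06).astype(float)       # ~6% positives, as in the dataset
y_raw[rng.random(n) < 0.01] = np.nan               # simulate records with a missing target

# Drop records whose target is missing before splitting.
keep = ~np.isnan(y_raw)
X, y = X_all[keep], y_raw[keep].astype(int)

# Stratified 80/20 split with the seed quoted above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=1618
)

print(f"positive rate train={y_train.mean():.4f} test={y_test.mean():.4f}")
```

Stratifying on the target keeps the positive rate nearly identical in both partitions, which matters when only about 6% of records are positive.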

Outlier Treatment

Isolation Forest (H2O implementation):

  • Estimated contamination α = 0.0001
  • 45 training records removed
  • 12 test records removed
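
As a rough illustration of contamination-based outlier removal, the sketch below uses scikit-learn's `IsolationForest` as a stand-in for the H2O implementation the project used. The synthetic data and all settings other than the contamination value of 0.0001 are assumptions.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
n = 20_000
X = rng.normal(size=(n, 5))  # synthetic stand-in for the training features

# contamination mirrors the alpha = 0.0001 quoted above; other settings are defaults.
detector = IsolationForest(contamination=0.0001, random_state=0)
labels = detector.fit_predict(X)  # -1 = flagged as outlier, 1 = inlier

inlier_mask = labels == 1
X_clean = X[inlier_mask]
n_removed = int((~inlier_mask).sum())
print(f"removed {n_removed} of {n} records")
```

With such a small contamination value only a handful of the most isolated records are flagged, which is consistent in scale with the few dozen records removed here.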

Missing Value Handling

  • IterativeImputer (scikit-learn)
  • MinMax scaling prior to imputation
  • Nominal variables encoded with OneHotEncoder
  • High-cardinality variable encoded with TargetEncoder
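
The scale-then-impute step can be sketched as a small scikit-learn pipeline. This is a minimal example on synthetic numeric data under assumed settings (`max_iter`, `random_state`); it leaves out the one-hot and target encoding steps and does not reproduce the project's configuration.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
X[:, 1] = 0.8 * X[:, 0] + 0.2 * X[:, 1]    # correlated columns give the imputer some signal
X[rng.random(X.shape) < 0.10] = np.nan     # knock out ~10% of the values at random

# MinMaxScaler ignores NaNs when fitting and passes them through, so it can run first.
pipe = make_pipeline(
    MinMaxScaler(),
    IterativeImputer(max_iter=10, random_state=0),
)
X_imputed = pipe.fit_transform(X)
print("remaining NaNs:", int(np.isnan(X_imputed).sum()))
```

In a real workflow the pipeline would be fitted on the training partition only and then applied to the test partition with `pipe.transform`.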

Feature Selection

  • Tree-based feature importance (Random Forest)
  • Impurity-based ranking
  • Correlation filtering (features with pairwise correlation ρ > 0.75 removed)
  • Dimensionality reduced from 142 to 56 features
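
One way to combine importance ranking with correlation filtering is sketched below. How the project combined the two steps is not stated, so the greedy rule here (walk features from most to least important, skip any whose absolute correlation with an already kept feature exceeds 0.75) is an assumption, as are the synthetic data and the forest settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 2000
signal = rng.normal(size=n)
X = np.column_stack([
    signal,                              # column 0: informative
    signal + 0.1 * rng.normal(size=n),   # column 1: near-duplicate of column 0
    rng.normal(size=n),                  # column 2: independent noise
    rng.normal(size=n),                  # column 3: independent noise
])
y = (signal + 0.5 * rng.normal(size=n) > 1.5).astype(int)

# Impurity-based importances from a random forest, highest first.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
order = np.argsort(forest.feature_importances_)[::-1]

# Greedy correlation filter using the 0.75 threshold from the list above.
corr = np.abs(np.corrcoef(X, rowvar=False))
kept = []
for j in order:
    if all(corr[j, k] <= 0.75 for k in kept):
        kept.append(int(j))

print("kept feature indices:", kept)
```

Only one of the two near-duplicate columns survives, while both uncorrelated columns are kept regardless of their importance.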

Modeling Strategy

Models trained using H2O AutoML:

  • Logistic Regression (GLM)
  • Random Forest
  • Gradient Boosting Machine
  • Feedforward Neural Networks
  • Stacked Ensembles

Four imbalance-handling strategies were tested:

  1. No rebalancing
  2. Undersampling + weighting
  3. Class weighting
  4. Oversampling
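
To make two of these strategies concrete, the sketch below computes inverse-frequency class weights and performs random undersampling of the majority class. The exact weighting and sampling schemes used in the project are not specified, so both helper functions and the 1:1 target ratio are hypothetical.

```python
import random
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n / (k * n_c), so rarer classes get larger weights."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * n_c) for c, n_c in counts.items()}


def undersample_majority(labels, ratio=1.0, seed=0):
    """Return sorted row indices keeping every minority row and
    ratio * n_minority randomly chosen majority rows."""
    counts = Counter(labels)
    minority = min(counts, key=counts.get)
    minority_idx = [i for i, c in enumerate(labels) if c == minority]
    majority_idx = [i for i, c in enumerate(labels) if c != minority]
    n_keep = min(len(majority_idx), int(ratio * len(minority_idx)))
    rng = random.Random(seed)
    return sorted(minority_idx + rng.sample(majority_idx, n_keep))


# 1,000 illustrative labels at 6% prevalence.
labels = [1] * 60 + [0] * 940
weights = balanced_class_weights(labels)
kept = undersample_majority(labels, ratio=1.0)

print(weights)
print(len(kept), "rows kept after undersampling")
```

At 6% prevalence the positive class receives a weight of roughly 8.3 against about 0.53 for the negative class.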

Leaderboard sorting metric: area under the precision-recall curve (AUCPR)
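
In H2O AutoML this corresponds to `sort_metric="AUCPR"`. As an illustration of what the metric rewards, the sketch below computes average precision, a common step-wise estimate of the area under the precision-recall curve. The function and the toy labels and scores are illustrative assumptions, and H2O's own AUCPR computation may differ in detail.

```python
def average_precision(y_true, scores):
    """Mean of the precision values measured at the rank of each positive example."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("average precision is undefined without positive examples")
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision among the top `rank` predictions
    return total / n_pos


# Toy example: 2 positives among 10 records, ranked 1st and 3rd by score.
y_true = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

ap = average_precision(y_true, scores)
prevalence = sum(y_true) / len(y_true)
print(f"average precision = {ap:.4f}, prevalence = {prevalence:.2f}")
```

A ranking that carries no information scores close to the prevalence on average, so under strong imbalance this metric leaves far more room to separate models than accuracy does.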


Model Performance

Across experiments:

  • Accuracy ≈ 0.937–0.938
  • AUC ≈ 0.90
  • F1 ≈ 0.50
  • Recall prioritized over precision

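Key observation: Accuracy stays artificially high under this imbalance, because a model that predicts the negative class for every record already reaches about 94% accuracy when positives make up roughly 6% of the data.
Precision-recall analysis gives a more meaningful evaluation.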
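
The arithmetic behind this observation can be checked with a short sketch. The confusion-matrix counts are hypothetical: 1,000 records at the roughly 6% prevalence stated above, scored by a trivial classifier that always predicts "no heart attack".

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts: 60 positives and 940 negatives, everything predicted negative.
always_negative = binary_metrics(tp=0, fp=0, fn=60, tn=940)
print(always_negative)
```

The trivial classifier reaches 0.94 accuracy while detecting no positive case at all, which is why recall, F1 and the precision-recall curve are the more informative measures here.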

The selected model improves positive detection performance by approximately 7–8x compared to random classification.


Key Insights

  • The choice of imbalance-handling strategy had limited impact on model ranking.
  • Tree-based ensembles consistently performed best.
  • Strong predictors include prior cardiovascular conditions and general health indicators.
  • High recall is clinically preferable despite moderate precision.

Tech Stack

  • KNIME
  • H2O AutoML
  • Python integration (scikit-learn)
  • Isolation Forest
  • IterativeImputer
  • Target Encoding
  • One-Hot Encoding

Authors

Daniele Lepre
Alice Anna Maria Brunazzi
Alessandro Della Beffa

About

Heart attack risk prediction using the 2011 BRFSS dataset (506k records) with H2O AutoML, imbalance handling strategies, feature selection and precision-recall optimization for clinical risk detection.
