This project develops a machine learning framework to predict the risk of heart attack using data from the 2011 Behavioral Risk Factor Surveillance System (BRFSS).
The dataset includes over 500,000 survey responses and 450 variables describing health conditions, behaviors and demographics.
The objective is to identify individuals at risk of heart attack and evaluate predictive models under strong class imbalance conditions.
Source: CDC – BRFSS 2011
- 506,467 records
- 450 variables
- Binary target variable: CVDINFR4 (ever diagnosed with heart attack)
The dataset contains:
- Questionnaire variables
- Computed health indicators
- Demographic variables
- Clinical history variables
After preprocessing and feature selection, 56 independent variables were retained.
The minority class represents approximately 6% of the dataset.
Preprocessing steps:
- Removal of structurally dependent variables
- Conversion of symbolic numeric encodings
- Filtering records with missing target variable
- Stratified train/test split (80/20, seed = 1618)
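The filtering and splitting steps can be sketched in plain Python. This is a minimal illustration, not the project's actual workflow: the target codes (1 = yes, 2 = no, 7 = don't know, 9 = refused) are assumed from typical BRFSS coding, and the helper names `recode_target` and `stratified_split` are hypothetical.

```python
import random
from collections import defaultdict

# Assumed BRFSS-style codes for the target: 1 = yes, 2 = no,
# 7 = don't know, 9 = refused, None = blank. Only 1 and 2 are usable.
TARGET_MAP = {1: 1, 2: 0}


def recode_target(raw):
    """Map a raw CVDINFR4 code to 1/0, or None when the answer is unusable."""
    return TARGET_MAP.get(raw)


def stratified_split(rows, label_key, test_fraction=0.2, seed=1618):
    """Split rows into train/test while preserving the class ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)

    train, test = [], []
    for members in by_class.values():
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


# Toy data: 1,000 hypothetical records, some with unusable target answers.
raw_codes = [1] * 60 + [2] * 920 + [7] * 10 + [9] * 5 + [None] * 5
records = [{"id": i, "target": recode_target(c)} for i, c in enumerate(raw_codes)]

# Drop records whose target is missing after recoding.
labelled = [r for r in records if r["target"] is not None]

train_rows, test_rows = stratified_split(labelled, "target")
```

Splitting each class separately keeps the share of positives nearly identical in both partitions, which matters when the minority class is only about 6% of the data.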
Isolation Forest (H2O implementation):
- Estimated contamination α = 0.0001
- 45 training records removed
- 12 test records removed
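The project used H2O's Isolation Forest; the sketch below swaps in scikit-learn's `IsolationForest` purely to illustrate how a contamination rate of 0.0001 translates into a handful of removed rows. The synthetic data, the number of trees and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for the training features (not the BRFSS data).
rng = np.random.default_rng(1618)
X_train = rng.normal(size=(20_000, 5))

# contamination sets the share of training rows flagged as anomalies.
detector = IsolationForest(n_estimators=100, contamination=0.0001, random_state=1618)
labels = detector.fit_predict(X_train)  # -1 = anomaly, 1 = inlier

X_train_clean = X_train[labels == 1]
n_removed = int((labels == -1).sum())
```

With such a low contamination rate only the most isolated rows are dropped, which is consistent in order of magnitude with the 45 training records removed above.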
Imputation and encoding:
- IterativeImputer (scikit-learn)
- MinMax scaling prior to imputation
- Nominal variables encoded with OneHotEncoder
- High-cardinality variable encoded with TargetEncoder
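To show the difference between the two encoders, here is a simplified pure-Python sketch. It is not the scikit-learn implementation: the `smoothing` parameter, the column values and the function names are hypothetical, and scikit-learn's `TargetEncoder` additionally uses cross-fitting during `fit_transform` to limit target leakage.

```python
from collections import defaultdict


def one_hot(values):
    """Return (categories, rows) with one 0/1 column per distinct category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


def target_encode(categories, targets, smoothing=10.0):
    """Replace each category with a smoothed mean of the binary target.

    Rare categories are pulled toward the global mean; `smoothing` is a
    hypothetical strength parameter chosen for illustration.
    """
    global_mean = sum(targets) / len(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    encoding = {
        cat: (sums[cat] + smoothing * global_mean) / (counts[cat] + smoothing)
        for cat in counts
    }
    return [encoding[cat] for cat in categories], encoding


# Hypothetical nominal column with few levels: one column per level.
cats, encoded_rows = one_hot(["married", "single", "married", "widowed"])

# Hypothetical high-cardinality column: a single numeric column.
states = ["A", "A", "A", "B", "B", "C"]
y = [1, 0, 0, 0, 0, 1]
encoded_states, mapping = target_encode(states, y, smoothing=2.0)
```

One-hot encoding adds a column per level, so it suits variables with few categories; target encoding keeps a single column regardless of cardinality, at the cost of needing the target and some protection against leakage.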
Feature selection:
- Tree-based feature importance (Random Forest)
- Impurity-based ranking
- Correlation filtering (features with pairwise correlation ρ > 0.75 removed)
- Dimensionality reduced from 142 to 56 features
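One plausible way to combine the importance ranking with the correlation filter is a greedy pass, sketched below. Two points are assumptions rather than facts from the project: the threshold is applied to the absolute Pearson correlation, and within a correlated pair the more important feature is the one kept. The feature names and values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def correlation_filter(columns, importance, threshold=0.75):
    """Visit features by descending importance and keep a feature only if its
    absolute correlation with every already kept feature is <= threshold."""
    kept = []
    for name in sorted(columns, key=lambda n: importance[n], reverse=True):
        if all(abs(pearson(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical features: "bmi_scaled" is nearly a copy of "bmi".
columns = {
    "bmi": [22.0, 27.5, 31.0, 24.3, 35.2, 29.1],
    "bmi_scaled": [2.2, 2.8, 3.1, 2.4, 3.5, 2.9],
    "age": [70, 35, 52, 61, 44, 29],
}
importance = {"bmi": 0.40, "bmi_scaled": 0.35, "age": 0.25}

selected = correlation_filter(columns, importance)
```

Here `bmi_scaled` is dropped because it is almost perfectly correlated with the higher-ranked `bmi`, while `age` survives.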
Models trained using H2O AutoML:
- Logistic Regression (GLM)
- Random Forest
- Gradient Boosting Machine
- Feedforward Neural Networks
- Stacked Ensembles
Four imbalance-handling strategies were tested:
- No rebalancing
- Undersampling + weighting
- Class weighting
- Oversampling
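The rebalancing strategies can be illustrated with a small pure-Python sketch. The weighting formula (n_samples / (n_classes × class_count)) is a common "balanced" heuristic and is an assumption here, as are the function names and the 1:1 sampling ratio; the project's H2O settings may differ.

```python
import random
from collections import Counter


def split_indices(labels):
    """Return (minority_indices, majority_indices) for binary labels."""
    counts = Counter(labels)
    minority = min(counts, key=counts.get)
    min_idx = [i for i, y in enumerate(labels) if y == minority]
    maj_idx = [i for i, y in enumerate(labels) if y != minority]
    return min_idx, maj_idx


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}


def random_undersample(labels, seed=1618):
    """Keep all minority rows plus an equal-sized random majority subset."""
    rng = random.Random(seed)
    min_idx, maj_idx = split_indices(labels)
    keep = rng.sample(maj_idx, min(len(maj_idx), len(min_idx)))
    return sorted(min_idx + keep)


def random_oversample(labels, seed=1618):
    """Resample minority rows with replacement up to the majority count."""
    rng = random.Random(seed)
    min_idx, maj_idx = split_indices(labels)
    extra = rng.choices(min_idx, k=len(maj_idx) - len(min_idx))
    return maj_idx + min_idx + extra


# Toy labels with about 6% positives, mirroring the imbalance described above.
labels = [1] * 60 + [0] * 940

weights = balanced_class_weights(labels)
under_idx = random_undersample(labels)
over_idx = random_oversample(labels)
```

Resampling of this kind should only touch the training partition; the test set keeps its original class ratio so that the reported metrics reflect the real prevalence.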
Leaderboard sorting metric: area under the precision-recall curve (AUCPR)
Across experiments:
- Accuracy ≈ 0.937–0.938
- AUC ≈ 0.90
- F1 ≈ 0.50
- Recall prioritized over precision
Key observation:
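Accuracy remained artificially high due to class imbalance: with a minority class of about 6%, always predicting the majority class already yields about 0.94 accuracy.
Precision-recall analysis provided a more meaningful evaluation.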
The selected model improves positive detection performance by approximately 7–8x compared to random classification.
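The effect of imbalance on accuracy, and one common way to express a lift over random classification, can be checked with a few lines of arithmetic. The row counts and the precision value of 0.45 below are hypothetical, and dividing precision by prevalence is only one possible reading of the comparison; the authors' exact calculation may differ.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def lift_over_random(precision, prevalence):
    """Ratio of a model's precision to the expected precision of random guessing."""
    return precision / prevalence


# Hypothetical test set with 6% positives.
n_rows, n_positive = 100_000, 6_000
prevalence = n_positive / n_rows

# A degenerate model that always predicts "no heart attack".
always_negative = confusion_metrics(tp=0, fp=0, fn=n_positive, tn=n_rows - n_positive)

# Hypothetical precision, chosen only to illustrate the ratio.
example_lift = lift_over_random(precision=0.45, prevalence=prevalence)
```

The degenerate model reaches 0.94 accuracy with zero recall, which is why accuracy alone says little here. A random classifier's expected precision equals the prevalence, so that value serves as the baseline in precision-recall analysis.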
Main findings:
- Class imbalance had limited impact on model ranking.
- Tree-based ensembles consistently performed best.
- Strong predictors include prior cardiovascular conditions and general health indicators.
- High recall is clinically preferable despite moderate precision.
Tools and techniques:
- KNIME
- H2O AutoML
- Python integration (scikit-learn)
- Isolation Forest
- IterativeImputer
- Target Encoding
- One-Hot Encoding
Authors:
Daniele Lepre
Alice Anna Maria Brunazzi
Alessandro Della Beffa
