BRFSS – Heart Attack Risk Prediction

Overview

This project develops a machine learning framework to predict the risk of heart attack using data from the 2011 Behavioral Risk Factor Surveillance System (BRFSS).

The dataset includes over 500,000 survey responses and 450 variables describing health conditions, behaviors and demographics.

The objective is to identify individuals at risk of heart attack and evaluate predictive models under strong class imbalance conditions.

Dataset

Source: CDC – BRFSS 2011

506,467 records
450 variables
Binary target variable: CVDINFR4 (ever diagnosed with heart attack)

The dataset contains:

Questionnaire variables
Computed health indicators
Demographic variables
Clinical history variables

After preprocessing and feature selection, 56 independent variables were retained.

The minority class represents approximately 6% of the dataset.

Data Preprocessing

Data Cleaning

Removal of structurally dependent variables
Conversion of symbolic numeric encodings
Filtering records with missing target variable
Stratified train/test split (80/20, seed = 1618)
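The target filtering and stratified split can be sketched with scikit-learn. This is a minimal illustration on synthetic data, not the project's actual KNIME workflow: the array names and the synthetic generator are assumptions, and only the 80/20 ratio and the seed 1618 come from the list above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the survey data (assumption: the real data is not loaded here).
rng = np.random.default_rng(0)
n = 10_000
X = rng.normal(size=(n, 5))
y_raw = (rng.random(n) < 0.06).astype(float)   # roughly 6% positives
y_raw[rng.random(n) < 0.01] = np.nan           # a few records with missing target

# Drop records whose target is missing.
has_target = ~np.isnan(y_raw)
X_kept = X[has_target]
y_kept = y_raw[has_target].astype(int)

# Stratified 80/20 split with the seed reported above.
X_train, X_test, y_train, y_test = train_test_split(
    X_kept, y_kept, test_size=0.20, stratify=y_kept, random_state=1618
)

print("positive rate (train):", y_train.mean())
print("positive rate (test): ", y_test.mean())
```

Stratifying on the target keeps the positive rate nearly identical in both partitions, which matters when the minority class is this small.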

Outlier Treatment

Isolation Forest (H2O implementation):

Estimated contamination α = 0.0001
45 training records removed
12 test records removed
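The project uses the H2O Isolation Forest. As a rough equivalent, the sketch below uses scikit-learn's IsolationForest instead, with the same contamination value. The synthetic data and the two planted extreme rows are assumptions made for illustration only.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic training matrix (assumption) with two planted extreme rows.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(20_000, 4))
X_train[0] = 25.0
X_train[1] = -25.0

# scikit-learn stand-in for the H2O model; contamination matches the value above.
iso = IsolationForest(contamination=0.0001, random_state=1618)
labels = iso.fit_predict(X_train)      # -1 marks an outlier, 1 an inlier

X_clean = X_train[labels == 1]
n_removed = int((labels == -1).sum())
print("records removed:", n_removed)
```

With a contamination this small only a handful of records per partition are flagged, which is consistent in scale with the 45 and 12 removals reported above.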

Missing Value Handling

IterativeImputer (scikit-learn)
MinMax scaling prior to imputation
Nominal variables encoded with OneHotEncoder
High-cardinality variable encoded with TargetEncoder
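One way to wire the scaling, imputation and one-hot steps together in scikit-learn is shown below. It is a sketch under assumptions: the column layout and synthetic data are invented, and the target-encoding step for the high-cardinality variable is left out to keep the example short.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Synthetic data (assumption): three numeric columns with gaps, one nominal column.
rng = np.random.default_rng(0)
n = 500
num = rng.normal(size=(n, 3))
num[:, 2] = 2 * num[:, 0] + rng.normal(scale=0.1, size=n)  # correlated column helps imputation
num[rng.random(num.shape) < 0.1] = np.nan                  # about 10% missing values
cat = rng.integers(0, 3, size=(n, 1)).astype(float)        # nominal codes 0, 1, 2
X = np.hstack([num, cat])

# MinMax scaling first (NaNs are ignored when fitting), then iterative imputation.
numeric = make_pipeline(
    MinMaxScaler(),
    IterativeImputer(max_iter=10, random_state=1618),
)

pre = ColumnTransformer(
    [
        ("num", numeric, [0, 1, 2]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), [3]),
    ],
    sparse_threshold=0.0,  # always return a dense array
)

X_ready = pre.fit_transform(X)
print(X_ready.shape)
```

Scaling before imputation puts all numeric columns on the same range, so no single variable dominates the regressions that IterativeImputer fits internally.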

Feature Selection

Tree-based feature importance (Random Forest)
Impurity-based ranking
Correlation filtering (features with pairwise correlation ρ > 0.75 removed)
Dimensionality reduced from 142 to 56 features
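The two selection steps can be combined as in the sketch below. Several details are assumptions, since the list above does not spell them out: the filter is applied to the absolute correlation, and when two features are correlated the one with the higher impurity-based importance is kept. The data is synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (assumption): feature 1 is a near-duplicate of feature 0.
rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 6))
X[:, 1] = X[:, 0] + rng.normal(scale=0.05, size=n)
y = (X[:, 0] + X[:, 2] + rng.normal(scale=0.5, size=n) > 0).astype(int)

# Impurity-based ranking from a Random Forest, most important first.
rf = RandomForestClassifier(n_estimators=100, random_state=1618).fit(X, y)
order = np.argsort(rf.feature_importances_)[::-1]

# Greedy correlation filter (assumption: absolute correlation, threshold 0.75).
corr = np.abs(np.corrcoef(X, rowvar=False))
kept = []
for j in order:
    if all(corr[j, k] <= 0.75 for k in kept):
        kept.append(int(j))

print("kept feature indices:", sorted(kept))
```

Walking the features in importance order means that, of each highly correlated pair, the less informative member is the one dropped.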

Modeling Strategy

Models trained using H2O AutoML:

Logistic Regression (GLM)
Random Forest
Gradient Boosting Machine
Feedforward Neural Networks
Stacked Ensembles

Four imbalance-handling strategies were tested:

No rebalancing
Undersampling + weighting
Class weighting
Oversampling

Sorting metric: Area Under the Precision-Recall Curve
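To make the comparison concrete, the sketch below evaluates simplified versions of the four strategies and sorts them by area under the precision-recall curve. It does not use H2O AutoML: a scikit-learn logistic regression stands in for the model, the data is synthetic, the helper names `undersample` and `oversample` are hypothetical, and the undersampling variant here omits the additional weighting.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (assumption).
rng = np.random.default_rng(0)
n = 20_000
X = rng.normal(size=(n, 4))
logit = -3.5 + 1.5 * X[:, 0] + 1.0 * X[:, 1]
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1618
)

def undersample(X, y, rng):
    """Keep all positives and an equally sized random sample of negatives."""
    pos = np.flatnonzero(y == 1)
    neg = rng.choice(np.flatnonzero(y == 0), size=len(pos), replace=False)
    idx = np.concatenate([pos, neg])
    return X[idx], y[idx]

def oversample(X, y, rng):
    """Keep all negatives and resample positives with replacement to match."""
    neg = np.flatnonzero(y == 0)
    pos = rng.choice(np.flatnonzero(y == 1), size=len(neg), replace=True)
    idx = np.concatenate([neg, pos])
    return X[idx], y[idx]

strategies = {
    "none": (X_tr, y_tr, None),
    "undersample": (*undersample(X_tr, y_tr, rng), None),
    "class_weight": (X_tr, y_tr, "balanced"),
    "oversample": (*oversample(X_tr, y_tr, rng), None),
}

results = {}
for name, (Xs, ys, cw) in strategies.items():
    model = LogisticRegression(class_weight=cw, max_iter=1000).fit(Xs, ys)
    scores = model.predict_proba(X_te)[:, 1]
    results[name] = average_precision_score(y_te, scores)  # AUPRC estimate

# Leaderboard sorted by AUPRC, best first.
leaderboard = sorted(results.items(), key=lambda kv: kv[1], reverse=True)
for name, ap in leaderboard:
    print(f"{name:13s} AUPRC = {ap:.3f}")
```

The test partition is never resampled, so every strategy is scored against the original class distribution.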

Model Performance

Across experiments:

Accuracy ≈ 0.937–0.938
AUC ≈ 0.90
F1 ≈ 0.50
Recall prioritized over precision

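Key observation: accuracy remained artificially high because of the imbalance. With a minority class of roughly 6%, a classifier that always predicts the majority class would already score about 0.94.
Precision-Recall analysis provided a more meaningful evaluation.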

The selected model improves positive-class detection by approximately 7–8× compared to a random classifier.
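The sketch below illustrates why accuracy is uninformative at this class balance and shows one common way to express a gain over random classification. It rests on an assumption: that the gain is read as average precision divided by the positive rate, which is the score a random ranker is expected to reach. The README does not state how its 7–8x figure was computed, and the function name `pr_lift` is hypothetical. All data is synthetic.

```python
import numpy as np
from sklearn.metrics import accuracy_score, average_precision_score

# Synthetic labels with roughly 6% positives (assumption).
rng = np.random.default_rng(0)
n = 100_000
y_true = (rng.random(n) < 0.06).astype(int)
prevalence = y_true.mean()

# A classifier that always predicts the negative class.
always_negative = np.zeros(n, dtype=int)
baseline_acc = accuracy_score(y_true, always_negative)

# A scorer that ranks records at random.
random_scores = rng.random(n)
random_ap = average_precision_score(y_true, random_scores)

def pr_lift(ap, prevalence):
    """Ratio of a model's average precision to the random-ranking baseline."""
    return ap / prevalence

print("accuracy of always-negative classifier:", round(baseline_acc, 3))
print("average precision of random scores:    ", round(random_ap, 3))
print("lift of random scores over baseline:   ", round(pr_lift(random_ap, prevalence), 2))
```

The always-negative classifier reaches an accuracy close to the values reported above without detecting a single positive, while random scores have an average precision close to the positive rate.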

Key Insights

Class imbalance had limited impact on model ranking.
Tree-based ensembles consistently performed best.
Strong predictors include prior cardiovascular conditions and general health indicators.
High recall is clinically preferable despite moderate precision.

Tech Stack

KNIME
H2O AutoML
Python integration (scikit-learn)
Isolation Forest
IterativeImputer
Target Encoding
One-Hot Encoding

Authors

Daniele Lepre
Alice Anna Maria Brunazzi
Alessandro Della Beffa
