GPU-Accelerated Genetic Programming: Diabetes Health Indicator Classifier

This repository features a high-performance Genetic Programming (GP) implementation developed using the DEAP framework and optimized for GPUs via CuPy. The project focuses on evolving symbolic mathematical expressions to solve binary classification tasks on large-scale datasets.

🚀 Key Engineering Highlights

GPU Vectorized Evaluation: Replaced traditional row-by-row Python loops with vectorized GPU operations. The feature matrix is transposed in VRAM so that each feature reaches an evolved tree as a single array, letting the tree process all 95,804 rows in one parallel operation.

Bloat Control & Stability: Implemented a dual-constraint system to prevent "survival of the fattest." Trees are capped at a height of 17 and a total size of 150 nodes, ensuring the generated Python expressions stay within the 200-nested-parenthesis limit of the Python parser.

Memory Management: Specifically optimized for VRAM hardware by integrating manual cache clearing for both system RAM and GPU memory pools between evolutionary runs.

Evolutionary Strategy: Utilizes an elitism-based approach (ea_with_elitism), preserving the top 2 individuals per generation to guarantee that the maximum fitness never regresses.
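The manual cache clearing described under Memory Management could look roughly like the sketch below. The helper name free_memory is hypothetical and not taken from the repository. The CuPy calls are its documented memory pool functions, and the function falls back to plain garbage collection when CuPy is not installed.

```python
import gc


def free_memory():
    """Release unreachable Python objects and, if CuPy is usable, its cached GPU blocks.

    Returns the number of objects the garbage collector reclaimed.
    """
    # Collect host-side garbage first so GPU arrays it referenced become free.
    collected = gc.collect()
    try:
        import cupy as cp
    except ImportError:
        return collected  # CPU-only environment: nothing more to release
    try:
        # Hand cached device and pinned-host blocks back to the driver.
        cp.get_default_memory_pool().free_all_blocks()
        cp.get_default_pinned_memory_pool().free_all_blocks()
    except Exception:
        pass  # CuPy is installed but no usable CUDA device is present
    return collected


for run in range(3):
    # ... one evolutionary run would go here ...
    reclaimed = free_memory()
```

Calling the helper between runs keeps one run's cached blocks from counting against the next run's VRAM budget.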

📊 Dataset: Diabetes Health Indicators [CDC BRFSS]

The model is trained on a dataset containing health-related survey responses from 95,804 individuals. The objective is to predict whether an individual has diabetes based on 21 clinical and lifestyle risk factors.

Target Variable: Binary classification (0 = No Diabetes, 1 = Diabetes/Pre-diabetes).

Key Feature Categories (21 Total)

To make this prediction, the Genetic Programming algorithm evolved a symbolic expression using features across three domains:

Clinical Indicators: HighBP (Blood Pressure), HighChol (Cholesterol), BMI (Body Mass Index), GenHlth (General Health rating).

Lifestyle Factors: Smoker, HvyAlcoholConsump, PhysActivity, Fruits, Veggies.

Socioeconomic/Demographic: Age, Sex, Education, Income, AnyHealthcare.

Methodology

Massive Data Parallelism (GPU Vectorization)

Evaluating a population of 1,000 individuals across 95,804 rows per generation is computationally expensive.

Solution: I moved the entire dataset to VRAM and transposed the feature matrix. This allowed the evolved mathematical trees to be evaluated as a single vectorized operation on the GTX 1650, processing all rows simultaneously.
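A minimal sketch of this evaluation pattern follows. It is not the repository's code: the synthetic data, the lambda standing in for a compiled tree, and the rule "positive output means class 1" are all assumptions made for illustration. It uses CuPy when a CUDA device is available and falls back to NumPy otherwise.

```python
import numpy as np

try:
    import cupy as xp
    xp.zeros(1)  # fails early when CuPy is installed but no CUDA device works
except Exception:
    xp = np      # CPU fallback with the same array API


def accuracy(expr, features_t, labels):
    """Evaluate expr on every row at once and score it as classification accuracy."""
    # Each argument is one feature: a 1-D array holding that feature for all rows.
    columns = (features_t[i] for i in range(features_t.shape[0]))
    raw = expr(*columns)
    # Assumption: a positive raw output is read as class 1, anything else as class 0.
    preds = (raw > 0).astype(labels.dtype)
    return float((preds == labels).mean())


# Synthetic stand-in for the survey data: 1,000 rows, 3 features.
rng = np.random.default_rng(0)
X_host = rng.random((1000, 3))
y_host = (X_host[:, 0] + X_host[:, 1] > 1.0).astype(np.int32)

# Move the data to the device once and transpose to (n_features, n_rows).
X_t = xp.ascontiguousarray(xp.asarray(X_host).T)
y = xp.asarray(y_host)

# Stand-in for a compiled GP tree; a real one would come from DEAP's gp.compile.
candidate = lambda f0, f1, f2: xp.add(f0, f1) - 1.0
score = accuracy(candidate, X_t, y)
```

Because the whole column is passed in as one array, every primitive in the tree runs once per individual rather than once per row.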

Overcoming Python's Nesting Limits (Bloat Control)

Genetic trees naturally "bloat," creating equations so deeply nested they exceed Python's 200-parenthesis limit and crash the parser.

Solution: I implemented a dual-limit strategy using staticLimit decorators to cap Tree Height (17) and Total Node Count (150), ensuring model stability without sacrificing complexity.
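The dual-limit idea can be sketched without DEAP as shown below. In DEAP itself the same effect comes from decorating the mate and mutate operators with gp.staticLimit(key=..., max_value=...). Here the nested-tuple tree encoding, the static_limit decorator, and the toy grow operator are simplified stand-ins chosen for illustration, not the repository's code.

```python
import random
from functools import wraps

MAX_HEIGHT = 17  # limits taken from the text above
MAX_SIZE = 150


def height(tree):
    """Depth of a nested-tuple tree; a bare terminal has height 0."""
    if not isinstance(tree, tuple):
        return 0
    return 1 + max(height(child) for child in tree[1:])


def size(tree):
    """Total node count (primitives plus terminals)."""
    if not isinstance(tree, tuple):
        return 1
    return 1 + sum(size(child) for child in tree[1:])


def static_limit(key, max_value):
    """Decorator: replace any child that exceeds the limit with one of its parents."""
    def decorator(operator_fn):
        @wraps(operator_fn)
        def wrapper(*parents):
            children = operator_fn(*parents)
            return tuple(
                child if key(child) <= max_value else random.choice(parents)
                for child in children
            )
        return wrapper
    return decorator


@static_limit(key=height, max_value=MAX_HEIGHT)
@static_limit(key=size, max_value=MAX_SIZE)
def grow(parent):
    """Toy mutation that always wraps the parent in one more 'neg' node."""
    return (("neg", parent),)


tree = "x0"
for _ in range(100):
    (tree,) = grow(tree)  # growth stops once a limit would be exceeded
```

After the loop the tree sits at the height cap: every further attempt to grow it is rejected and the parent is kept instead.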

JIT Compilation in Restricted Sandboxes

DEAP's compile function runs in a restricted environment that initially prevented CuPy from accessing the system imports needed for Just-In-Time (JIT) GPU kernel compilation.

Solution: I extended pset.context to include builtins, granting CuPy the access it needs to compile the evolved logic directly on the GPU hardware.

📈 Performance & Results

The model was evolved over 70 generations with a population of 1,000 individuals. Fitness improved substantially from the initial random population to the final optimized classifier:

Generation 0: Started with an average fitness of ~52%.

Generation 70: Converged to a peak accuracy of 73.97%.

Training Dataset: 95,804 samples.

Best Fitness (Accuracy): 73.97% on the training set.

Model Strategy: Elitism-based evolutionary strategy (ea_with_elitism), preserving the top 2 individuals every generation to prevent regression.

Final Output: Classified 95,804 test samples into binary outputs (0 or 1) based on the best-evolved symbolic expression.
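Why elitism prevents regression can be seen in a toy loop like the one below. It is not the repository's ea_with_elitism, whose signature is not shown here: the function name evolve_with_elitism, the float-valued individuals, and the tournament size of 3 are assumptions chosen to keep the sketch short.

```python
import random


def evolve_with_elitism(population, fitness, mutate, n_elite=2, n_gen=20, rng=None):
    """Toy generational loop in which the n_elite fittest individuals survive unchanged."""
    rng = rng or random.Random()
    best_per_gen = []
    for _ in range(n_gen):
        ranked = sorted(population, key=fitness, reverse=True)
        best_per_gen.append(fitness(ranked[0]))
        elites = ranked[:n_elite]  # copied into the next generation untouched
        # Fill the remaining slots with mutated tournament winners.
        offspring = []
        while len(offspring) < len(population) - n_elite:
            contenders = rng.sample(population, 3)
            winner = max(contenders, key=fitness)
            offspring.append(mutate(winner, rng))
        population = elites + offspring
    best_per_gen.append(max(fitness(ind) for ind in population))
    return population, best_per_gen


rng = random.Random(42)
start = [rng.uniform(-10, 10) for _ in range(30)]
fitness = lambda x: -(x - 3.0) ** 2            # maximum at x = 3
mutate = lambda x, rng: x + rng.gauss(0, 0.5)  # small Gaussian perturbation

final_pop, history = evolve_with_elitism(start, fitness, mutate, n_elite=2, n_gen=20, rng=rng)
```

Since the best individual is always carried over unmodified and the fitness function is deterministic, the recorded best fitness can only stay level or improve from one generation to the next.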

🛠️ Technical Stack

Language: Python 3.9

Evolutionary Logic: DEAP (Distributed Evolutionary Algorithms in Python)

GPU Acceleration: CuPy (optimized for CUDA)

Data Science: Pandas, NumPy

📂 Repository Contents

evolution_analysis: The primary research notebook containing the vectorized evaluation logic and evolutionary logs.

gp_classifier_gpu.py: Clean, modularized script for production runs.

submission.csv: Final classifications for the 95,804 test samples.

🧠 Optimization Insight: The "Sandbox" Fix

Genetic Programming in DEAP compiles trees using eval(), which runs in a restricted context. To enable CuPy's Just-In-Time (JIT) compilation of GPU kernels, the compilation context was modified to explicitly include builtins, allowing the GPU to dynamically compile the mathematical expressions evolved by the algorithm.
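The mechanism behind this fix can be reproduced with plain eval(), as in the sketch below. It assumes the restricted context starts out as {"__builtins__": None}, which is my understanding of DEAP's default pset.context. The expression string is a stand-in for a compiled tree, and no CuPy kernel is involved.

```python
import builtins

# Stand-in for an evolved expression that needs built-in names to run.
expr = "abs(-3) + len('gp')"

# Restricted context: built-ins are unreachable from inside eval().
restricted = {"__builtins__": None}
try:
    eval(expr, restricted, {})
    restricted_ok = True
except Exception:
    # The exact exception type depends on the interpreter version.
    restricted_ok = False

# Permissive context: built-ins are explicitly restored.
permissive = {"__builtins__": builtins}
result = eval(expr, permissive, {})
```

With DEAP the analogous change is a single assignment to the "__builtins__" key of pset.context before calling the compile function. Note that this also lifts the sandboxing for every evolved expression, so it is only appropriate when the primitive set is trusted.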

Conclusion:

Interpretable AI: The project evolves "white-box" symbolic expressions whose reasoning can be inspected directly, a core MedTech capability and a critical requirement for regulated healthcare sectors in Ireland.

Scalable MLOps: The project bridges the gap between complex evolutionary algorithms and modern hardware acceleration to handle large-scale data challenges.
