Medical Image Classification Framework with Synthetic Data Integration

This repository contains the evaluation framework and classification pipeline developed for my Master's Thesis at TU Ilmenau. The project focuses on assessing the downstream classification performance of an EfficientNet-B0 classifier when augmented with synthetic medical images generated via fine-tuned Diffusion Models (Stable Diffusion + LoRA).

The core objective is to analyze the statistical utility of generative augmentation under different dataset blending strategies (proportional ratio scaling vs. rigid minority class balancing).
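To make the two blending strategies concrete, the sketch below computes how many synthetic images each class would receive under either one. It is an illustration under stated assumptions, not the thesis implementation: the function names, the `ratio` parameter, the reading of "proportional" as a fixed fraction of each class's real count, and the toy class sizes are all hypothetical.

```python
import math
from collections import Counter


def proportional_counts(real_counts, ratio):
    """Synthetic images per class as a fixed fraction of the real count (assumed definition)."""
    return {cls: math.floor(n * ratio) for cls, n in real_counts.items()}


def balancing_counts(real_counts):
    """Synthetic images needed to bring every class up to the largest real class."""
    target = max(real_counts.values())
    return {cls: target - n for cls, n in real_counts.items()}


# Toy label list; class names and sizes are made up for illustration.
labels = ["NV"] * 100 + ["MEL"] * 40 + ["BCC"] * 10
real_counts = Counter(labels)

proportional = proportional_counts(real_counts, ratio=0.5)
balanced = balancing_counts(real_counts)
print(proportional)
print(balanced)
```

Under proportional scaling the class imbalance of the real data is preserved, whereas minority balancing adds synthetic images only to the classes that are smaller than the largest one.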

📌 Core Architecture & Methodology

The classification pipeline is designed to prevent the overoptimistic evaluation metrics that data leakage commonly produces in medical imaging studies.

Key Engineering Features:

Strict Lesion-Level Splitting: To avoid data leakage, all images of the same skin lesion (from the BCN20000 dataset) are assigned to a single fold, so no lesion appears in both the training and the validation data of any split.
5-Fold Cross-Validation: Stratified 5-fold cross-validation runs programmatically, giving performance estimates that are more stable than those from a single train/validation split.
Fold-Specific Synthetic Ingestion: Each training fold ingests only the synthetic image subset mapped to it, and the real validation subsets never receive synthetic images.
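The fold-specific ingestion described above can be sketched as follows. This is a minimal illustration, not the repository's code: `assemble_fold`, the sample dictionaries, the `synthetic` flag, and the convention that `synthetic_by_fold` is keyed by the index of the held-out fold are all assumptions.

```python
def assemble_fold(real_samples, fold_of, val_fold, synthetic_by_fold):
    """Build (train, val) for one cross-validation round.

    real_samples      : list of dicts with at least an "image_id" key
    fold_of           : dict mapping image_id -> fold index
    val_fold          : index of the fold held out for validation
    synthetic_by_fold : dict mapping held-out fold index -> synthetic samples
                        prepared for that round (hypothetical layout)
    """
    train = [s for s in real_samples if fold_of[s["image_id"]] != val_fold]
    val = [s for s in real_samples if fold_of[s["image_id"]] == val_fold]

    # Synthetic images are appended to the training data only.
    train = train + list(synthetic_by_fold.get(val_fold, []))

    # Guard: the validation subset must stay purely real.
    if any(s.get("synthetic", False) for s in val):
        raise ValueError("validation subset must contain real images only")
    return train, val


# Toy data: 10 real images spread over 5 folds, 3 synthetic images per round.
real_samples = [{"image_id": f"real_{i}", "label": "NV"} for i in range(10)]
fold_of = {s["image_id"]: i % 5 for i, s in enumerate(real_samples)}
synthetic_by_fold = {
    k: [{"image_id": f"syn_f{k}_{j}", "label": "MEL", "synthetic": True}
        for j in range(3)]
    for k in range(5)
}

train, val = assemble_fold(real_samples, fold_of, val_fold=0,
                           synthetic_by_fold=synthetic_by_fold)
print(len(train), len(val))
```

Keying the synthetic subsets by round makes it possible to use only images from a generator that never saw the held-out fold, assuming the generator is fine-tuned per fold.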

🛠 Repository Structure

The framework follows a modular layout. The top level holds the entry point, the configuration, and the evaluation notebook, while core functionality is grouped into specialized directories:

main.py: The primary execution entry point. Orchestrates the 5-fold cross-validation pipeline, triggers training loops, and logs execution stages.
config.py: Centralized configuration management. Controls structural constants, hyperparameter definitions (learning rates, batch allocations, optimizer steps), and directory path mappings.
eval.ipynb: Downstream analytics workspace. Used for performance inspection, computing confusion matrices, and generating aggregate metric plots across completed experiment runs.
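As an illustration of the kind of analysis the notebook performs, the sketch below builds a confusion matrix per fold and aggregates a metric across folds. Balanced accuracy is chosen here as an example metric and is an assumption, as are the function names and the toy predictions.

```python
from statistics import mean, stdev


def confusion_matrix(y_true, y_pred, classes):
    """Rows are true classes, columns are predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def balanced_accuracy(matrix):
    """Mean per-class recall, ignoring classes with no true samples."""
    recalls = [row[i] / sum(row) for i, row in enumerate(matrix) if sum(row) > 0]
    return mean(recalls)


classes = ["MEL", "NV"]
# Toy per-fold results: (true labels, predicted labels).
fold_results = [
    (["MEL", "MEL", "NV", "NV"], ["MEL", "NV", "NV", "NV"]),
    (["MEL", "MEL", "NV", "NV"], ["MEL", "MEL", "NV", "NV"]),
]

matrices = [confusion_matrix(t, p, classes) for t, p in fold_results]
scores = [balanced_accuracy(m) for m in matrices]
print(f"balanced accuracy: {mean(scores):.3f} +/- {stdev(scores):.3f}")
```

Reporting the mean together with the spread across folds shows whether a difference between blending strategies is larger than the fold-to-fold variation.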

📂 Core Submodules:

`data/` — Data Engineering & Pipeline Management

datamodule.py: Encapsulates high-level, PyTorch Lightning-style data handling and provides the inputs for both training and test evaluation.
dataset.py: Low-level PyTorch Dataset subclass for BCN20000 images. Handles raw image I/O, label mapping, and metadata column parsing.
splits.py: Isolation layer. Implements the lesion-level dataset partition and ensures that no lesion ID appears in both the training and the validation subset of a fold, which prevents leakage between them.
transforms.py: Houses the image transforms: data augmentation (e.g., color jitter, geometric rotations) and normalization, used to increase the variety of the training images.
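The lesion-level partition described for splits.py can be sketched with the standard library alone. This is a simplified stand-in, not the repository's code: it assumes each lesion carries a single label, and the function name, record layout, and toy data are hypothetical.

```python
import random
from collections import defaultdict


def lesion_level_folds(records, n_folds=5, seed=0):
    """Assign every image to a fold so that all images of a lesion share one fold.

    records: list of (image_id, lesion_id, label) tuples.
    Returns a dict mapping image_id -> fold index.
    """
    # One label per lesion (assumption: labels are consistent within a lesion).
    lesion_label = {}
    for _, lesion_id, label in records:
        lesion_label.setdefault(lesion_id, label)

    by_class = defaultdict(list)
    for lesion_id, label in lesion_label.items():
        by_class[label].append(lesion_id)

    # Shuffle lesions within each class, then deal them out round-robin so
    # every fold receives a similar number of lesions per class.
    rng = random.Random(seed)
    lesion_fold = {}
    offset = 0
    for label in sorted(by_class):
        lesions = sorted(by_class[label])
        rng.shuffle(lesions)
        for i, lesion_id in enumerate(lesions):
            lesion_fold[lesion_id] = (i + offset) % n_folds
        offset += len(lesions)

    return {image_id: lesion_fold[lesion_id] for image_id, lesion_id, _ in records}


# Toy data: 20 lesions with 1 to 3 images each, every fourth lesion is "MEL".
records = [
    (f"img_{i}_{k}", f"lesion_{i}", "MEL" if i % 4 == 0 else "NV")
    for i in range(20)
    for k in range(1 + i % 3)
]
fold_of = lesion_level_folds(records)

# Leakage check: no lesion may appear in more than one fold.
folds_per_lesion = defaultdict(set)
for image_id, lesion_id, _ in records:
    folds_per_lesion[lesion_id].add(fold_of[image_id])
assert all(len(folds) == 1 for folds in folds_per_lesion.values())
```

The sketch stratifies by lesion count rather than image count. scikit-learn's `StratifiedGroupKFold` offers a library alternative with the same grouping guarantee.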

`utils/` — System Hooks & Reproducibility

seed.py: Centralized seeding hook. Seeds the global random states of numpy, random, and torch so that runs are repeatable. Seeding alone does not guarantee bit-identical results across hardware or library versions.
plots.py: Renders learning curves and loss trajectories and exports performance plots as static vector graphics.
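A seeding hook of the kind described for seed.py might look like the sketch below. The function name and the default seed are assumptions, and the cuDNN flags are an addition that the description above does not mention. The optional imports only keep the sketch runnable where numpy or torch is not installed.

```python
import random


def seed_everything(seed: int = 42) -> None:
    """Seed the Python, NumPy, and PyTorch random number generators."""
    random.seed(seed)

    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed.

    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # silently ignored without CUDA
        # Assumption: trade some speed for deterministic cuDNN kernels.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass  # PyTorch not installed; nothing to seed.


seed_everything(123)
first = [random.random() for _ in range(3)]
seed_everything(123)
second = [random.random() for _ in range(3)]
print(first == second)
```

Calling the hook once at the start of every run, before data splitting and model construction, keeps fold assignment and weight initialization repeatable on the same machine.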

`models/` — Neural Network Topologies

Wraps the EfficientNet-B0 architecture, manages parameter freezing, and adapts the final fully connected classification head to the number of target diagnostic classes.

Execution & Training Setup

The training routine is largely automated. Set your paths and hyperparameters in config.py before running the framework.
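For orientation, a configuration module of this kind could be organized as in the sketch below. Only the fold count of 5 comes from this README. Every field name, path, and default value is a hypothetical placeholder and not the content of the actual config.py.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Config:
    # Placeholder paths; adjust to the local dataset layout.
    data_root: Path = Path("data/bcn20000")
    synthetic_root: Path = Path("data/synthetic")
    # Cross-validation setup (5 folds, as stated in this README).
    n_folds: int = 5
    seed: int = 42
    # Placeholder hyperparameters.
    batch_size: int = 32
    learning_rate: float = 1e-4
    max_epochs: int = 30
    # Hypothetical blend setting: synthetic images per real image.
    synthetic_ratio: float = 0.5

    def validate(self) -> None:
        """Fail early on values that would break a training run."""
        if self.n_folds < 2:
            raise ValueError("n_folds must be at least 2")
        if self.batch_size < 1:
            raise ValueError("batch_size must be positive")
        if self.learning_rate <= 0:
            raise ValueError("learning_rate must be positive")
        if self.synthetic_ratio < 0:
            raise ValueError("synthetic_ratio must not be negative")


config = Config()
config.validate()
print(config.n_folds, config.learning_rate)
```

Validating the configuration before the first fold starts avoids discovering a bad value partway through a multi-fold run.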

Running the pipeline:

To train the classifier across the pre-defined folds with custom synthetic blend ratios:

python main.py

📜 Academic Context & Data Availability

This project represents the core classifier validation engine of a Master's Thesis completed at the Technische Universität Ilmenau.

Data Access: Source dermoscopic data originates from the public BCN20000 dataset hosted on the ISIC Archive.
Replicability: The code is generic enough to support other ISIC-formatted classification targets. Synthetic data follows the same ISIC schema conventions.

For professional networking, data science collaboration, or industrial research opportunities in MedTech/AI, let's connect on LinkedIn.
