SnowFox999/classifier_synth

Medical Image Classification Framework with Synthetic Data Integration

This repository contains the evaluation framework and classification pipeline developed for my Master's Thesis at TU Ilmenau. The project focuses on assessing the downstream classification performance of an EfficientNet-B0 classifier when augmented with synthetic medical images generated via fine-tuned Diffusion Models (Stable Diffusion + LoRA).

The core objective is to analyze the statistical utility of generative augmentation under different dataset blending strategies (proportional ratio scaling vs. rigid minority class balancing).
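
The two blending strategies can be illustrated with a minimal sketch. The function names, the class labels, and the reading of "proportional" as a fixed fraction of each class's real count (rounded down) are assumptions made for illustration; they are not taken from the repository code.

```python
from collections import Counter

def synthetic_counts_proportional(real_labels, ratio):
    """Synthetic images per class as a fixed fraction of that class's real count.

    The fractional part is dropped (rounded down).
    """
    counts = Counter(real_labels)
    return {cls: int(n * ratio) for cls, n in counts.items()}

def synthetic_counts_balancing(real_labels):
    """Synthetic images per class needed to lift every class to the majority size."""
    counts = Counter(real_labels)
    target = max(counts.values())
    return {cls: target - n for cls, n in counts.items()}

# Hypothetical toy label list: 6 nevus, 3 melanoma, 1 bcc.
real = ["nevus"] * 6 + ["melanoma"] * 3 + ["bcc"] * 1
proportional = synthetic_counts_proportional(real, 0.5)
balanced = synthetic_counts_balancing(real)
```

Under these assumptions, proportional scaling preserves the original class imbalance, while minority balancing adds synthetic images only to the under-represented classes.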


📌 Core Architecture & Methodology

The classification pipeline is designed to prevent the overoptimistic evaluation metrics that data leakage commonly causes in medical imaging applications.

Key Engineering Features:

  • Strict Lesion-Level Splitting: To avoid data leakage, all images belonging to the same skin lesion (from the BCN20000 dataset) are kept together in one fold, so a lesion never appears on both the training and the validation side of a split.
  • 5-Fold Cross-Validation: Runs stratified 5-fold cross-validation programmatically to give stable and reproducible performance estimates.
  • Fold-Specific Synthetic Ingestion: Each training fold ingests only the synthetic image subset mapped to it, while the real validation subsets stay free of synthetic images.
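
The lesion-level splitting described above could be sketched as follows. This is a standard-library approximation, not the logic in splits.py: the function name, the greedy stratification heuristic, and the toy lesion IDs and labels are all assumptions, and every image of a lesion is assumed to carry the same label.

```python
from collections import defaultdict

def lesion_level_folds(lesion_ids, labels, n_folds=5):
    """Assign each image to a fold so that all images of a lesion share one fold.

    Greedy approximation of stratification: lesions are processed largest
    first, and each goes to the fold holding the fewest images of its class.
    """
    images_by_lesion = defaultdict(list)
    label_of_lesion = {}
    for idx, (lesion, label) in enumerate(zip(lesion_ids, labels)):
        images_by_lesion[lesion].append(idx)
        label_of_lesion[lesion] = label

    class_load = [defaultdict(int) for _ in range(n_folds)]
    fold_of_image = [None] * len(lesion_ids)
    ordered = sorted(images_by_lesion,
                     key=lambda l: (-len(images_by_lesion[l]), str(l)))
    for lesion in ordered:
        label = label_of_lesion[lesion]
        # Prefer the fold with the fewest images of this class, then the
        # smallest fold overall, then the lowest index (deterministic ties).
        fold = min(range(n_folds),
                   key=lambda f: (class_load[f][label],
                                  sum(class_load[f].values()), f))
        for idx in images_by_lesion[lesion]:
            fold_of_image[idx] = fold
        class_load[fold][label] += len(images_by_lesion[lesion])
    return fold_of_image

# Hypothetical toy data: eight images from six lesions.
lesions = ["L1", "L1", "L2", "L3", "L3", "L4", "L5", "L6"]
labels = ["mel", "mel", "nev", "nev", "nev", "mel", "nev", "mel"]
folds = lesion_level_folds(lesions, labels, n_folds=2)
```

A library routine such as scikit-learn's StratifiedGroupKFold serves the same purpose; the greedy version here only keeps the example dependency-free.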

🛠 Repository Structure

The framework follows a modular layout: a few top-level files plus directories that group related functionality.

  • main.py: The primary execution entry point. Orchestrates the 5-fold cross-validation pipeline, triggers training loops, and logs execution stages.
  • config.py: Centralized configuration management. Controls structural constants, hyperparameter definitions (learning rates, batch allocations, optimizer steps), and directory path mappings.
  • eval.ipynb: Downstream analytics workspace. Used for performance inspection, computing confusion matrices, and generating aggregate metric plots across completed experiment runs.
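
As an illustration of the kind of analysis eval.ipynb is described as performing, the sketch below builds a confusion matrix and derives one aggregate metric from it. The helper names, the choice of balanced accuracy, and the toy labels are assumptions; the notebook's actual metrics are not documented here.

```python
def confusion_matrix(y_true, y_pred, classes):
    """Rows are true classes, columns are predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

def balanced_accuracy(matrix):
    """Mean per-class recall, skipping classes with no true samples."""
    recalls = [row[i] / sum(row) for i, row in enumerate(matrix) if sum(row) > 0]
    return sum(recalls) / len(recalls)

# Hypothetical predictions for one validation fold.
y_true = ["mel", "mel", "nev", "nev"]
y_pred = ["mel", "nev", "nev", "nev"]
cm = confusion_matrix(y_true, y_pred, classes=["mel", "nev"])
score = balanced_accuracy(cm)
```

Averaging such a per-fold score over the five folds gives a single cross-validated estimate per experiment.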

📂 Core Submodules:

data/ — Data Engineering & Pipeline Management

  • datamodule.py: Encapsulates high-level PyTorch Lightning-style or custom dataset piping, orchestrating inputs for both training routines and test evaluations.
  • dataset.py: Low-level PyTorch Dataset subclass tailored to BCN20000 images. Handles raw image I/O, label mapping, and metadata column parsing.
  • splits.py: Crucial Isolation Layer. Houses the logic responsible for the lesion-level dataset partition. Guarantees that no patient-specific lesion IDs overlap between training folds and validation subsets, completely preventing target data leakage.
  • transforms.py: Houses pixel-level data augmentation techniques (e.g., color jitters, geometric rotations, spatial normalizations) used to artificially expand the original training distributions.
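
How a data module might combine the real split with fold-specific synthetic images can be sketched as below. The function name, the record format (plain strings), and the dictionary keyed by validation-fold index are assumptions for illustration and may not match datamodule.py.

```python
def build_fold_datasets(real_items, fold_of_item, synthetic_by_fold, val_fold):
    """Return (train_items, val_items) for one cross-validation fold.

    real_items: real image records, e.g. file paths.
    fold_of_item: fold index of each real item, in the same order.
    synthetic_by_fold: maps a validation-fold index to the synthetic records
        prepared for that fold's training split.
    """
    train = [x for x, f in zip(real_items, fold_of_item) if f != val_fold]
    val = [x for x, f in zip(real_items, fold_of_item) if f == val_fold]
    # Synthetic images are appended to the training side only.
    train += synthetic_by_fold.get(val_fold, [])
    return train, val

# Hypothetical records: four real images in two folds, plus synthetic subsets.
real = ["r0", "r1", "r2", "r3"]
fold_ids = [0, 1, 0, 1]
synthetic = {0: ["s0_a"], 1: ["s1_a", "s1_b"]}
train_items, val_items = build_fold_datasets(real, fold_ids, synthetic, val_fold=0)
```

Keeping the synthetic subsets keyed by fold makes it harder to accidentally train on images that were generated from a fold's validation lesions.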

utils/ — System Hooks & Reproducibility

  • seed.py: Centralized seeding hook. Seeds the global random states of numpy, random, and torch so that runs are reproducible.
  • plots.py: Automation utility script for rendering training trajectories, learning curves, loss convergence tracking, and exporting performance evaluations into static vector graphics.
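
A seeding hook of the kind described for seed.py could look like the following minimal sketch. The function name and the cuDNN settings are assumptions, and numpy and torch are seeded only if they are installed. Seeding alone does not make every GPU operation deterministic.

```python
import os
import random

def seed_everything(seed: int) -> None:
    """Seed Python, and numpy/torch when available, with the same value."""
    random.seed(seed)
    # Only affects hash randomization in subprocesses started afterwards.
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # no-op when CUDA is unavailable
        # Trade speed for repeatable cuDNN kernel selection.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass

seed_everything(42)
first = random.random()
seed_everything(42)
second = random.random()  # same value as `first`
```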

models/ — Neural Network Topologies

  • Wraps the EfficientNet-B0 architecture, manages parameter freezing, and adapts the final fully connected classification head to the number of target diagnostic classes.

Execution & Training Setup

The training routine is heavily automated. Make sure your paths and hyperparameters are set correctly in config.py before running the framework.

Running the pipeline:

To train the classifier across the pre-defined folds with custom synthetic blend ratios:

python main.py

📜 Academic Context & Data Availability

This project represents the core classifier validation engine of a Master's Thesis completed at the Technische Universität Ilmenau.

  • Data Access: Source dermoscopic data originates from the public BCN20000 dataset hosted on the ISIC Archive.
  • Replicability: The code is generic enough to support other ISIC-formatted classification targets. Synthetic data integration follows standard ISIC schema conventions.

For professional networking, data science collaboration, or industrial research opportunities in MedTech/AI, let's connect on LinkedIn.
