Raafat5566/data-imbalance-regression

Data Imbalance Regression

A clean and portfolio-ready implementation of techniques for imbalanced regression in time-series settings.

Project Purpose

This project studies a practical machine learning problem: how to improve regression performance in rare but high-impact scenarios without hurting performance in common scenarios.

In many real-world sensor and time-series systems, the target distribution is heavily imbalanced. Most samples come from normal operating conditions, while extreme cases are scarce. Standard training often prioritizes frequent regions and underperforms exactly where precision is most important.

The purpose of this repository is to provide a clean, reusable, and public-safe reference implementation of imbalance-aware regression strategies that can be adapted to many continuous prediction tasks.

Core goals:

  • Improve learning in underrepresented target regions
  • Preserve stability in high-frequency regions
  • Keep methods modular, testable, and reproducible
  • Provide practical examples without exposing private data

Methods Overview

This repository brings together complementary methods that address imbalance at different stages of the learning pipeline.

1. LDS (Label Distribution Smoothing)

What it does:

  • Smooths the empirical target distribution using a local kernel (for example, Gaussian)
  • Converts smoothed density into sample weights

How it works:

  • Each target value is mapped to a bin
  • Bin counts are smoothed across neighbors
  • Samples from sparse bins receive higher weights during training
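
The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code in `lds.py`: the function name `lds_weights`, the equal-width binning, the Gaussian kernel parameters, and the mean-1 normalization are all assumptions made for the example.

```python
import math

def lds_weights(targets, num_bins=10, sigma=1.0, kernel_half_width=2):
    """Return one training weight per sample (hypothetical helper)."""
    lo, hi = min(targets), max(targets)
    width = (hi - lo) / num_bins or 1.0  # guard against a constant target

    # Step 1: map each target value to a bin index in [0, num_bins - 1].
    bins = [min(int((t - lo) / width), num_bins - 1) for t in targets]
    counts = [0] * num_bins
    for b in bins:
        counts[b] += 1

    # Step 2: smooth the bin counts across neighbors with a Gaussian kernel.
    offsets = range(-kernel_half_width, kernel_half_width + 1)
    kernel = [math.exp(-(k * k) / (2 * sigma * sigma)) for k in offsets]
    total = sum(kernel)
    kernel = [k / total for k in kernel]
    smoothed = []
    for i in range(num_bins):
        s = 0.0
        for k, w in zip(offsets, kernel):
            j = i + k
            if 0 <= j < num_bins:
                s += w * counts[j]
        smoothed.append(s)

    # Step 3: weight each sample by the inverse smoothed density of its bin,
    # then rescale so the weights average to 1.
    raw = [1.0 / smoothed[b] for b in bins]
    mean_raw = sum(raw) / len(raw)
    return [r / mean_raw for r in raw]

# Nine samples near 1.0 and a single extreme sample at 10.0.
weights = lds_weights([1.0] * 9 + [10.0])
```

In this toy input the lone extreme sample receives a noticeably larger weight than any of the nine common samples, which is the intended effect.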

Why it helps:

  • Reduces the dominance of frequent labels
  • Gives rare target ranges stronger gradient contribution
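
To make the gradient claim concrete, here is a small sketch of a weighted squared-error loss and its gradient with respect to the predictions. The function names are hypothetical and are not taken from `losses.py`; the point is only that a sample's weight scales its gradient directly.

```python
def weighted_mse(preds, targets, weights):
    """Mean of per-sample squared errors, each scaled by its weight."""
    n = len(preds)
    return sum(w * (p - t) ** 2 for p, t, w in zip(preds, targets, weights)) / n

def weighted_mse_grad(preds, targets, weights):
    """Gradient of weighted_mse with respect to each prediction."""
    n = len(preds)
    return [2.0 * w * (p - t) / n for p, t, w in zip(preds, targets, weights)]

# Two samples with the same error but different weights:
# the up-weighted (rare) sample contributes four times the gradient.
grads = weighted_mse_grad([1.0, 1.0], [0.0, 0.0], [0.5, 2.0])
```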

2. FDS (Feature Distribution Smoothing)

What it does:

  • Regularizes intermediate feature representations across target bins
  • Aligns statistics (mean/variance) between neighboring regions

How it works:

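  • Tracks running feature statistics (mean and variance) per target bin
  • Smooths those statistics across neighboring bins with a kernel, refreshing the running estimates as training progresses
  • Calibrates features using the smoothed statistics during training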
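
A rough sketch of this mechanism in plain Python follows. It is not the implementation in `fds.py`: the class name `SimpleFDS`, the momentum update, the per-dimension (diagonal) variance, and the choice to smooth only over bins that have already been observed are assumptions made to keep the example short.

```python
import math

class SimpleFDS:
    """Toy FDS: per-bin running feature stats, smoothed across nearby bins."""

    def __init__(self, num_bins, feature_dim, momentum=0.9, sigma=1.0, half_width=2):
        self.num_bins = num_bins
        self.momentum = momentum
        self.mean = [[0.0] * feature_dim for _ in range(num_bins)]
        self.var = [[1.0] * feature_dim for _ in range(num_bins)]
        self.seen = [False] * num_bins
        offsets = range(-half_width, half_width + 1)
        self.kernel = {k: math.exp(-(k * k) / (2 * sigma ** 2)) for k in offsets}

    def update(self, features, bins):
        """Fold a batch of (feature vector, bin index) pairs into the running stats."""
        for b in set(bins):
            rows = [f for f, fb in zip(features, bins) if fb == b]
            dim = len(rows[0])
            mean = [sum(r[d] for r in rows) / len(rows) for d in range(dim)]
            var = [sum((r[d] - mean[d]) ** 2 for r in rows) / len(rows) for d in range(dim)]
            m = self.momentum if self.seen[b] else 0.0  # first visit: take batch stats as-is
            self.mean[b] = [m * o + (1 - m) * n for o, n in zip(self.mean[b], mean)]
            self.var[b] = [m * o + (1 - m) * n for o, n in zip(self.var[b], var)]
            self.seen[b] = True

    def _smoothed(self, stats, b):
        """Kernel-weighted average of a statistic over observed bins near b."""
        near = [(b + k, w) for k, w in self.kernel.items()
                if 0 <= b + k < self.num_bins and self.seen[b + k]]
        total = sum(w for _, w in near)
        return [sum(w * stats[j][d] for j, w in near) / total for d in range(len(stats[b]))]

    def calibrate(self, feature, b, eps=1e-6):
        """Whiten with the bin's own stats, then re-color with the smoothed stats."""
        if not self.seen[b]:
            return list(feature)  # no statistics yet: leave the feature unchanged
        s_mean = self._smoothed(self.mean, b)
        s_var = self._smoothed(self.var, b)
        return [(x - mu) / math.sqrt(v + eps) * math.sqrt(sv + eps) + smu
                for x, mu, v, sv, smu in zip(feature, self.mean[b], self.var[b], s_var, s_mean)]

fds = SimpleFDS(num_bins=3, feature_dim=1)
fds.update([[0.0], [2.0], [10.0], [12.0]], [0, 0, 1, 1])
calibrated = fds.calibrate([1.0], 0)  # pulled toward the neighboring bin's mean
```

In the usage lines, a feature sitting at the mean of bin 0 is shifted part of the way toward the mean of bin 1, which is how sparse bins borrow statistics from better-populated neighbors.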

Why it helps:

  • Prevents unstable representations in sparse regions
  • Improves continuity and robustness across the target space

Figure: LDS and FDS model flow (pipeline illustration)

3. HCA (Hierarchical Coarse-to-Fine Adjustment)

What it does:

  • Reformulates a single difficult regression objective into multiple hierarchical prediction levels

How it works:

  • Learns coarse structure first (broad target regions)
  • Adds finer-grained predictions for precision
  • Combines hierarchical supervision to stabilize learning
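
One plausible reading of the coarse-to-fine idea is sketched below. It is an assumption-laden illustration rather than the repository's method: the level sizes `(2, 4, 8)`, the function names, and the rule of multiplying each fine bin's probability by the probabilities of its coarser parent bins are all chosen for the example.

```python
def hierarchical_labels(y, lo, hi, levels=(2, 4, 8)):
    """Class index of target y at each quantization level, coarse to fine."""
    labels = []
    for n in levels:
        idx = int((y - lo) / (hi - lo) * n)
        labels.append(min(max(idx, 0), n - 1))  # clamp values at or beyond the range
    return labels

def coarse_to_fine_predict(level_probs, lo, hi):
    """Combine per-level class probabilities and decode a continuous value.

    level_probs holds one probability list per level, coarse to fine.
    Each level's bin count is assumed to divide the finest level's count.
    """
    finest = len(level_probs[-1])
    scores = []
    for i in range(finest):
        score = 1.0
        for probs in level_probs:
            parent = i * len(probs) // finest  # bin at this level containing fine bin i
            score *= probs[parent]
        scores.append(score)
    best = max(range(finest), key=scores.__getitem__)
    width = (hi - lo) / finest
    return lo + (best + 0.5) * width  # center of the selected fine bin

labels = hierarchical_labels(7.0, 0.0, 10.0)  # [1, 2, 5]

# The fine level alone prefers the last bin, but the coarse level is
# confident the target lies in the lower half, so the decision moves.
pred = coarse_to_fine_predict([[0.9, 0.1], [0.1, 0.2, 0.3, 0.4]], 0.0, 8.0)  # 3.0
```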

Why it helps:

  • Improves optimization under imbalance
  • Balances global structure and local precision

Figure: HCA coarse-to-fine flow (architecture illustration)

3.5 Transformer Baseline

What it does:

  • Uses sequence self-attention to aggregate temporal information across timesteps

How it works:

  • Projects each timestep into an embedding space
  • Adds positional information to preserve order
  • Passes sequence through stacked transformer encoder layers
  • Uses a final regression head to predict the continuous target
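
The four steps can be traced in a deliberately stripped-down, dependency-free sketch. It assumes sinusoidal positional encodings, a single attention head with identity query/key/value projections, one residual connection, and mean pooling before the head; a real encoder layer would add learned projections, multiple heads, layer normalization, and a feed-forward block. All names here are hypothetical.

```python
import math

def positional_encoding(seq_len, dim):
    """Sinusoidal position vectors, one per timestep."""
    return [[math.sin(t / 10000 ** (i / dim)) if i % 2 == 0
             else math.cos(t / 10000 ** ((i - 1) / dim))
             for i in range(dim)]
            for t in range(seq_len)]

def self_attention(X):
    """Single-head scaled dot-product attention with identity projections."""
    d = len(X[0])
    out = []
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        m = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        attn = [e / z for e in exps]
        out.append([sum(a * v[j] for a, v in zip(attn, X)) for j in range(d)])
    return out

def transformer_regress(sequence, W_embed, w_head, bias=0.0):
    """Embed, add positions, attend, pool over time, apply a linear head."""
    embed = [[sum(w * s for w, s in zip(row, step)) for row in W_embed]
             for step in sequence]
    pe = positional_encoding(len(embed), len(embed[0]))
    X = [[e + p for e, p in zip(er, pr)] for er, pr in zip(embed, pe)]
    attended = self_attention(X)
    H = [[x + a for x, a in zip(xr, ar)] for xr, ar in zip(X, attended)]  # residual
    pooled = [sum(col) / len(H) for col in zip(*H)]
    return sum(w * p for w, p in zip(w_head, pooled)) + bias

# Three timesteps with two raw features each, embedded into four dimensions.
seq = [[0.1, 0.2], [0.3, 0.1], [0.5, 0.4]]
W_embed = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2], [0.0, 0.6]]
y_hat = transformer_regress(seq, W_embed, w_head=[0.2, -0.1, 0.3, 0.5])
```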

Why it helps:

  • Captures long-range temporal dependencies
  • Provides a strong sequence modeling baseline for comparison

Figure: Transformer model flow

4. UVOTE (Uncertainty Voting Ensemble)

What it does:

  • Uses multiple expert heads and uncertainty-aware selection

How it works:

  • Trains diverse regressors with different weighting behavior
  • Each head predicts both value and confidence/uncertainty
  • At inference, the system chooses the most reliable expert per sample
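
The selection step can be sketched as follows. The example assumes each head outputs a mean and a log-variance and is trained with a Gaussian negative log-likelihood; the README does not state which likelihood the repository uses, so treat that choice and the function names as illustrative.

```python
import math

def gaussian_nll(y, mean, log_var):
    """Negative log-likelihood of y under N(mean, exp(log_var)), up to a constant."""
    return 0.5 * (log_var + (y - mean) ** 2 / math.exp(log_var))

def select_expert(expert_outputs):
    """Return (prediction, expert index) for the head with the lowest predicted variance.

    expert_outputs holds one (mean, log_var) pair per expert head for a single sample.
    """
    best = min(range(len(expert_outputs)), key=lambda i: expert_outputs[i][1])
    return expert_outputs[best][0], best

# Three heads disagree; the second reports the smallest uncertainty.
prediction, chosen = select_expert([(3.0, 0.5), (7.5, -1.2), (5.0, 0.0)])
```

Training with a likelihood-based loss is what makes the uncertainty meaningful: a head that is confidently wrong is penalized far more than one that admits high variance, so low predicted variance becomes a usable signal at inference.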

Why it helps:

  • Encourages specialization across target regions
  • Improves robustness under distribution skew
  • Produces more reliable predictions in difficult cases

Best Performing Strategy

Across the implemented approaches, the UVOTE-style ensemble was the most effective overall in balancing two competing objectives:

  • Better performance in rare, critical regions
  • No major degradation in common regions

In short, uncertainty-aware expert selection provided the strongest trade-off between robustness and precision for imbalanced regression.
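
A trade-off of this kind is usually checked by reporting error separately for rare and common target regions. The sketch below shows one way to do that; it is not taken from `metrics.py`, and the function name, the equal-width binning, and the `rare_max_count` threshold are assumptions.

```python
def region_mae(y_true, y_pred, train_targets, num_bins=10, rare_max_count=5):
    """Mean absolute error on rare vs. common regions, split by training bin counts."""
    lo, hi = min(train_targets), max(train_targets)
    width = (hi - lo) / num_bins or 1.0  # guard against a constant target

    def bin_of(y):
        return min(max(int((y - lo) / width), 0), num_bins - 1)

    # Count how many training targets fall in each bin.
    counts = [0] * num_bins
    for t in train_targets:
        counts[bin_of(t)] += 1

    # Route each evaluation sample to "rare" or "common" by its true target's bin.
    errors = {"rare": [], "common": []}
    for yt, yp in zip(y_true, y_pred):
        region = "rare" if counts[bin_of(yt)] <= rare_max_count else "common"
        errors[region].append(abs(yt - yp))
    return {k: (sum(v) / len(v) if v else None) for k, v in errors.items()}

train = [1.0] * 20 + [10.0]
report = region_mae([1.0, 1.0, 10.0], [1.5, 0.5, 8.0], train)
# report == {"rare": 2.0, "common": 0.5}
```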

What This Repository Includes

  • Label Distribution Smoothing (LDS)
  • Feature Distribution Smoothing (FDS)
  • Weighted regression losses
  • Sequence dataset utilities
  • Reproducibility helpers
  • Synthetic example training script
  • Lightweight unit tests

Confidentiality and Safety

This version is sanitized for public sharing:

  • No raw input data
  • No generated predictions or submissions
  • No model checkpoints or private artifacts
  • No competition IDs, private paths, or organization-sensitive metadata

See docs/CONFIDENTIALITY.md for details.

Installation

pip install -r requirements.txt
pip install -e .

Quick Start

Run a toy example with synthetic data:

python examples/train_toy.py

Project Structure

data-imbalance-regression/
  src/data_imbalance_regression/
    datasets.py
    fds.py
    lds.py
    losses.py
    metrics.py
    reproducibility.py
  examples/
    train_toy.py
  tests/
    test_losses.py
    test_lds.py
  docs/
    CONFIDENTIALITY.md

Acknowledgments

This work is inspired by prior research on deep imbalanced regression and was adapted into a generalized, confidentiality-safe codebase for educational and portfolio use.

Team

Contributed by:

  • Mohamed Shattat
  • Natchiket
  • Prathhek
  • Vibour

About

Python package for handling data imbalance in regression: resampling, weighting, and benchmarking strategies for skewed continuous targets.
