Data Imbalance Regression

A clean and portfolio-ready implementation of techniques for imbalanced regression in time-series settings.

Project Purpose

This project studies a practical machine learning problem: how to improve regression performance in rare but high-impact scenarios without hurting performance in common scenarios.

In many real-world sensor and time-series systems, the target distribution is heavily imbalanced. Most samples come from normal operating conditions, while extreme cases are scarce. Standard training often prioritizes frequent regions and underperforms exactly where precision is most important.

The purpose of this repository is to provide a clean, reusable, and public-safe reference implementation of imbalance-aware regression strategies that can be adapted to many continuous prediction tasks.

Core goals:

Improve learning in underrepresented target regions
Preserve stability in high-frequency regions
Keep methods modular, testable, and reproducible
Provide practical examples without exposing private data

Methods Overview

This repository brings together complementary methods that address imbalance at different stages of the learning pipeline.

1. LDS (Label Distribution Smoothing)

What it does:

Smooths the empirical target distribution using a local kernel (for example, Gaussian)
Converts smoothed density into sample weights

How it works:

Each target value is mapped to a bin
Bin counts are smoothed across neighbors
Samples from sparse bins receive higher weights during training

Why it helps:

Reduces the dominance of frequent labels
Gives rare target ranges stronger gradient contribution
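
The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the repository's actual API: the function name `lds_weights` and its parameters (`n_bins`, `kernel_size`, `sigma`) are hypothetical, and inverse-density weighting normalized to a mean of 1 is one common choice among several.

```python
import numpy as np

def lds_weights(targets, n_bins=50, kernel_size=5, sigma=2.0):
    """Return one weight per sample, inversely proportional to the
    smoothed label density (hypothetical helper, for illustration)."""
    targets = np.asarray(targets, dtype=float)

    # 1. Map each target value to a bin index in [0, n_bins - 1].
    edges = np.linspace(targets.min(), targets.max(), n_bins + 1)
    bin_idx = np.digitize(targets, edges[1:-1])
    counts = np.bincount(bin_idx, minlength=n_bins).astype(float)

    # 2. Smooth the bin counts with a normalized Gaussian kernel.
    half = kernel_size // 2
    offsets = np.arange(-half, half + 1)
    kernel = np.exp(-0.5 * (offsets / sigma) ** 2)
    kernel /= kernel.sum()
    smoothed = np.convolve(counts, kernel, mode="same")

    # 3. Turn smoothed density into weights: sparse bins get larger weights.
    weights = 1.0 / np.maximum(smoothed[bin_idx], 1e-8)
    return weights / weights.mean()  # keep the average weight at 1

rng = np.random.default_rng(0)
# Synthetic imbalanced targets: many "normal" values, few extreme ones.
y = np.concatenate([rng.normal(0, 1, 5000), rng.normal(6, 0.3, 50)])
w = lds_weights(y)
```

The resulting `w` can be passed to any weighted regression loss. Normalizing to a mean of 1 keeps the overall loss scale comparable to unweighted training.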

2. FDS (Feature Distribution Smoothing)

What it does:

Regularizes intermediate feature representations across target bins
Aligns statistics (mean/variance) between neighboring regions

How it works:

Tracks running feature statistics (mean and variance) per target bin
Smooths those statistics across neighboring bins and updates them gradually over training
Calibrates features using the smoothed statistics during training

Why it helps:

Prevents unstable representations in sparse regions
Improves continuity and robustness across the target space
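
As a rough sketch of the idea, the class below keeps per-bin running statistics, smooths them across neighboring bins, and re-scales features toward the smoothed values. It is an assumption-laden illustration in NumPy rather than the code in `fds.py`: the class name `FDSSketch`, the momentum update, the Gaussian kernel, and the edge padding are all choices made here for clarity.

```python
import numpy as np

class FDSSketch:
    """Hypothetical, simplified feature distribution smoothing."""

    def __init__(self, n_bins, feat_dim, momentum=0.9, kernel_size=5, sigma=2.0):
        self.momentum = momentum
        self.mean = np.zeros((n_bins, feat_dim))  # running mean per bin
        self.var = np.ones((n_bins, feat_dim))    # running variance per bin
        half = kernel_size // 2
        k = np.exp(-0.5 * (np.arange(-half, half + 1) / sigma) ** 2)
        self.kernel = k / k.sum()

    def update(self, feats, bin_idx):
        """Exponential moving average of per-bin statistics."""
        m = self.momentum
        for b in np.unique(bin_idx):
            f = feats[bin_idx == b]
            self.mean[b] = m * self.mean[b] + (1 - m) * f.mean(axis=0)
            self.var[b] = m * self.var[b] + (1 - m) * f.var(axis=0)

    def _smooth(self, stats):
        # Edge-pad so border bins are not averaged with zeros.
        half = len(self.kernel) // 2
        padded = np.pad(stats, ((half, half), (0, 0)), mode="edge")
        cols = [np.convolve(padded[:, d], self.kernel, mode="valid")
                for d in range(stats.shape[1])]
        return np.stack(cols, axis=1)

    def calibrate(self, feats, bin_idx, eps=1e-6):
        """Whiten with the bin's own statistics, re-color with smoothed ones."""
        s_mean, s_var = self._smooth(self.mean), self._smooth(self.var)
        scale = np.sqrt((s_var[bin_idx] + eps) / (self.var[bin_idx] + eps))
        return (feats - self.mean[bin_idx]) * scale + s_mean[bin_idx]

rng = np.random.default_rng(0)
feats = rng.normal(size=(40, 3))         # stand-in for intermediate features
bins = np.repeat(np.arange(4), 10)       # target bin of each sample
fds = FDSSketch(n_bins=4, feat_dim=3, momentum=0.0)
fds.update(feats, bins)
calibrated = fds.calibrate(feats, bins)
```

In a real training loop the calibration would typically be applied to a hidden layer's output during training only, with statistics refreshed between epochs.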

[Figure: LDS + FDS pipeline illustration]

3. HCA (Hierarchical Coarse-to-Fine Adjustment)

What it does:

Reformulates a single difficult regression objective into multiple hierarchical prediction levels

How it works:

Learns coarse structure first (broad target regions)
Adds finer-grained predictions for precision
Combines hierarchical supervision to stabilize learning

Why it helps:

Improves optimization under imbalance
Balances global structure and local precision
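
One way to picture the coarse-to-fine reformulation is to derive several classification targets of increasing resolution from the same continuous label, plus a within-bin residual for the final precision. The sketch below is an illustration under assumptions, not the repository's implementation: the helper names, the equal-width bins, and the level sizes `(2, 4, 8)` are all hypothetical.

```python
import numpy as np

def hierarchical_labels(y, y_min, y_max, levels=(2, 4, 8)):
    """Map each target to a class index at every level, coarse to fine."""
    unit = (np.asarray(y, dtype=float) - y_min) / (y_max - y_min)  # scale to [0, 1]
    return {n: np.clip((unit * n).astype(int), 0, n - 1) for n in levels}

def bin_centers(y_min, y_max, n_bins):
    """Center value of each equal-width bin."""
    width = (y_max - y_min) / n_bins
    return y_min + width * (np.arange(n_bins) + 0.5)

def within_bin_residual(y, cls, y_min, y_max, n_bins):
    """Fine-grained regression target: offset from the bin center."""
    return np.asarray(y, dtype=float) - bin_centers(y_min, y_max, n_bins)[cls]

def coarse_to_fine(cls, residual, y_min, y_max, n_bins):
    """Combine a predicted bin with a predicted residual into a value."""
    return bin_centers(y_min, y_max, n_bins)[cls] + residual

y = np.array([0.0, 1.0, 2.5, 9.99, 10.0])
labels = hierarchical_labels(y, 0.0, 10.0)
residual = within_bin_residual(y, labels[8], 0.0, 10.0, 8)
rebuilt = coarse_to_fine(labels[8], residual, 0.0, 10.0, 8)
```

With bin counts that double at each level, every fine bin nests inside exactly one coarser bin, so a model can be supervised at all levels at once and the coarse heads constrain where the fine prediction may land.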

[Figure: HCA architecture illustration]

3.5 Transformer Baseline

What it does:

Uses sequence self-attention to aggregate temporal information across timesteps

How it works:

Projects each timestep into an embedding space
Adds positional information to preserve order
Passes sequence through stacked transformer encoder layers
Uses a final regression head to predict the continuous target

Why it helps:

Captures long-range temporal dependencies
Provides a strong sequence modeling baseline for comparison
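
The flow described above maps onto standard PyTorch building blocks. The following is a minimal sketch, assuming PyTorch is the framework in use; the class name `TransformerRegressor`, the sinusoidal positional encoding, mean pooling over timesteps, and all hyperparameters are illustrative assumptions rather than the repository's configuration.

```python
import math
import torch
from torch import nn

class TransformerRegressor(nn.Module):
    """Hypothetical sequence-to-scalar baseline."""

    def __init__(self, n_features, d_model=64, n_heads=4, n_layers=2, max_len=512):
        super().__init__()
        # Project each timestep into the embedding space.
        self.input_proj = nn.Linear(n_features, d_model)

        # Fixed sinusoidal positional encoding (d_model must be even).
        pos = torch.arange(max_len).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)

        # Stacked encoder layers operating on (batch, seq, feature) tensors.
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

        # Regression head producing one continuous value per sequence.
        self.head = nn.Linear(d_model, 1)

    def forward(self, x):
        # x: (batch, seq_len, n_features)
        h = self.input_proj(x) + self.pe[: x.size(1)]
        h = self.encoder(h)
        return self.head(h.mean(dim=1)).squeeze(-1)  # mean-pool over time

torch.manual_seed(0)
model = TransformerRegressor(n_features=6).eval()
with torch.no_grad():
    preds = model(torch.randn(3, 20, 6))  # 3 sequences, 20 timesteps, 6 features
```

Mean pooling is only one option for summarizing the sequence; using the last timestep or a learned summary token would fit the same description.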

[Figure: Transformer model flow illustration]

4. UVOTE (Uncertainty Voting Ensemble)

What it does:

Uses multiple expert heads and uncertainty-aware selection

How it works:

Trains diverse regressors with different weighting behavior
Each head predicts both value and confidence/uncertainty
At inference, the system chooses the most reliable expert per sample

Why it helps:

Encourages specialization across target regions
Improves robustness under distribution skew
Produces more reliable predictions in difficult cases
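
The selection step at inference can be sketched independently of how the experts are trained. In the example below, each expert outputs a mean and a log-variance, and the expert with the lowest predicted variance wins for each sample. The function names and the Gaussian negative log-likelihood used to train value and uncertainty together are assumptions made for illustration; the repository may use a different likelihood or selection rule.

```python
import numpy as np

def gaussian_nll(y, mean, log_var):
    """Per-sample Gaussian negative log-likelihood (constant term dropped).
    Training with this loss lets a head learn its own uncertainty."""
    return 0.5 * (log_var + (y - mean) ** 2 / np.exp(log_var))

def uncertainty_vote(means, log_vars):
    """Pick, for each sample, the expert with the lowest predicted variance.

    means, log_vars: arrays of shape (n_experts, n_samples).
    Returns the selected predictions and the index of the winning expert.
    """
    means, log_vars = np.asarray(means), np.asarray(log_vars)
    winner = np.argmin(log_vars, axis=0)
    cols = np.arange(means.shape[1])
    return means[winner, cols], winner

# Two hypothetical experts scoring three samples.
means = np.array([[1.0, 2.0, 3.0],
                  [10.0, 20.0, 30.0]])
log_vars = np.array([[0.0, 1.0, -2.0],
                     [-1.0, 2.0, 0.0]])
pred, winner = uncertainty_vote(means, log_vars)
```

Here the second expert is chosen for the first sample and the first expert for the other two, because those are the heads reporting the lowest uncertainty on each sample.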

Best Performing Strategy

Across the implemented approaches, the UVOTE-style ensemble was the most effective overall in balancing two competing objectives:

Better performance in rare, critical regions
No major degradation in common regions

In short, uncertainty-aware expert selection provided the strongest trade-off between robustness and precision for imbalanced regression.
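
A trade-off like this is easiest to check by reporting error separately for rare and common target regions. The helper below is a hypothetical sketch: it treats targets above a chosen quantile as "rare", which is only one possible definition and not necessarily the one used in `metrics.py`.

```python
import numpy as np

def region_mae(y_true, y_pred, rare_quantile=0.9):
    """Mean absolute error split into rare (upper tail) and common regions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    threshold = np.quantile(y_true, rare_quantile)
    rare = y_true > threshold            # assumption: rare = upper tail
    err = np.abs(y_true - y_pred)
    return {
        "rare": float(err[rare].mean()),
        "common": float(err[~rare].mean()),
        "threshold": float(threshold),
    }

y_true = np.arange(10.0)
y_pred = y_true.copy()
y_pred[-1] += 2.0                        # error only on the most extreme sample
scores = region_mae(y_true, y_pred)
```

Comparing the two numbers across methods shows whether a gain in the rare region was bought with a loss in the common one.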

What This Repository Includes

Label Distribution Smoothing (LDS)
Feature Distribution Smoothing (FDS)
Weighted regression losses
Sequence dataset utilities
Reproducibility helpers
Synthetic example training script
Lightweight unit tests
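
To give a feel for two of these components, here is a small sketch of a weighted regression loss and a sliding-window sequence builder. Both function names and signatures are hypothetical and are not taken from `losses.py` or `datasets.py`; NumPy is used here only to keep the example dependency-light.

```python
import numpy as np

def weighted_mse(y_true, y_pred, weights):
    """Squared error averaged with per-sample weights (e.g. from LDS)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    w = np.asarray(weights, dtype=float)
    return float(np.sum(w * (y_true - y_pred) ** 2) / np.sum(w))

def sliding_windows(series, targets, window):
    """Cut a (T, F) series into overlapping windows of length `window`.

    Returns X with shape (T - window + 1, window, F) and the target
    aligned with the last timestep of each window.
    """
    view = np.lib.stride_tricks.sliding_window_view(series, window, axis=0)
    return view.transpose(0, 2, 1), np.asarray(targets)[window - 1:]

series = np.arange(12.0).reshape(6, 2)   # 6 timesteps, 2 features
targets = np.arange(6.0)
X, y = sliding_windows(series, targets, window=3)
loss = weighted_mse([0.0, 0.0], [1.0, 3.0], [3.0, 1.0])
```

Dividing by the sum of the weights, rather than the number of samples, keeps the loss on the same scale regardless of how the weights are normalized.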

Confidentiality and Safety

This version is sanitized for public sharing:

No raw input data
No generated predictions or submissions
No model checkpoints or private artifacts
No competition IDs, private paths, or organization-sensitive metadata

See docs/CONFIDENTIALITY.md for details.

For public publishing hygiene, also see:

Installation

pip install -r requirements.txt
pip install -e .

Quick Start

Run a toy example with synthetic data:

python examples/train_toy.py

Project Structure

data-imbalance-regression/
  src/data_imbalance_regression/
    datasets.py
    fds.py
    lds.py
    losses.py
    metrics.py
    reproducibility.py
  examples/
    train_toy.py
  tests/
    test_losses.py
    test_lds.py
  docs/
    CONFIDENTIALITY.md

Acknowledgments

This work is inspired by prior research on deep imbalanced regression and was adapted into a generalized, confidentiality-safe codebase for educational and portfolio use.

Team

Contributed by:

Mohamed Shattat
Natchiket
Prathhek
Vibour
