A clean and portfolio-ready implementation of techniques for imbalanced regression in time-series settings.
This project studies a practical machine learning problem: how to improve regression performance in rare but high-impact scenarios without hurting performance in common scenarios.
In many real-world sensor and time-series systems, the target distribution is heavily imbalanced. Most samples come from normal operating conditions, while extreme cases are scarce. Standard training often prioritizes frequent regions and underperforms exactly where precision is most important.
The purpose of this repository is to provide a clean, reusable, and public-safe reference implementation of imbalance-aware regression strategies that can be adapted to many continuous prediction tasks.
Core goals:
- Improve learning in underrepresented target regions
- Preserve stability in high-frequency regions
- Keep methods modular, testable, and reproducible
- Provide practical examples without exposing private data
This repository brings together complementary methods that address imbalance at different stages of the learning pipeline.
Label Distribution Smoothing (LDS)
What it does:
- Smooths the empirical target distribution using a local kernel (for example, Gaussian)
- Converts smoothed density into sample weights
How it works:
- Each target value is mapped to a bin
- Bin counts are smoothed across neighbors
- Samples from sparse bins receive higher weights during training
Why it helps:
- Reduces the dominance of frequent labels
- Gives rare target ranges stronger gradient contribution
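The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code in `lds.py`: the function name `lds_weights`, its parameters, and the defaults (20 bins, a 5-tap Gaussian kernel with `sigma=2.0`) are assumptions chosen for the example.

```python
import numpy as np

def lds_weights(targets, n_bins=20, sigma=2.0, kernel_half_width=2):
    """Return one weight per sample, inversely proportional to the smoothed label density.

    Hypothetical helper for illustration; assumes n_bins >= 2 * kernel_half_width + 1.
    """
    targets = np.asarray(targets, dtype=float)

    # 1. Map each target value to a bin index in [0, n_bins - 1].
    edges = np.linspace(targets.min(), targets.max(), n_bins + 1)
    bin_idx = np.clip(np.digitize(targets, edges[1:-1]), 0, n_bins - 1)
    counts = np.bincount(bin_idx, minlength=n_bins).astype(float)

    # 2. Smooth the bin counts across neighbours with a normalised Gaussian kernel.
    offsets = np.arange(-kernel_half_width, kernel_half_width + 1)
    kernel = np.exp(-0.5 * (offsets / sigma) ** 2)
    kernel /= kernel.sum()
    smoothed = np.convolve(counts, kernel, mode="same")

    # 3. Invert the smoothed density and rescale so the weights average to 1.
    weights = 1.0 / np.maximum(smoothed[bin_idx], 1e-8)
    return weights / weights.mean()
```

Rescaling to a mean of 1 keeps the overall loss magnitude comparable to unweighted training, so the learning rate usually does not need retuning. That normalization is a design choice of this sketch rather than a requirement of the method.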
Feature Distribution Smoothing (FDS)
What it does:
- Regularizes intermediate feature representations across target bins
- Aligns statistics (mean/variance) between neighboring regions
How it works:
- Tracks running feature statistics per target bucket
- Smooths bucket statistics over time
- Calibrates features using smoothed statistics during training
Why it helps:
- Prevents unstable representations in sparse regions
- Improves continuity and robustness across the target space
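A rough NumPy sketch of the bookkeeping described above follows. The class name `FeatureSmoother`, the momentum update, and the edge-padded Gaussian smoothing are assumptions made for illustration and may differ from what `fds.py` does.

```python
import numpy as np

class FeatureSmoother:
    """Hypothetical FDS-style helper: running per-bucket statistics plus calibration."""

    def __init__(self, n_buckets, feature_dim, momentum=0.9, sigma=1.0, half_width=2):
        self.momentum = momentum
        self.mean = np.zeros((n_buckets, feature_dim))
        self.var = np.ones((n_buckets, feature_dim))
        offsets = np.arange(-half_width, half_width + 1)
        kernel = np.exp(-0.5 * (offsets / sigma) ** 2)
        self.kernel = kernel / kernel.sum()

    def update(self, features, buckets):
        """Blend batch statistics of each bucket into the running estimates."""
        m = self.momentum
        for b in np.unique(buckets):
            rows = features[buckets == b]
            if len(rows) < 2:  # skip buckets with too few samples for a variance
                continue
            self.mean[b] = m * self.mean[b] + (1 - m) * rows.mean(axis=0)
            self.var[b] = m * self.var[b] + (1 - m) * rows.var(axis=0)

    def _smooth(self, stats):
        # Convolve along the bucket axis, one feature column at a time.
        pad = len(self.kernel) // 2
        padded = np.pad(stats, ((pad, pad), (0, 0)), mode="edge")
        columns = [np.convolve(padded[:, j], self.kernel, mode="valid")
                   for j in range(stats.shape[1])]
        return np.stack(columns, axis=1)

    def calibrate(self, features, buckets, eps=1e-6):
        """Whiten with the bucket's own statistics, then re-colour with smoothed ones."""
        mean_s, var_s = self._smooth(self.mean), self._smooth(self.var)
        whitened = (features - self.mean[buckets]) / np.sqrt(self.var[buckets] + eps)
        return whitened * np.sqrt(var_s[buckets] + eps) + mean_s[buckets]
```

Here `features` is assumed to be an `(n_samples, feature_dim)` array and `buckets` an integer array of target-bucket indices. In a training loop one would typically call `calibrate` on intermediate features during the forward pass and `update` periodically, for example once per epoch.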
[Figure: LDS + FDS pipeline illustration]
Hierarchical prediction (HCA)
What it does:
- Reformulates a single difficult regression objective into multiple hierarchical prediction levels
How it works:
- Learns coarse structure first (broad target regions)
- Adds finer-grained predictions for precision
- Combines hierarchical supervision to stabilize learning
Why it helps:
- Improves optimization under imbalance
- Balances global structure and local precision
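One way to picture the coarse-to-fine levels is to label every target at several bin resolutions. The sketch below only builds those labels. The function name `hierarchical_labels`, the 2/4/8 bin schedule, and the equal-width binning are assumptions for illustration, not a description of the repository's implementation.

```python
import numpy as np

def hierarchical_labels(targets, y_min, y_max, bins_per_level=(2, 4, 8)):
    """Return an (n_samples, n_levels) integer array of bin labels, coarse to fine.

    Hypothetical helper: each level splits [y_min, y_max] into equal-width bins.
    """
    targets = np.asarray(targets, dtype=float)
    unit = np.clip((targets - y_min) / (y_max - y_min), 0.0, 1.0)
    levels = []
    for n_bins in bins_per_level:
        # The top edge (unit == 1.0) is folded into the last bin.
        levels.append(np.minimum((unit * n_bins).astype(int), n_bins - 1))
    return np.stack(levels, axis=1)

labels = hierarchical_labels([0.0, 0.3, 0.99, 1.0], y_min=0.0, y_max=1.0)
# labels.tolist() -> [[0, 0, 0], [0, 1, 2], [1, 3, 7], [1, 3, 7]]
```

With a bin count that doubles at each level, every fine bin sits inside exactly one coarse bin, so a model could be supervised with one classification loss per level alongside the final regression loss. How the levels are actually combined is a design decision this sketch leaves open.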
[Figure: HCA architecture illustration]
Transformer sequence model
What it does:
- Uses sequence self-attention to aggregate temporal information across timesteps
How it works:
- Projects each timestep into an embedding space
- Adds positional information to preserve order
- Passes sequence through stacked transformer encoder layers
- Uses a final regression head to predict the continuous target
Why it helps:
- Captures long-range temporal dependencies
- Provides a strong sequence modeling baseline for comparison
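The data flow above can be traced with a deliberately small NumPy sketch: one attention head, one layer, and untrained weights. It omits the feed-forward sublayers, layer normalization, and stacking that a real transformer encoder has. All names here (`attention_regressor`, the `params` keys) are hypothetical, and an actual model would normally be built with a deep learning framework.

```python
import numpy as np

def positional_encoding(n_steps, d_model):
    """Sinusoidal position codes: sine on even channels, cosine on odd channels."""
    pos = np.arange(n_steps)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_regressor(x, params):
    """Map one sequence of shape (n_steps, d_in) to a single continuous prediction."""
    # Project each timestep into the embedding space and add positional information.
    h = x @ params["w_in"] + positional_encoding(x.shape[0], params["w_in"].shape[1])
    # Single-head scaled dot-product self-attention across timesteps.
    q, k, v = h @ params["w_q"], h @ params["w_k"], h @ params["w_v"]
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (n_steps, n_steps), rows sum to 1
    h = h + attn @ v  # residual connection
    # Regression head: mean-pool over time, then a linear map to a scalar.
    return float(h.mean(axis=0) @ params["w_out"])

rng = np.random.default_rng(0)
d_in, d_model = 4, 8
params = {
    "w_in": rng.normal(scale=0.1, size=(d_in, d_model)),
    "w_q": rng.normal(scale=0.1, size=(d_model, d_model)),
    "w_k": rng.normal(scale=0.1, size=(d_model, d_model)),
    "w_v": rng.normal(scale=0.1, size=(d_model, d_model)),
    "w_out": rng.normal(scale=0.1, size=d_model),
}
prediction = attention_regressor(rng.normal(size=(12, d_in)), params)
```

Because every timestep attends to every other timestep in one step, the attention matrix is what lets the model relate distant points in the sequence.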
[Figure: Transformer model flow illustration]
UVOTE-style expert ensemble
What it does:
- Uses multiple expert heads and uncertainty-aware selection
How it works:
- Trains diverse regressors with different weighting behavior
- Each head predicts both value and confidence/uncertainty
- At inference, the system chooses the most reliable expert per sample
Why it helps:
- Encourages specialization across target regions
- Improves robustness under distribution skew
- Produces more reliable predictions in difficult cases
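The selection step can be sketched as follows, assuming each expert outputs a predicted mean and a predicted variance per sample and that "most reliable" means "lowest predicted variance". The function names and the Gaussian negative log-likelihood used as a training objective are assumptions for illustration. The repository's actual heads and loss may differ.

```python
import numpy as np

def weighted_gaussian_nll(target, mean, variance, weights):
    """Per-sample Gaussian negative log-likelihood (constant term dropped), weighted."""
    nll = 0.5 * (np.log(variance) + (target - mean) ** 2 / variance)
    return float(np.mean(weights * nll))

def select_most_confident(means, variances):
    """Pick, for every sample, the prediction of the expert with the lowest variance.

    means, variances: arrays of shape (n_experts, n_samples).
    Returns (predictions, chosen_expert_index), each of shape (n_samples,).
    """
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    best = variances.argmin(axis=0)
    return means[best, np.arange(means.shape[1])], best

means = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]
variances = [[0.1, 5.0, 0.2], [1.0, 0.5, 0.3]]
preds, chosen = select_most_confident(means, variances)
# preds.tolist() -> [1.0, 20.0, 3.0]; chosen.tolist() -> [0, 1, 0]
```

Training each head with a different weighting in `weighted_gaussian_nll` (for example, uniform weights for one expert and density-based weights for another) is one way to obtain experts that specialize in different target regions.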
Across the implemented approaches, the UVOTE-style ensemble was the most effective overall in balancing two competing objectives:
- Better performance in rare, critical regions
- No major degradation in common regions
In short, uncertainty-aware expert selection provided the strongest trade-off between robustness and precision for imbalanced regression.
Included components:
- Label Distribution Smoothing (LDS)
- Feature Distribution Smoothing (FDS)
- Weighted regression losses
- Sequence dataset utilities
- Reproducibility helpers
- Synthetic example training script
- Lightweight unit tests
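To illustrate what a weighted regression loss from the list above looks like, here is a minimal NumPy version. The name `weighted_mse` and the choice to normalize by the sum of the weights are assumptions for this example and are not taken from `losses.py`.

```python
import numpy as np

def weighted_mse(pred, target, weights):
    """Mean squared error where each sample's contribution is scaled by its weight."""
    pred, target, weights = (np.asarray(a, dtype=float) for a in (pred, target, weights))
    return float(np.sum(weights * (pred - target) ** 2) / np.sum(weights))

uniform = weighted_mse([1.0, 2.0], [0.0, 0.0], [1.0, 1.0])    # (1 + 4) / 2 = 2.5
reweighted = weighted_mse([1.0, 2.0], [0.0, 0.0], [3.0, 1.0])  # (3 + 4) / 4 = 1.75
```

Dividing by the sum of the weights keeps the loss on the same scale as plain MSE regardless of how the weights are normalized.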
This version is sanitized for public sharing:
- No raw input data
- No generated predictions or submissions
- No model checkpoints or private artifacts
- No competition IDs, private paths, or organization-sensitive metadata
See docs/CONFIDENTIALITY.md for details.
For public publishing hygiene, also see:
Install the dependencies and the package:

    pip install -r requirements.txt
    pip install -e .

Run a toy example with synthetic data:

    python examples/train_toy.py

Repository layout:

    data-imbalance-regression/
        src/data_imbalance_regression/
            datasets.py
            fds.py
            lds.py
            losses.py
            metrics.py
            reproducibility.py
        examples/
            train_toy.py
        tests/
            test_losses.py
            test_lds.py
        docs/
            CONFIDENTIALITY.md

This work is inspired by prior research on deep imbalanced regression and was adapted into a generalized, confidentiality-safe codebase for educational and portfolio use.
Contributed by:
- Mohamed Shattat
- Natchiket
- Prathhek
- Vibour


