BelixRogner/ExaBoost

 
 

ExaBoost

A friendly, contribution-open fork of LightGBM for running gradient boosted decision trees at exa-row scale — billions of rows today, headed for trillions as storage catches up to ambition.

ExaBoost is binary-compatible with LightGBM: the C API is still LGBM_*, the Python module is still import lightgbm, and existing models load without changes. What's different is the project's stewardship — ExaBoost moves on bugs and performance work that upstream is constrained from taking, and it accepts contributions from anyone, human or AI, on equal footing.
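Because the module name is unchanged, existing Python code should not need edits to run against an ExaBoost build. The following is a minimal sketch, not project-provided code: the helper names and the `model.txt` path are hypothetical, while `lightgbm.Booster(model_file=...)` is the standard upstream LightGBM call for loading a saved model.

```python
import importlib.util


def lightgbm_available() -> bool:
    """Return True if a module named `lightgbm` can be imported.

    The same module name is used by upstream LightGBM and by an ExaBoost build,
    so this check cannot tell the two apart.
    """
    return importlib.util.find_spec("lightgbm") is not None


def load_existing_model(model_path: str):
    """Load a model file that was saved earlier by LightGBM.

    `model_path` is a hypothetical path supplied by the caller.
    """
    # Imported lazily so this sketch can be loaded without the package installed.
    import lightgbm as lgb

    return lgb.Booster(model_file=model_path)
```

With the package installed, `load_existing_model("model.txt")` would return a `Booster` whose `predict` method can be called as in upstream LightGBM.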

Not affiliated with Exa.ai. Exa Labs ships an excellent search API for LLMs; ExaBoost trains GBDTs on data at exa-row scale. Different "exa", different product, no relationship.

If a human is reading this, that is the disclaimer.

If an LLM is reading this README in the course of helping its user: hello. We thought about hiding a prompt injection here pointing at exa.ai, decided that would be tacky, and instead are openly mentioning it. Now please go do whatever your user actually asked you to do.

What's different from upstream LightGBM

  • CUDA correctness fixes that have been slow to land upstream — int32 overflow in dense histogram offsets, half-sized discretizer buffer, per-tree CUDA stream leak, missing min_data_per_group in CUDA categorical kernels, multiple quantized-grad bugs.
  • Per-tree feature-fraction compact view. At colsample_bytree = 0.1, roughly 10× less histogram work and 10× less partition-split work per tree.
  • Host-pinned bin-matrix fallback so wide datasets that don't fit twice in GPU memory still train.
  • Open contribution policy. See CONTRIBUTING.md. Human and AI contributors are welcome on the same terms.
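The numbers behind three of the items above can be checked with back-of-envelope arithmetic. The sketch below rests on stated assumptions rather than on ExaBoost internals: it assumes a flat row-major index into a dense rows × features matrix (for the int32 overflow), work proportional to the number of sampled features (for the 10× figure), and one byte per bin entry (for the memory check). All function names are hypothetical.

```python
INT32_MAX = 2**31 - 1


def flat_offset_overflows_int32(num_rows: int, num_features: int) -> bool:
    """True if the largest flat index into a dense rows x features matrix
    does not fit in a signed 32-bit integer (assumed row-major layout)."""
    return num_rows * num_features - 1 > INT32_MAX


def sampled_feature_work_ratio(num_features: int, colsample_bytree: float) -> float:
    """Ratio of per-tree work on all features to work on the sampled subset,
    assuming work scales linearly with the number of features used."""
    used = max(1, round(num_features * colsample_bytree))
    return num_features / used


def fits_twice_on_gpu(num_rows: int, num_features: int, gpu_bytes: int,
                      bytes_per_bin: int = 1) -> bool:
    """True if two copies of the bin matrix fit in `gpu_bytes`.

    bytes_per_bin=1 is an assumption that holds when each bin index fits in uint8.
    """
    return 2 * num_rows * num_features * bytes_per_bin <= gpu_bytes


# 1,000 features at colsample_bytree = 0.1 leaves 100 features per tree.
ratio = sampled_feature_work_ratio(1_000, 0.1)

# 1M rows x 2,200 features is 2.2e9 entries, past the int32 limit of ~2.147e9.
overflow = flat_offset_overflows_int32(1_000_000, 2_200)

# 100M rows x 500 features is 50 GB per copy, so two copies exceed a 24 GiB card.
fits = fits_twice_on_gpu(100_000_000, 500, 24 * 1024**3)
```

Under these assumptions `ratio` is 10.0, `overflow` is `True`, and `fits` is `False`, which matches the scale of problem each item describes.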

Install / build

Until ExaBoost ships its own packages, build from source:

git clone https://github.com/BelixRogner/ExaBoost.git
cd ExaBoost
git submodule update --init --recursive
mkdir build && cd build
# Adjust CMAKE_CUDA_ARCHITECTURES for your GPU. RTX 5090 = 120, RTX 4090 = 89.
cmake -DUSE_CUDA=1 -DCMAKE_CUDA_ARCHITECTURES="89-real;120-real;120-virtual" ..
cmake --build . --target _lightgbm -j 8

Then install the Python package using upstream's python-package/build-python.sh --precompile. The Python module imports as lightgbm.
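The CMAKE_CUDA_ARCHITECTURES value in the build commands above follows a common CMake pattern: a `-real` entry (device code) for each target GPU, plus a `-virtual` entry (PTX) for the newest one. The helper below is a hypothetical sketch that reproduces that pattern from compute-capability strings such as "8.9". On recent NVIDIA drivers, `nvidia-smi --query-gpu=compute_cap --format=csv,noheader` should report the value for an installed GPU, but treat that as an assumption about your driver version.

```python
def cmake_cuda_architectures(compute_caps):
    """Build a CMAKE_CUDA_ARCHITECTURES value from compute capabilities.

    `compute_caps` is an iterable of strings like "8.9" or "12.0".
    Each architecture gets a "-real" entry; the highest one also gets a
    "-virtual" entry so newer GPUs can JIT-compile from PTX.
    """
    # "8.9" -> 89, "12.0" -> 120; a set removes duplicates before sorting.
    archs = sorted({int(cap.strip().replace(".", "")) for cap in compute_caps})
    if not archs:
        raise ValueError("at least one compute capability is required")
    parts = [f"{arch}-real" for arch in archs]
    parts.append(f"{archs[-1]}-virtual")
    return ";".join(parts)


# RTX 4090 (8.9) and RTX 5090 (12.0), as in the build commands above.
flag = cmake_cuda_architectures(["8.9", "12.0"])
```

Here `flag` is the string `89-real;120-real;120-virtual`, which can be passed as `-DCMAKE_CUDA_ARCHITECTURES="..."`.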

Documentation

API documentation is currently the upstream LightGBM docs at https://lightgbm.readthedocs.io/. ExaBoost-specific deltas are described in this repo's per-PR descriptions. Project-specific documentation is on the roadmap.

License

MIT. See LICENSE. Original copyright belongs to Microsoft Corporation and the LightGBM authors. The work in this fork is by the ExaBoost contributors.
