This project implements a custom gradient boosting algorithm from scratch for classifying liver cirrhosis stages, using pandas and numpy for core computations. The implementation features a novel optimization approach based on the multivariate Newton-Raphson method, which minimizes the cross-entropy loss through a second-order approximation.
- Python 3.13.0
- Pandas 2.2.3
- Numpy 1.26.4
- Matplotlib 3.10.0
- Custom Gradient Boosting with configurable hyperparameters
- Multivariate Newton-Raphson Optimization for cross-entropy minimization
- Multithreading for cross-validation (see the sketch below)
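A minimal sketch of the parallel cross-validation pattern, assuming Python's standard `concurrent.futures` thread pool; the helper names (`evaluate_fold`, `cross_validate`) and the most-frequent-class stand-in model are illustrative, not the repository's actual API.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def evaluate_fold(train_idx, val_idx, X, y):
    # Stand-in for fitting the boosting model on one fold;
    # here it only scores a most-frequent-class baseline on the validation fold.
    majority = np.bincount(y[train_idx]).argmax()
    return float(np.mean(y[val_idx] == majority))


def cross_validate(X, y, n_folds=5):
    # Split sample indices into folds and evaluate each fold in its own thread.
    folds = np.array_split(np.arange(len(y)), n_folds)
    with ThreadPoolExecutor(max_workers=n_folds) as pool:
        jobs = []
        for i, val_idx in enumerate(folds):
            train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
            jobs.append(pool.submit(evaluate_fold, train_idx, val_idx, X, y))
        scores = [job.result() for job in jobs]
    return scores


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X_demo = rng.normal(size=(300, 5))
    y_demo = rng.integers(0, 3, size=300)  # e.g. three cirrhosis stages
    print(cross_validate(X_demo, y_demo))
```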
The principle of gradient boosting resides in the sequential construction of weak learners that approximate the negative gradient of the loss function. Cross-entropy is used as the loss function.
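For reference, one common formulation of this setup (stated here for clarity; the repository's exact parameterization may differ) uses the softmax cross-entropy over raw scores $F_k(x_i)$, whose per-sample gradient and diagonal Hessian drive the Newton step:

$$
L(F) = -\sum_{i}\sum_{k} y_{ik}\,\log p_{ik},
\qquad
p_{ik} = \frac{e^{F_k(x_i)}}{\sum_{l} e^{F_l(x_i)}},
$$

$$
g_{ik} = \frac{\partial L}{\partial F_k(x_i)} = p_{ik} - y_{ik},
\qquad
h_{ik} = \frac{\partial^2 L}{\partial F_k(x_i)^2} = p_{ik}\,(1 - p_{ik}).
$$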
We look for the new weak learner as a constant correction to the previous ensemble. To approximate the loss function, we expand it in a Taylor series and find the constant that minimizes this approximation.
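A sketch of that step under the same formulation: adding a constant correction $\gamma$ to a leaf, expanding the loss to second order, and setting the derivative with respect to $\gamma$ to zero (regularization terms omitted) gives the Newton-Raphson leaf value:

$$
L(F + \gamma) \approx L(F) + \gamma \sum_{i \in \text{leaf}} g_i + \frac{\gamma^2}{2} \sum_{i \in \text{leaf}} h_i
\quad\Longrightarrow\quad
\gamma^{*} = -\,\frac{\sum_{i \in \text{leaf}} g_i}{\sum_{i \in \text{leaf}} h_i}.
$$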
Substituting the resulting expression back into the loss function, we obtain an approximation of the Gain used to score splits.
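Under that approximation, the gain of splitting a node into left ($L$) and right ($R$) children takes the familiar second-order form (written here without the l1/l2 regularization terms that the hyperparameter search scripts suggest the implementation supports):

$$
\text{Gain} = \frac{1}{2}\left[
\frac{\bigl(\sum_{i \in L} g_i\bigr)^2}{\sum_{i \in L} h_i}
+ \frac{\bigl(\sum_{i \in R} g_i\bigr)^2}{\sum_{i \in R} h_i}
- \frac{\bigl(\sum_{i \in L \cup R} g_i\bigr)^2}{\sum_{i \in L \cup R} h_i}
\right].
$$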
The model currently achieves an AUC-ROC below 0.5, indicating no discriminative capability beyond random chance.
Possible reasons:
- Insufficient ensemble size (<200 weak learners)
- Suboptimal hyperparameter configuration (tree depth, learning rate)
- Potential implementation constraints in Newton-Raphson optimization
The fundamental bottleneck is limited computational throughput, which constrains building larger, more robust tree ensembles even with the multithreaded implementation.
```
#clone repository:
git clone git@github.com:Mayerle/MLClassification.git
#install requirements:
pip install -r requirements.txt
```
- main.py: loading and using a saved model
- random_search.py: general hyperparameter search
- random_search_l1l2.py: l1 and l2 hyperparameter search
- train.py: training the model
- The baseline always predicts the most frequent class.
- The model is gradient-boosted trees with depth 3 and 200 trees.

