[cuda] fix unweighted percentile formula in PercentileGlobalKernel (init scores) #9
Merged
BelixRogner merged 7 commits on May 18, 2026
Same bug as PR lightgbm-org#6 fixed for the in-block PercentileDevice, but in the global-memory kernel used for init-score computation. The unweighted branch of PercentileGlobalKernel computed the percentile position against `len` instead of `len - 1`, biasing alpha=0.5 toward the upper-middle element on descending-sort layouts.

Reproducer (with the Python wrapper's optimization that drops uniform weights, this is the path actually executed by `objective=regression_l1` or `quantile` when sample weights aren't supplied or are all 1):

    y = [1, 2, 3, 4, 5]
    init_score (numpy median):  3.0
    CPU init_score:             3.0 (correct)
    CUDA init_score (before):   3.5 (biased toward upper)
    CUDA init_score (after):    3.0 (correct)

This fix mirrors PR lightgbm-org#6 in PercentileDevice and uses the same Type-7 interpolated-quantile formula:

    float_pos = (1 - alpha) * (len - 1)
    pos = floor(float_pos) + 1
    bias = float_pos - (pos - 1)

Parity-sweep impact:

    reg_l1 max|Δ|:       0.25 -> 0.000e+00
    reg_quantile max|Δ|: 0.54 -> 0.000e+00

The weighted branch of PercentileGlobalKernel uses different conventions and is not touched by this PR. There appears to be an unrelated bug in the CPU `WeightedPercentileFun` macro (off-by-one in which cdf delta is used in the interpolation), but that affects only non-uniform-weight workloads and is out of scope here: the Python wrapper drops uniform weights, so this PR's unweighted-formula fix already covers the common path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
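To make the reproducer numbers concrete, here is a minimal pure-Python sketch of the position formula on a descending-sorted array. It is not the kernel code. The function names, the interpolation step `v1 - (v1 - v2) * bias`, the boundary clamps, and the pre-fix `pos = floor(float_pos)` convention are assumptions chosen so that the sketch reproduces the 3.5 and 3.0 values reported above.

```python
import math


def percentile_fixed(values, alpha):
    """Sketch of the corrected formula: position computed against (len - 1)."""
    desc = sorted(values, reverse=True)          # descending-sort layout
    n = len(desc)
    float_pos = (1.0 - alpha) * (n - 1)
    pos = math.floor(float_pos) + 1              # 1-based index of the upper neighbour
    bias = float_pos - (pos - 1)                 # fractional part used for interpolation
    if pos >= n:                                 # assumed clamp: past the last element
        return float(desc[n - 1])
    v1, v2 = desc[pos - 1], desc[pos]
    return v1 - (v1 - v2) * bias                 # assumed linear interpolation


def percentile_old(values, alpha):
    """Sketch of the pre-fix behaviour: position computed against len (assumed form)."""
    desc = sorted(values, reverse=True)
    n = len(desc)
    float_pos = (1.0 - alpha) * n
    pos = math.floor(float_pos)
    if pos < 1:                                  # assumed clamp: before the first element
        return float(desc[0])
    if pos >= n:
        return float(desc[n - 1])
    bias = float_pos - pos
    v1, v2 = desc[pos - 1], desc[pos]
    return v1 - (v1 - v2) * bias


y = [1, 2, 3, 4, 5]
print(percentile_old(y, 0.5))    # 3.5 (biased toward the upper-middle element)
print(percentile_fixed(y, 0.5))  # 3.0 (matches the median)
```

With `len = 5` and `alpha = 0.5`, the old position is 2.5, which interpolates halfway between 4 and 3. The corrected position is exactly 2.0, which lands on the middle element.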
Regression coverage for the prior commit. 24 parametrized cases across (objective, alpha, n) verifying the init score logged by 'Start training from score' matches between CPU and CUDA at FP epsilon. Without the fix, regression_l1 (alpha=0.5) and quantile failed for small n where the formula bias landed on a different element. Gated on LIGHTGBM_TEST_CUDA=1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
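One way a test could pull the init score out of captured training output is sketched below. This is not the code in the test file. The helper name `parse_init_score`, the regular expression, and the two sample log strings are hypothetical and written by hand, and only the phrase 'Start training from score' comes from the commit message above.

```python
import re

# Matches the number that follows the phrase named in the commit message.
_INIT_SCORE_RE = re.compile(
    r"Start training from score\s+(-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)"
)


def parse_init_score(log_text):
    """Return the first logged init score as a float (hypothetical helper)."""
    match = _INIT_SCORE_RE.search(log_text)
    if match is None:
        raise ValueError("no 'Start training from score' line found")
    return float(match.group(1))


# Hand-written sample strings for illustration, not captured output.
cpu_log = "[LightGBM] [Info] Start training from score 3.000000"
cuda_log = "[LightGBM] [Info] Start training from score 3.000000"

cpu_score = parse_init_score(cpu_log)
cuda_score = parse_init_score(cuda_log)
print(abs(cpu_score - cuda_score) <= 1e-6)
```

A value parsed from a log line carries only as many digits as the logger prints, so a comparison like this is bounded by that printed precision.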
Owner
Thanks Max, and Claude Code. Same formula bug as PR #6 in a different function — the init-score Two small things before merge:
Aligns with the existing convention used by test_engine.py's CUDA-only tests. Addresses Felix's review note (same change going on lightgbm-org#6/lightgbm-org#7/lightgbm-org#8/lightgbm-org#10).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Same bug as PR #6 fixed for the in-block `PercentileDevice`, but in the global-memory kernel `PercentileGlobalKernel` used for init-score computation by `objective=regression_l1` and `objective=quantile`.

The unweighted branch computed the percentile position against `len` instead of `len - 1`, biasing `alpha=0.5` toward the upper-middle element on the descending-sort layout used here.

Reproducer

The Python wrapper drops uniform weights (`np.all(weight == 1) → weight = None` in `python-package/lightgbm/basic.py:3115-3122`), so the unweighted path is what actually runs for the common `weight=None` case:

    y = [1, 2, 3, 4, 5]
    init_score (numpy median):  3.0
    CPU init_score:             3.0 (correct)
    CUDA init_score (before):   3.5 (biased toward upper)
    CUDA init_score (after):    3.0 (correct)

Fix

Mirrors PR #6 in `PercentileDevice`. The formula now matches CPU's `PercentileFun` and `numpy.median` / R's `quantile()` Type-7 default:

    float_pos = (1 - alpha) * (len - 1)
    pos = floor(float_pos) + 1
    bias = float_pos - (pos - 1)

Parity-sweep impact

Two cases that have been divergent across the entire investigation are now bit-perfect:

    reg_l1        max|Δ| raw_score   0.25 -> 0.000e+00
    reg_quantile  max|Δ| raw_score   0.54 -> 0.000e+00

The remaining divergent cases in the parity sweep (`reg_bagging`, `multi_dense`) are FP-precision drift in parallel reductions; see lightgbm-org#6055 for the maintainer's framing of those as expected.

Scope

The weighted branch of `PercentileGlobalKernel` is not touched here. It uses different sort/threshold conventions and hits an unrelated off-by-one bug shared with CPU's `WeightedPercentileFun` (different bug shape on each side; both are wrong vs Type-7 weighted median). The Python wrapper drops uniform weights, so the unweighted branch is what users actually exercise; the weighted-branch fix can be a separate PR.

Test plan

- `tests/python_package_test/test_dual.py` (gated on `LIGHTGBM_TEST_CUDA=1`) covering `(objective, alpha, n)` ∈ {regression_l1, quantile} × {0.3, 0.5, 0.7} × {5, 7, 10, 11, 100, 500}: all pass with this fix.
- `reg_l1` and `reg_quantile` cases of CPU/CUDA parity sweep now match at FP epsilon.

🤖 Generated with Claude Code
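The Type-7 claim can be sanity-checked without a GPU. The sketch below is not the test in `test_dual.py`. It compares a descending-layout evaluation of the `(len - 1)` formula against a textbook ascending Type-7 quantile over the same `alpha` and `n` values listed in the test plan. The function names, the interpolation and clamp details, and the seeded random data are assumptions made for illustration.

```python
import math
import random


def percentile_desc(values, alpha):
    """Descending-layout form using the (len - 1) position (assumed interpolation)."""
    desc = sorted(values, reverse=True)
    n = len(desc)
    float_pos = (1.0 - alpha) * (n - 1)
    pos = math.floor(float_pos) + 1
    bias = float_pos - (pos - 1)
    if pos >= n:                                  # assumed clamp at the smallest element
        return float(desc[-1])
    return desc[pos - 1] - (desc[pos - 1] - desc[pos]) * bias


def type7_ascending(values, alpha):
    """Textbook Type-7 quantile: linear interpolation at h = alpha * (n - 1)."""
    asc = sorted(values)
    h = alpha * (len(asc) - 1)
    lo = math.floor(h)
    hi = min(lo + 1, len(asc) - 1)
    return asc[lo] + (asc[hi] - asc[lo]) * (h - lo)


rng = random.Random(0)                            # fixed seed so the sweep is repeatable
max_abs_diff = 0.0
for n in (5, 7, 10, 11, 100, 500):
    y = [rng.uniform(-10.0, 10.0) for _ in range(n)]
    for alpha in (0.3, 0.5, 0.7):
        diff = abs(percentile_desc(y, alpha) - type7_ascending(y, alpha))
        max_abs_diff = max(max_abs_diff, diff)

print(max_abs_diff)                               # expected to be at rounding-error scale
```

Counting `(1 - alpha) * (len - 1)` positions down from the largest element reaches the same point as counting `alpha * (len - 1)` positions up from the smallest, which is why the two forms should agree up to floating-point rounding.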