[cuda] fix unweighted percentile formula in PercentileGlobalKernel (init scores) by maxwbuckley · Pull Request #9 · BelixRogner/ExaBoost

maxwbuckley · 2026-05-10T02:40:56Z

Summary

Same bug as PR #6 fixed for the in-block PercentileDevice, but in the global-memory kernel PercentileGlobalKernel used for init-score computation by objective=regression_l1 and objective=quantile.

The unweighted branch computed the percentile position against len instead of len - 1, biasing alpha=0.5 toward the upper-middle element on the descending-sort layout used here.

Reproducer

The Python wrapper drops uniform weights (np.all(weight == 1) → weight = None in python-package/lightgbm/basic.py:3115-3122), so the unweighted path is what actually runs for the common weight=None case:

y = [1, 2, 3, 4, 5]
init_score (numpy median):  3.0
CPU init_score:             3.0  (correct)
CUDA init_score (before):   3.5  (biased toward upper-middle)
CUDA init_score (after):    3.0  (correct)

Fix

Mirrors PR #6 in PercentileDevice. The formula now matches CPU's PercentileFun and numpy.median / R's quantile() Type-7 default:

-    const double float_pos = (1.0f - alpha) * len;
-    const INDEX_T pos = static_cast<INDEX_T>(float_pos);
+    const double float_pos = (1.0 - alpha) * static_cast<double>(len - 1);
+    const INDEX_T pos = static_cast<INDEX_T>(float_pos) + 1;
     ...
-      const double bias = float_pos - static_cast<double>(pos);
+      const double bias = float_pos - static_cast<double>(pos - 1);

Parity-sweep impact

Two cases that have been divergent across the entire investigation are now bit-perfect:

case	before	after
`reg_l1` max\|Δ\| raw_score	0.25	0.000e+00
`reg_quantile` max\|Δ\| raw_score	0.54	0.000e+00

The remaining divergent cases in the parity sweep (reg_bagging, multi_dense) are FP-precision drift in parallel reductions — see lightgbm-org#6055 for the maintainer's framing of those as expected.

Scope

The weighted branch of PercentileGlobalKernel is not touched here. It uses different sort/threshold conventions and hits an unrelated off-by-one bug shared with CPU's WeightedPercentileFun (different bug shape on each side; both are wrong vs Type-7 weighted median). The Python wrapper drops uniform weights, so the unweighted branch is what users actually exercise; the weighted-branch fix can be a separate PR.

Test plan

Added 24 parametrized regression tests in tests/python_package_test/test_dual.py (gated on LIGHTGBM_TEST_CUDA=1) covering (objective, alpha, n) ∈ {regression_l1, quantile} × {0.3, 0.5, 0.7} × {5, 7, 10, 11, 100, 500} — all pass with this fix.
Verified reg_l1 and reg_quantile cases of CPU/CUDA parity sweep now match at FP epsilon.

🤖 Generated with Claude Code

Same bug as PR lightgbm-org#6 fixed for the in-block PercentileDevice, but in the global-memory kernel used for init-score computation. The unweighted branch of PercentileGlobalKernel computed the percentile position against `len` instead of `len - 1`, biasing alpha=0.5 toward the upper-middle element on descending-sort layouts. Reproducer (with the Python wrapper's optimization that drops uniform weights, this is the path actually executed by `objective=regression_l1` or `quantile` when sample weights aren't supplied or are all 1): y = [1, 2, 3, 4, 5] init_score (numpy median): 3.0 CPU init_score: 3.0 (correct) CUDA init_score (before): 3.5 (biased toward upper) CUDA init_score (after): 3.0 (correct) This fix mirrors PR lightgbm-org#6 in PercentileDevice and uses the same Type-7 interpolated-quantile formula: float_pos = (1 - alpha) * (len - 1) pos = floor(float_pos) + 1 bias = float_pos - (pos - 1) Parity-sweep impact: reg_l1 max|Δ|: 0.25 -> 0.000e+00 reg_quantile max|Δ|: 0.54 -> 0.000e+00 The weighted branch of PercentileGlobalKernel uses different conventions and is not touched by this PR. There appears to be an unrelated bug in the CPU `WeightedPercentileFun` macro (off-by-one in which cdf delta is used in the interpolation), but that affects only non-uniform-weight workloads and is out of scope here - the Python wrapper drops uniform weights, so this PR's unweighted-formula fix already covers the common path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Regression coverage for the prior commit. 24 parametrized cases across (objective, alpha, n) verifying the init score logged by 'Start training from score' matches between CPU and CUDA at FP epsilon. Without the fix, regression_l1 (alpha=0.5) and quantile failed for small n where the formula bias landed on a different element. Gated on LIGHTGBM_TEST_CUDA=1. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

BelixRogner · 2026-05-10T09:39:21Z

Thanks Max, and Claude Code. Same formula bug as PR #6 in a different function — the init-score PercentileGlobalKernel rather than the per-block leaf renewal PercentileDevice. Fix is identical in shape, which is reassuring. The Python-wrapper note (weight=None → unweighted path because lightgbm/basic.py:3115-3122 drops uniform weights) is useful context for why this is the realistic user-facing case.

Two small things before merge:

Env var convention — align LIGHTGBM_TEST_CUDA to the existing TASK=cuda pattern (same as the other PRs).
Stdout scraping for the init score — _get_init_score reads "Start training from score" out of captured stdout. Works, but it couples the test to that exact log-line text. Optional follow-up: read the resulting 1-tree booster's leaf via dump_model() instead — robust to log-format changes. Not blocking; happy to merge as-is if you'd rather keep this style.

Aligns with the existing convention used by test_engine.py's CUDA-only tests. Addresses Felix's review note (same change going on lightgbm-org#6/lightgbm-org#7/lightgbm-org#8/lightgbm-org#10). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

maxwbuckley · 2026-05-10T09:59:00Z

Env var aligned to TASK=cuda (55a2e9dd).

Re: stdout-scraping for the init score — keeping the current style for now to match what's working on #6 and to avoid expanding the diff. Happy to fold the dump_model()-based extraction into a follow-up that also covers the same change here and on #6.

Aligns with the existing convention used by test_engine.py's CUDA-only tests. Addresses Felix's review note (same change going on #7/#8/#9/#10). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Aligns with the existing convention used by test_engine.py's CUDA-only tests (getenv("TASK", "") != "cuda"). Addresses Felix's review note on PR #8 (and the matching note on #6, #7, #9, #10). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Aligns with the existing convention used by test_engine.py's CUDA-only tests. Addresses Felix's review note (same change going on #6/#8/#9/#10). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

maxwbuckley and others added 2 commits May 10, 2026 04:39

This was referenced May 10, 2026

[cuda] fix unweighted percentile formula (L1 & quantile leaf renewal) #6

Merged

[cuda] enforce max_depth on CUDA tree learner #7

Merged

[cuda] fix illegal-memory-access crash in weighted L1 / quantile training #8

Merged

BelixRogner mentioned this pull request May 10, 2026

[cuda] make BitonicArgSort_1024 / _2048 stable on tied values #10

Open

2 tasks

BelixRogner and others added 4 commits May 10, 2026 14:34

Merge branch 'master' into cuda/fix-global-percentile-formula

198942c

lint: ruff format

0360761

lint: ruff check --fix + hoist stdlib imports + tuple parametrize

0ed1fc1

Merge branch 'master' into cuda/fix-global-percentile-formula

9d5d573

BelixRogner merged commit 37c28a9 into BelixRogner:master May 18, 2026
56 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[cuda] fix unweighted percentile formula in PercentileGlobalKernel (init scores)#9

[cuda] fix unweighted percentile formula in PercentileGlobalKernel (init scores)#9
BelixRogner merged 7 commits into
BelixRogner:masterfrom
maxwbuckley:cuda/fix-global-percentile-formula

maxwbuckley commented May 10, 2026

Uh oh!

BelixRogner commented May 10, 2026

Uh oh!

maxwbuckley commented May 10, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

maxwbuckley commented May 10, 2026

Summary

Reproducer

Fix

Parity-sweep impact

Scope

Test plan

Uh oh!

BelixRogner commented May 10, 2026

Uh oh!

maxwbuckley commented May 10, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants