Skip to content

[cuda] fix unweighted percentile formula in PercentileGlobalKernel (init scores)#9

Merged
BelixRogner merged 7 commits into
BelixRogner:masterfrom
maxwbuckley:cuda/fix-global-percentile-formula
May 18, 2026
Merged

[cuda] fix unweighted percentile formula in PercentileGlobalKernel (init scores)#9
BelixRogner merged 7 commits into
BelixRogner:masterfrom
maxwbuckley:cuda/fix-global-percentile-formula

Conversation

@maxwbuckley
Copy link
Copy Markdown

Summary

Same bug as PR #6 fixed for the in-block PercentileDevice, but in the global-memory kernel PercentileGlobalKernel used for init-score computation by objective=regression_l1 and objective=quantile.

The unweighted branch computed the percentile position against len instead of len - 1, biasing alpha=0.5 toward the upper-middle element on the descending-sort layout used here.

Reproducer

The Python wrapper drops uniform weights (np.all(weight == 1) → weight = None in python-package/lightgbm/basic.py:3115-3122), so the unweighted path is what actually runs for the common weight=None case:

y = [1, 2, 3, 4, 5]
init_score (numpy median):  3.0
CPU init_score:             3.0  (correct)
CUDA init_score (before):   3.5  (biased toward upper-middle)
CUDA init_score (after):    3.0  (correct)

Fix

Mirrors PR #6 in PercentileDevice. The formula now matches CPU's PercentileFun and numpy.median / R's quantile() Type-7 default:

-    const double float_pos = (1.0f - alpha) * len;
-    const INDEX_T pos = static_cast<INDEX_T>(float_pos);
+    const double float_pos = (1.0 - alpha) * static_cast<double>(len - 1);
+    const INDEX_T pos = static_cast<INDEX_T>(float_pos) + 1;
     ...
-      const double bias = float_pos - static_cast<double>(pos);
+      const double bias = float_pos - static_cast<double>(pos - 1);

Parity-sweep impact

Two cases that have been divergent across the entire investigation are now bit-perfect:

case before after
reg_l1 max|Δ| raw_score 0.25 0.000e+00
reg_quantile max|Δ| raw_score 0.54 0.000e+00

The remaining divergent cases in the parity sweep (reg_bagging, multi_dense) are FP-precision drift in parallel reductions — see lightgbm-org#6055 for the maintainer's framing of those as expected.

Scope

  • The weighted branch of PercentileGlobalKernel is not touched here. It uses different sort/threshold conventions and hits an unrelated off-by-one bug shared with CPU's WeightedPercentileFun (different bug shape on each side; both are wrong vs Type-7 weighted median). The Python wrapper drops uniform weights, so the unweighted branch is what users actually exercise; the weighted-branch fix can be a separate PR.

Test plan

  • Added 24 parametrized regression tests in tests/python_package_test/test_dual.py (gated on LIGHTGBM_TEST_CUDA=1) covering (objective, alpha, n) ∈ {regression_l1, quantile} × {0.3, 0.5, 0.7} × {5, 7, 10, 11, 100, 500} — all pass with this fix.
  • Verified reg_l1 and reg_quantile cases of CPU/CUDA parity sweep now match at FP epsilon.

🤖 Generated with Claude Code

maxwbuckley and others added 2 commits May 10, 2026 04:39
Same bug as PR lightgbm-org#6 fixed for the in-block PercentileDevice, but in the
global-memory kernel used for init-score computation. The unweighted
branch of PercentileGlobalKernel computed the percentile position
against `len` instead of `len - 1`, biasing alpha=0.5 toward the
upper-middle element on descending-sort layouts.

Reproducer (with the Python wrapper's optimization that drops uniform
weights, this is the path actually executed by `objective=regression_l1`
or `quantile` when sample weights aren't supplied or are all 1):

  y = [1, 2, 3, 4, 5]
  init_score (numpy median): 3.0
  CPU init_score:            3.0  (correct)
  CUDA init_score (before):  3.5  (biased toward upper)
  CUDA init_score (after):   3.0  (correct)

This fix mirrors PR lightgbm-org#6 in PercentileDevice and uses the same Type-7
interpolated-quantile formula:

  float_pos = (1 - alpha) * (len - 1)
  pos       = floor(float_pos) + 1
  bias      = float_pos - (pos - 1)

Parity-sweep impact:

  reg_l1     max|Δ|: 0.25  -> 0.000e+00
  reg_quantile max|Δ|: 0.54  -> 0.000e+00

The weighted branch of PercentileGlobalKernel uses different
conventions and is not touched by this PR. There appears to be an
unrelated bug in the CPU `WeightedPercentileFun` macro (off-by-one in
which cdf delta is used in the interpolation), but that affects only
non-uniform-weight workloads and is out of scope here - the Python
wrapper drops uniform weights, so this PR's unweighted-formula fix
already covers the common path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Regression coverage for the prior commit. 24 parametrized cases
across (objective, alpha, n) verifying the init score logged by
'Start training from score' matches between CPU and CUDA at FP epsilon.

Without the fix, regression_l1 (alpha=0.5) and quantile failed for
small n where the formula bias landed on a different element.

Gated on LIGHTGBM_TEST_CUDA=1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@BelixRogner
Copy link
Copy Markdown
Owner

Thanks Max, and Claude Code. Same formula bug as PR #6 in a different function — the init-score PercentileGlobalKernel rather than the per-block leaf renewal PercentileDevice. Fix is identical in shape, which is reassuring. The Python-wrapper note (weight=None → unweighted path because lightgbm/basic.py:3115-3122 drops uniform weights) is useful context for why this is the realistic user-facing case.

Two small things before merge:

  1. Env var convention — align LIGHTGBM_TEST_CUDA to the existing TASK=cuda pattern (same as the other PRs).
  2. Stdout scraping for the init score_get_init_score reads "Start training from score" out of captured stdout. Works, but it couples the test to that exact log-line text. Optional follow-up: read the resulting 1-tree booster's leaf via dump_model() instead — robust to log-format changes. Not blocking; happy to merge as-is if you'd rather keep this style.

Aligns with the existing convention used by test_engine.py's CUDA-only
tests. Addresses Felix's review note (same change going on lightgbm-org#6/lightgbm-org#7/lightgbm-org#8/lightgbm-org#10).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@maxwbuckley
Copy link
Copy Markdown
Author

Env var aligned to TASK=cuda (55a2e9dd).

Re: stdout-scraping for the init score — keeping the current style for now to match what's working on #6 and to avoid expanding the diff. Happy to fold the dump_model()-based extraction into a follow-up that also covers the same change here and on #6.

BelixRogner pushed a commit that referenced this pull request May 18, 2026
Aligns with the existing convention used by test_engine.py's CUDA-only
tests. Addresses Felix's review note (same change going on #7/#8/#9/#10).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@BelixRogner BelixRogner merged commit 37c28a9 into BelixRogner:master May 18, 2026
56 checks passed
BelixRogner pushed a commit that referenced this pull request May 19, 2026
Aligns with the existing convention used by test_engine.py's CUDA-only
tests (getenv("TASK", "") != "cuda"). Addresses Felix's review note on
PR #8 (and the matching note on #6, #7, #9, #10).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
BelixRogner pushed a commit that referenced this pull request May 19, 2026
Aligns with the existing convention used by test_engine.py's CUDA-only
tests. Addresses Felix's review note (same change going on #6/#8/#9/#10).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants