[cuda] fix unweighted percentile formula (L1 & quantile leaf renewal) by maxwbuckley · Pull Request #6 · BelixRogner/ExaBoost

maxwbuckley · 2026-05-09T20:28:43Z

Summary

PercentileDevice in include/LightGBM/cuda/cuda_algorithms.hpp — used by the CUDA leaf-renewal kernels for regression_l1 and quantile objectives — computed the percentile position against len instead of len - 1, and indexed it 0-based instead of CPU's 1-based-with-+1 convention.

For alpha = 0.5 (median), this means CUDA returned:

`len`	CPU returns (correct)	CUDA before fix
even (e.g. 4)	avg(sorted[1], sorted[2]) — actual median	sorted[1] — biased toward upper
odd (e.g. 5)	sorted[2] — actual median	avg(sorted[1], sorted[2]) — biased toward upper

(sorted[] is descending here because PercentileDevice is called with ASCENDING=false.)

CPU PercentileFun (src/objective/regression_objective.hpp:28-29):

const double float_pos = static_cast<double>(cnt_data - 1) * (1.0 - alpha);
const data_size_t pos = static_cast<data_size_t>(float_pos) + 1;
...
const double bias = float_pos - (pos - 1);

This matches the standard Type-7 interpolated quantile — the same formula numpy.median and R's quantile() use by default.

Change

-    const double float_pos = (1.0f - alpha) * len;
-    const INDEX_T pos = static_cast<INDEX_T>(float_pos);
+    const double float_pos = (1.0 - alpha) * static_cast<double>(len - 1);
+    const INDEX_T pos = static_cast<INDEX_T>(float_pos) + 1;
     ...
-      const double bias = float_pos - pos;
+      const double bias = float_pos - static_cast<double>(pos - 1);

Reproducer (before fix)

[cuda]  num_leaves=7  (regression_l1)
  leaf 0: n=  7  np.median=-0.032  cuda=+0.529  delta=+0.561
  leaf 1: n= 72  np.median=-2.110  cuda=-1.763  delta=+0.347
  leaf 2: n= 23  np.median=+1.923  cuda=+2.221  delta=+0.298
  leaf 3: n=  9  np.median=-1.104  cuda=-0.791  delta=+0.313
  leaf 4: n= 10  np.median=+1.156  cuda=+1.703  delta=+0.547
  ...

(CPU's column matched np.median to FP epsilon throughout.)

Verification (after fix)

With learning_rate=1.0 to strip away shrinkage:

[cpu]   leaf 0: n=  7  np.median=-0.41937  raw=-0.41937  delta=-0.000000
[cuda]  leaf 0: n=  7  np.median=-0.41937  raw=-0.41937  delta=-0.000000
... every leaf matches ...

[cpu]   leaf 0: n=  6  np.quantile(0.7)=+0.806  raw=+0.806  delta=-0.000000
[cuda]  leaf 0: n=  6  np.quantile(0.7)=+0.806  raw=+0.806  delta=-0.000000
... every leaf matches ...

Parity-benchmark impact (scratch/cpu_cuda_parity.py from PR #4):

	before	after
`reg_l1` max\|Δ\| raw_score	0.40	0.25
`reg_quantile` max\|Δ\| raw_score	0.58	0.54

For quantile, the test sweep additionally now produces identical leaf populations on both devices for alpha=0.7. For L1 there is a residual structural divergence (CPU and CUDA pick slightly different splits even on round 1 with identical gradients) that will be investigated separately — this PR fixes only the leaf-value calculation.

Scope notes

The weighted-percentile path in PercentileDevice uses different conventions on CPU vs CUDA (ascending vs descending sort, alpha vs 1-alpha threshold) and is left untouched. None of the L1/quantile defaults exercise it.
PercentileDevice is only called from cuda_regression_objective.cu (L1 and quantile renewal kernels). No other call sites in the tree learner depend on the old behavior.

Test plan

scratch/probe_l1_median.py — every CUDA leaf now matches np.median to FP epsilon
scratch/probe_quantile.py — every CUDA leaf now matches np.quantile(α=0.7) to FP epsilon
Parity sweep — reg_l1 and reg_quantile both improved
Weighted L1 / weighted quantile (unchanged path) — not yet tested

🤖 Generated with Claude Code

The CUDA PercentileDevice (used by L1 and quantile leaf-value renewal) computed the percentile position against `len` instead of `len - 1`, and indexed it as 0-based instead of CPU's 1-based-with-+1 offset. For alpha=0.5 (median), this returned the upper-middle element on even-length arrays and the average of the upper-middle and median on odd-length arrays - i.e., systematically biased upward in the descending-sort convention that PercentileDevice uses. CPU PercentileFun (src/objective/regression_objective.hpp:28-29): const double float_pos = static_cast<double>(cnt_data - 1) * (1.0 - alpha); const data_size_t pos = static_cast<data_size_t>(float_pos) + 1; ... const double bias = float_pos - (pos - 1); This matches the standard Type-7 interpolated quantile (numpy.median, R's quantile() default). Verified against numpy: reg_l1 leaf-value max delta vs np.median: 0.5 -> 0.0 (after fix) reg_quantile leaf-value max delta vs np.quantile: 0.6 -> 0.0 (after fix) After this fix every leaf in the parity benchmark reproducer matches its numpy counterpart to FP epsilon. There is a residual structural divergence on reg_l1 (CPU and CUDA disagree on a few splits) which will be investigated separately - this PR fixes only the leaf-value calculation. The weighted-percentile path uses different conventions on CPU and CUDA (ascending vs descending sort, alpha vs 1-alpha threshold) and is left untouched here. None of our parity tests exercise it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Regression coverage for the unweighted PercentileDevice formula fix (prior commit). Three parametrized tests, all gated on LIGHTGBM_TEST_CUDA=1 so they only run on a CUDA-enabled build: - test_cuda_l1_leaf_renewal_matches_numpy_median: across 3 random seeds, asserts every leaf value on both CPU and CUDA matches numpy.median over the leaf's data points. - test_cuda_quantile_leaf_renewal_matches_numpy_quantile: same shape but parametrized over alpha = 0.1, 0.25, 0.5, 0.7, 0.9 to cover every even/odd leaf-size combination of the percentile bias. - test_cuda_l1_median_handles_small_even_and_odd_leaves: targets the exact failure mode of the old formula (even-length leaves returned sorted[1] instead of avg(sorted[1], sorted[2])) by sweeping leaves of size 2, 3, 4, 5, 8, 9. Tolerance is 1e-6 - well below the ~0.3 bias the old formula produced, but loose enough to absorb label_t float32 quantization inside the renewal kernel. Verified: with the prior commit reverted, 13 of 14 cases fail with bias > 1e-6; with the fix applied, all 14 pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

BelixRogner · 2026-05-10T09:39:10Z

Thanks Max, and Claude Code. The fix is precise — formula matches CPU's PercentileFun and numpy.median / R's Type-7 default. The three-tier test coverage (sweep over seeds, sweep over alphas, then explicit small even/odd cases that target the exact failure mode) is exactly the right shape for a numerical regression.

One small ask before merge: the new tests use os.environ.get("LIGHTGBM_TEST_CUDA", "0") != "1" for the CUDA gate, but the rest of the suite (including the test you added for PR #2 that's already merged) uses getenv("TASK", "") != "cuda". Could you align to the existing TASK=cuda convention so we don't need two env vars to run the full CUDA test suite? Same applies to PRs #7, #8, #9, #10 — easiest to do all at once.

Merging once that's in.

Aligns with the existing convention used by test_engine.py's CUDA-only tests. Addresses Felix's review note (same change going on lightgbm-org#7/lightgbm-org#8/lightgbm-org#9/lightgbm-org#10). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

maxwbuckley · 2026-05-10T09:58:56Z

Switched to TASK=cuda to match the test_engine.py convention — pushed as 0cb8feb6. Same change applied on #7, #8, #9, #10.

Same bug as PR #6 fixed for the in-block PercentileDevice, but in the global-memory kernel used for init-score computation. The unweighted branch of PercentileGlobalKernel computed the percentile position against `len` instead of `len - 1`, biasing alpha=0.5 toward the upper-middle element on descending-sort layouts. Reproducer (with the Python wrapper's optimization that drops uniform weights, this is the path actually executed by `objective=regression_l1` or `quantile` when sample weights aren't supplied or are all 1): y = [1, 2, 3, 4, 5] init_score (numpy median): 3.0 CPU init_score: 3.0 (correct) CUDA init_score (before): 3.5 (biased toward upper) CUDA init_score (after): 3.0 (correct) This fix mirrors PR #6 in PercentileDevice and uses the same Type-7 interpolated-quantile formula: float_pos = (1 - alpha) * (len - 1) pos = floor(float_pos) + 1 bias = float_pos - (pos - 1) Parity-sweep impact: reg_l1 max|Δ|: 0.25 -> 0.000e+00 reg_quantile max|Δ|: 0.54 -> 0.000e+00 The weighted branch of PercentileGlobalKernel uses different conventions and is not touched by this PR. There appears to be an unrelated bug in the CPU `WeightedPercentileFun` macro (off-by-one in which cdf delta is used in the interpolation), but that affects only non-uniform-weight workloads and is out of scope here - the Python wrapper drops uniform weights, so this PR's unweighted-formula fix already covers the common path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Aligns with the existing convention used by test_engine.py's CUDA-only tests. Addresses Felix's review note (same change going on #6/#7/#8/#10). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Aligns with the existing convention used by test_engine.py's CUDA-only tests (getenv("TASK", "") != "cuda"). Addresses Felix's review note on PR #8 (and the matching note on #6, #7, #9, #10). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Aligns with the existing convention used by test_engine.py's CUDA-only tests. Addresses Felix's review note (same change going on #6/#8/#9/#10). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

maxwbuckley and others added 2 commits May 9, 2026 22:49

maxwbuckley force-pushed the cuda/percentile-formula branch from 55d140c to 621e9b4 Compare May 9, 2026 20:49

This was referenced May 10, 2026

[cuda] fix illegal-memory-access crash in weighted L1 / quantile training #8

Merged

[cuda] fix unweighted percentile formula in PercentileGlobalKernel (init scores) #9

Merged

This was referenced May 10, 2026

[cuda] enforce max_depth on CUDA tree learner #7

Merged

[cuda] make BitonicArgSort_1024 / _2048 stable on tied values #10

Open

[ci] disable cpplint header-guard check after repo rename to ExaBoost #11

Merged

BelixRogner and others added 3 commits May 10, 2026 14:33

Merge branch 'master' into cuda/percentile-formula

959e698

lint: ruff format

c8751fa

Merge branch 'master' into cuda/percentile-formula

5bbf812

BelixRogner merged commit 2ce3ffe into BelixRogner:master May 18, 2026
56 of 57 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[cuda] fix unweighted percentile formula (L1 & quantile leaf renewal)#6

[cuda] fix unweighted percentile formula (L1 & quantile leaf renewal)#6
BelixRogner merged 6 commits into
BelixRogner:masterfrom
maxwbuckley:cuda/percentile-formula

maxwbuckley commented May 9, 2026

Uh oh!

BelixRogner commented May 10, 2026

Uh oh!

maxwbuckley commented May 10, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

maxwbuckley commented May 9, 2026

Summary

Change

Reproducer (before fix)

Verification (after fix)

Scope notes

Test plan

Uh oh!

BelixRogner commented May 10, 2026

Uh oh!

maxwbuckley commented May 10, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants