Skip to content

[cuda] fix unweighted percentile formula (L1 & quantile leaf renewal)#6

Merged
BelixRogner merged 6 commits into
BelixRogner:masterfrom
maxwbuckley:cuda/percentile-formula
May 18, 2026
Merged

[cuda] fix unweighted percentile formula (L1 & quantile leaf renewal)#6
BelixRogner merged 6 commits into
BelixRogner:masterfrom
maxwbuckley:cuda/percentile-formula

Conversation

@maxwbuckley
Copy link
Copy Markdown

Summary

PercentileDevice in include/LightGBM/cuda/cuda_algorithms.hpp — used by the CUDA leaf-renewal kernels for regression_l1 and quantile objectives — computed the percentile position against len instead of len - 1, and indexed it 0-based instead of CPU's 1-based-with-+1 convention.

For alpha = 0.5 (median), this means CUDA returned:

len CPU returns (correct) CUDA before fix
even (e.g. 4) avg(sorted[1], sorted[2]) — actual median sorted[1] — biased toward upper
odd (e.g. 5) sorted[2] — actual median avg(sorted[1], sorted[2]) — biased toward upper

(sorted[] is descending here because PercentileDevice is called with ASCENDING=false.)

CPU PercentileFun (src/objective/regression_objective.hpp:28-29):

const double float_pos = static_cast<double>(cnt_data - 1) * (1.0 - alpha);
const data_size_t pos = static_cast<data_size_t>(float_pos) + 1;
...
const double bias = float_pos - (pos - 1);

This matches the standard Type-7 interpolated quantile — the same formula numpy.median and R's quantile() use by default.

Change

-    const double float_pos = (1.0f - alpha) * len;
-    const INDEX_T pos = static_cast<INDEX_T>(float_pos);
+    const double float_pos = (1.0 - alpha) * static_cast<double>(len - 1);
+    const INDEX_T pos = static_cast<INDEX_T>(float_pos) + 1;
     ...
-      const double bias = float_pos - pos;
+      const double bias = float_pos - static_cast<double>(pos - 1);

Reproducer (before fix)

[cuda]  num_leaves=7  (regression_l1)
  leaf 0: n=  7  np.median=-0.032  cuda=+0.529  delta=+0.561
  leaf 1: n= 72  np.median=-2.110  cuda=-1.763  delta=+0.347
  leaf 2: n= 23  np.median=+1.923  cuda=+2.221  delta=+0.298
  leaf 3: n=  9  np.median=-1.104  cuda=-0.791  delta=+0.313
  leaf 4: n= 10  np.median=+1.156  cuda=+1.703  delta=+0.547
  ...

(CPU's column matched np.median to FP epsilon throughout.)

Verification (after fix)

With learning_rate=1.0 to strip away shrinkage:

[cpu]   leaf 0: n=  7  np.median=-0.41937  raw=-0.41937  delta=-0.000000
[cuda]  leaf 0: n=  7  np.median=-0.41937  raw=-0.41937  delta=-0.000000
... every leaf matches ...

[cpu]   leaf 0: n=  6  np.quantile(0.7)=+0.806  raw=+0.806  delta=-0.000000
[cuda]  leaf 0: n=  6  np.quantile(0.7)=+0.806  raw=+0.806  delta=-0.000000
... every leaf matches ...

Parity-benchmark impact (scratch/cpu_cuda_parity.py from PR #4):

before after
reg_l1 max|Δ| raw_score 0.40 0.25
reg_quantile max|Δ| raw_score 0.58 0.54

For quantile, the test sweep additionally now produces identical leaf populations on both devices for alpha=0.7. For L1 there is a residual structural divergence (CPU and CUDA pick slightly different splits even on round 1 with identical gradients) that will be investigated separately — this PR fixes only the leaf-value calculation.

Scope notes

  • The weighted-percentile path in PercentileDevice uses different conventions on CPU vs CUDA (ascending vs descending sort, alpha vs 1-alpha threshold) and is left untouched. None of the L1/quantile defaults exercise it.
  • PercentileDevice is only called from cuda_regression_objective.cu (L1 and quantile renewal kernels). No other call sites in the tree learner depend on the old behavior.

Test plan

  • scratch/probe_l1_median.py — every CUDA leaf now matches np.median to FP epsilon
  • scratch/probe_quantile.py — every CUDA leaf now matches np.quantile(α=0.7) to FP epsilon
  • Parity sweep — reg_l1 and reg_quantile both improved
  • Weighted L1 / weighted quantile (unchanged path) — not yet tested

🤖 Generated with Claude Code

maxwbuckley and others added 2 commits May 9, 2026 22:49
The CUDA PercentileDevice (used by L1 and quantile leaf-value renewal)
computed the percentile position against `len` instead of `len - 1`,
and indexed it as 0-based instead of CPU's 1-based-with-+1 offset.
For alpha=0.5 (median), this returned the upper-middle element on
even-length arrays and the average of the upper-middle and median on
odd-length arrays - i.e., systematically biased upward in the
descending-sort convention that PercentileDevice uses.

CPU PercentileFun (src/objective/regression_objective.hpp:28-29):

    const double float_pos = static_cast<double>(cnt_data - 1) * (1.0 - alpha);
    const data_size_t pos = static_cast<data_size_t>(float_pos) + 1;
    ...
    const double bias = float_pos - (pos - 1);

This matches the standard Type-7 interpolated quantile (numpy.median,
R's quantile() default).

Verified against numpy:
  reg_l1     leaf-value max delta vs np.median:    0.5 -> 0.0 (after fix)
  reg_quantile leaf-value max delta vs np.quantile: 0.6 -> 0.0 (after fix)

After this fix every leaf in the parity benchmark reproducer matches
its numpy counterpart to FP epsilon. There is a residual structural
divergence on reg_l1 (CPU and CUDA disagree on a few splits) which
will be investigated separately - this PR fixes only the leaf-value
calculation.

The weighted-percentile path uses different conventions on CPU and
CUDA (ascending vs descending sort, alpha vs 1-alpha threshold) and
is left untouched here. None of our parity tests exercise it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Regression coverage for the unweighted PercentileDevice formula fix
(prior commit). Three parametrized tests, all gated on
LIGHTGBM_TEST_CUDA=1 so they only run on a CUDA-enabled build:

- test_cuda_l1_leaf_renewal_matches_numpy_median: across 3 random
  seeds, asserts every leaf value on both CPU and CUDA matches
  numpy.median over the leaf's data points.
- test_cuda_quantile_leaf_renewal_matches_numpy_quantile: same shape
  but parametrized over alpha = 0.1, 0.25, 0.5, 0.7, 0.9 to cover
  every even/odd leaf-size combination of the percentile bias.
- test_cuda_l1_median_handles_small_even_and_odd_leaves: targets the
  exact failure mode of the old formula (even-length leaves returned
  sorted[1] instead of avg(sorted[1], sorted[2])) by sweeping leaves
  of size 2, 3, 4, 5, 8, 9.

Tolerance is 1e-6 - well below the ~0.3 bias the old formula
produced, but loose enough to absorb label_t float32 quantization
inside the renewal kernel.

Verified: with the prior commit reverted, 13 of 14 cases fail with
bias > 1e-6; with the fix applied, all 14 pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@BelixRogner
Copy link
Copy Markdown
Owner

Thanks Max, and Claude Code. The fix is precise — formula matches CPU's PercentileFun and numpy.median / R's Type-7 default. The three-tier test coverage (sweep over seeds, sweep over alphas, then explicit small even/odd cases that target the exact failure mode) is exactly the right shape for a numerical regression.

One small ask before merge: the new tests use os.environ.get("LIGHTGBM_TEST_CUDA", "0") != "1" for the CUDA gate, but the rest of the suite (including the test you added for PR #2 that's already merged) uses getenv("TASK", "") != "cuda". Could you align to the existing TASK=cuda convention so we don't need two env vars to run the full CUDA test suite? Same applies to PRs #7, #8, #9, #10 — easiest to do all at once.

Merging once that's in.

Aligns with the existing convention used by test_engine.py's CUDA-only
tests. Addresses Felix's review note (same change going on lightgbm-org#7/lightgbm-org#8/lightgbm-org#9/lightgbm-org#10).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@maxwbuckley
Copy link
Copy Markdown
Author

Switched to TASK=cuda to match the test_engine.py convention — pushed as 0cb8feb6. Same change applied on #7, #8, #9, #10.

@BelixRogner BelixRogner merged commit 2ce3ffe into BelixRogner:master May 18, 2026
56 of 57 checks passed
BelixRogner pushed a commit that referenced this pull request May 18, 2026
Same bug as PR #6 fixed for the in-block PercentileDevice, but in the
global-memory kernel used for init-score computation. The unweighted
branch of PercentileGlobalKernel computed the percentile position
against `len` instead of `len - 1`, biasing alpha=0.5 toward the
upper-middle element on descending-sort layouts.

Reproducer (with the Python wrapper's optimization that drops uniform
weights, this is the path actually executed by `objective=regression_l1`
or `quantile` when sample weights aren't supplied or are all 1):

  y = [1, 2, 3, 4, 5]
  init_score (numpy median): 3.0
  CPU init_score:            3.0  (correct)
  CUDA init_score (before):  3.5  (biased toward upper)
  CUDA init_score (after):   3.0  (correct)

This fix mirrors PR #6 in PercentileDevice and uses the same Type-7
interpolated-quantile formula:

  float_pos = (1 - alpha) * (len - 1)
  pos       = floor(float_pos) + 1
  bias      = float_pos - (pos - 1)

Parity-sweep impact:

  reg_l1     max|Δ|: 0.25  -> 0.000e+00
  reg_quantile max|Δ|: 0.54  -> 0.000e+00

The weighted branch of PercentileGlobalKernel uses different
conventions and is not touched by this PR. There appears to be an
unrelated bug in the CPU `WeightedPercentileFun` macro (off-by-one in
which cdf delta is used in the interpolation), but that affects only
non-uniform-weight workloads and is out of scope here - the Python
wrapper drops uniform weights, so this PR's unweighted-formula fix
already covers the common path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
BelixRogner pushed a commit that referenced this pull request May 18, 2026
Aligns with the existing convention used by test_engine.py's CUDA-only
tests. Addresses Felix's review note (same change going on #6/#7/#8/#10).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
BelixRogner pushed a commit that referenced this pull request May 19, 2026
Aligns with the existing convention used by test_engine.py's CUDA-only
tests (getenv("TASK", "") != "cuda"). Addresses Felix's review note on
PR #8 (and the matching note on #6, #7, #9, #10).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
BelixRogner pushed a commit that referenced this pull request May 19, 2026
Aligns with the existing convention used by test_engine.py's CUDA-only
tests. Addresses Felix's review note (same change going on #6/#8/#9/#10).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants