[cuda] fix unweighted percentile formula (L1 & quantile leaf renewal)#6
Conversation
The CUDA PercentileDevice (used by L1 and quantile leaf-value renewal)
computed the percentile position against `len` instead of `len - 1`,
and indexed it as 0-based instead of CPU's 1-based-with-+1 offset.
For alpha=0.5 (median), this returned the upper-middle element on
even-length arrays and the average of the upper-middle and median on
odd-length arrays - i.e., systematically biased upward in the
descending-sort convention that PercentileDevice uses.
CPU PercentileFun (src/objective/regression_objective.hpp:28-29):
const double float_pos = static_cast<double>(cnt_data - 1) * (1.0 - alpha);
const data_size_t pos = static_cast<data_size_t>(float_pos) + 1;
...
const double bias = float_pos - (pos - 1);
This matches the standard Type-7 interpolated quantile (numpy.median,
R's quantile() default).
Verified against numpy:
reg_l1 leaf-value max delta vs np.median: 0.5 -> 0.0 (after fix)
reg_quantile leaf-value max delta vs np.quantile: 0.6 -> 0.0 (after fix)
After this fix every leaf in the parity benchmark reproducer matches
its numpy counterpart to FP epsilon. There is a residual structural
divergence on reg_l1 (CPU and CUDA disagree on a few splits) which
will be investigated separately - this PR fixes only the leaf-value
calculation.
The weighted-percentile path uses different conventions on CPU and
CUDA (ascending vs descending sort, alpha vs 1-alpha threshold) and
is left untouched here. None of our parity tests exercise it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Regression coverage for the unweighted PercentileDevice formula fix (prior commit). Three parametrized tests, all gated on LIGHTGBM_TEST_CUDA=1 so they only run on a CUDA-enabled build: - test_cuda_l1_leaf_renewal_matches_numpy_median: across 3 random seeds, asserts every leaf value on both CPU and CUDA matches numpy.median over the leaf's data points. - test_cuda_quantile_leaf_renewal_matches_numpy_quantile: same shape but parametrized over alpha = 0.1, 0.25, 0.5, 0.7, 0.9 to cover every even/odd leaf-size combination of the percentile bias. - test_cuda_l1_median_handles_small_even_and_odd_leaves: targets the exact failure mode of the old formula (even-length leaves returned sorted[1] instead of avg(sorted[1], sorted[2])) by sweeping leaves of size 2, 3, 4, 5, 8, 9. Tolerance is 1e-6 - well below the ~0.3 bias the old formula produced, but loose enough to absorb label_t float32 quantization inside the renewal kernel. Verified: with the prior commit reverted, 13 of 14 cases fail with bias > 1e-6; with the fix applied, all 14 pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
55d140c to
621e9b4
Compare
|
Thanks Max, and Claude Code. The fix is precise — formula matches CPU's One small ask before merge: the new tests use Merging once that's in. |
Aligns with the existing convention used by test_engine.py's CUDA-only tests. Addresses Felix's review note (same change going on lightgbm-org#7/lightgbm-org#8/lightgbm-org#9/lightgbm-org#10). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Same bug as PR #6 fixed for the in-block PercentileDevice, but in the global-memory kernel used for init-score computation. The unweighted branch of PercentileGlobalKernel computed the percentile position against `len` instead of `len - 1`, biasing alpha=0.5 toward the upper-middle element on descending-sort layouts. Reproducer (with the Python wrapper's optimization that drops uniform weights, this is the path actually executed by `objective=regression_l1` or `quantile` when sample weights aren't supplied or are all 1): y = [1, 2, 3, 4, 5] init_score (numpy median): 3.0 CPU init_score: 3.0 (correct) CUDA init_score (before): 3.5 (biased toward upper) CUDA init_score (after): 3.0 (correct) This fix mirrors PR #6 in PercentileDevice and uses the same Type-7 interpolated-quantile formula: float_pos = (1 - alpha) * (len - 1) pos = floor(float_pos) + 1 bias = float_pos - (pos - 1) Parity-sweep impact: reg_l1 max|Δ|: 0.25 -> 0.000e+00 reg_quantile max|Δ|: 0.54 -> 0.000e+00 The weighted branch of PercentileGlobalKernel uses different conventions and is not touched by this PR. There appears to be an unrelated bug in the CPU `WeightedPercentileFun` macro (off-by-one in which cdf delta is used in the interpolation), but that affects only non-uniform-weight workloads and is out of scope here - the Python wrapper drops uniform weights, so this PR's unweighted-formula fix already covers the common path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
PercentileDeviceininclude/LightGBM/cuda/cuda_algorithms.hpp— used by the CUDA leaf-renewal kernels forregression_l1andquantileobjectives — computed the percentile position againstleninstead oflen - 1, and indexed it 0-based instead of CPU's 1-based-with-+1 convention.For
alpha = 0.5(median), this means CUDA returned:len(
sorted[]is descending here becausePercentileDeviceis called withASCENDING=false.)CPU
PercentileFun(src/objective/regression_objective.hpp:28-29):This matches the standard Type-7 interpolated quantile — the same formula
numpy.medianand R'squantile()use by default.Change
Reproducer (before fix)
(CPU's column matched
np.medianto FP epsilon throughout.)Verification (after fix)
With
learning_rate=1.0to strip away shrinkage:Parity-benchmark impact (
scratch/cpu_cuda_parity.pyfrom PR #4):reg_l1max|Δ| raw_scorereg_quantilemax|Δ| raw_scoreFor quantile, the test sweep additionally now produces identical leaf populations on both devices for
alpha=0.7. For L1 there is a residual structural divergence (CPU and CUDA pick slightly different splits even on round 1 with identical gradients) that will be investigated separately — this PR fixes only the leaf-value calculation.Scope notes
PercentileDeviceuses different conventions on CPU vs CUDA (ascending vs descending sort,alphavs1-alphathreshold) and is left untouched. None of the L1/quantile defaults exercise it.PercentileDeviceis only called fromcuda_regression_objective.cu(L1 and quantile renewal kernels). No other call sites in the tree learner depend on the old behavior.Test plan
scratch/probe_l1_median.py— every CUDA leaf now matchesnp.medianto FP epsilonscratch/probe_quantile.py— every CUDA leaf now matchesnp.quantile(α=0.7)to FP epsilonreg_l1andreg_quantileboth improved🤖 Generated with Claude Code