[cuda] fix illegal-memory-access crash in weighted L1 / quantile training by maxwbuckley · Pull Request #8 · BelixRogner/ExaBoost

maxwbuckley · 2026-05-10T00:11:26Z

Summary

Weighted L1 (objective=regression_l1) and weighted quantile training crash CUDA with "illegal memory access" for any dataset with n ≥ ~100 samples:

[LightGBM] [Fatal] [CUDA] an illegal memory access was encountered
.../cuda_regression_objective.cu 225

Reproducer:

import numpy as np, lightgbm as lgb
n = 100
X = np.random.randn(n, 3); y = np.random.randn(n); w = np.random.rand(n)
ds = lgb.Dataset(X, label=y, weight=w)
lgb.train({"objective": "regression_l1", "device_type": "cuda",
           "num_leaves": 4, "min_data_in_leaf": 1}, ds, num_boost_round=1)

Root cause

In include/LightGBM/cuda/cuda_algorithms.hpp, ShuffleSortedPrefixSumDevice reads:

__shared__ REDUCE_VAL_T shared_buffer[WARPSIZE];   // 32 entries
...
thread_sum = ShufflePrefixSumExclusive<REDUCE_VAL_T>(thread_sum, shared_buffer);
const REDUCE_VAL_T thread_base = shared_buffer[threadIdx.x];   // ← OOB

shared_buffer has only WARPSIZE = 32 entries, but the renewal kernels launch with blockDim.x = 256 (GET_GRADIENTS_BLOCK_SIZE_REGRESSION / 4). For threadIdx.x in [32, 256), the read is out-of-bounds. The garbage value then propagates to bad indices in downstream global-memory accesses, eventually triggering "illegal memory access".

The function also had a second related bug: the inner loop out_values[index] = thread_base + in_values[...] does not cumulate within the per-thread chunk. It is correct only when num_data_per_thread == 1. Both bugs are fixed by the same change.

Fix

Use the per-thread exclusive prefix sum already returned by ShufflePrefixSumExclusive (matching the existing correct usage in GlobalMemoryPrefixSum at line 183) and cumulate inclusively across the chunk:

   thread_sum = ShufflePrefixSumExclusive<REDUCE_VAL_T>(thread_sum, shared_buffer);
-  const REDUCE_VAL_T thread_base = shared_buffer[threadIdx.x];
+  REDUCE_VAL_T running = thread_sum;
   for (INDEX_T index = start; index < end; ++index) {
-    out_values[index] = thread_base + static_cast<REDUCE_VAL_T>(in_values[sorted_indices[index]]);
+    running += static_cast<REDUCE_VAL_T>(in_values[sorted_indices[index]]);
+    out_values[index] = running;
   }

Test plan

Verified on RTX 5090 / CUDA 13.2: weighted L1 and weighted quantile train successfully on n in {100, 200, 500, 1000}. Predictions match CPU within typical L1/quantile FP-precision range.
Added regression test test_cuda_weighted_percentile_renewal_does_not_crash in test_dual.py (gated on LIGHTGBM_TEST_CUDA=1), parametrized over both objectives and 4 dataset sizes — all 8 pass with the fix.
Test with the unweighted percentile path also exercising ShuffleSortedPrefixSumDevice (PR [cuda] fix unweighted percentile formula (L1 & quantile leaf renewal) #6 only fixed the unweighted PercentileDevice — this PR fixes the weighted-only ShuffleSortedPrefixSumDevice it depends on).

Why this is a real bug, not FP-precision drift

This is a crash, not a numerical-precision difference. The "expected behavior" framing in lightgbm-org#6055 does not apply — anyone who tries to train weighted L1 or weighted quantile on CUDA gets a hard error.

🤖 Generated with Claude Code

The function had two related bugs: 1. shared_buffer is declared __shared__ REDUCE_VAL_T shared_buffer[WARPSIZE] (32 entries), but the line `const REDUCE_VAL_T thread_base = shared_buffer[threadIdx.x]` reads at threadIdx.x in [0, blockDim.x). When blockDim.x > WARPSIZE (e.g. 256 for the L1/quantile renewal kernels), threadIdx.x in [WARPSIZE, blockDim.x) reads out-of-bounds shared memory. 2. The loop body `out_values[index] = thread_base + in_values[...]` does not cumulate within the per-thread chunk. It is correct only when num_data_per_thread == 1. Together these manifest as an "illegal memory access" crash on weighted L1 / weighted quantile training with n >= ~100 samples. Symptom: [LightGBM] [Fatal] [CUDA] an illegal memory access was encountered .../cuda_regression_objective.cu 225 (SynchronizeCUDADevice after RenewTreeOutputCUDAKernel_RegressionL1<USE_WEIGHT=true>) Fix: use the per-thread exclusive prefix sum already returned by ShufflePrefixSumExclusive (matching the existing correct usage in GlobalMemoryPrefixSum at line 183), and cumulate inclusively across the chunk. Verified: weighted L1 and weighted quantile now train successfully on n in {100, 200, 500, 1000} on RTX 5090 / CUDA 13.2. Predictions match CPU within the typical L1/quantile FP-precision range. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Verifies CUDA weighted L1 / weighted quantile training does not raise "illegal memory access" for n in {100, 200, 500, 1000}. Without the prior fix, these all crashed in ShuffleSortedPrefixSumDevice. Gated on LIGHTGBM_TEST_CUDA=1. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

BelixRogner · 2026-05-10T09:39:17Z

Thanks Max, and Claude Code. Solid diagnosis — the OOB shared_buffer[threadIdx.x] read at threadIdx.x ≥ WARPSIZE, plus the subtle second bug that the per-thread chunk wasn't cumulating, both fixed by the same change with exactly the right model: thread_sum is already the exclusive prefix from ShufflePrefixSumExclusive, accumulate inclusively across the chunk, matches the existing correct usage in GlobalMemoryPrefixSum:183. The cross-reference to the working analog is the kind of detail that makes review easy.

One small ask: env var convention — align LIGHTGBM_TEST_CUDA to the existing TASK=cuda pattern (same note as on #6, #7, #9, #10).

Merging once that's in.

Aligns with the existing convention used by test_engine.py's CUDA-only tests (getenv("TASK", "") != "cuda"). Addresses Felix's review note on PR lightgbm-org#8 (and the matching note on lightgbm-org#6, lightgbm-org#7, lightgbm-org#9, lightgbm-org#10). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

maxwbuckley · 2026-05-10T09:48:37Z

Switched to TASK=cuda to match the test_engine.py convention — pushed as b920a8af. Same change going on #6, #7, #9, #10 next.

Aligns with the existing convention used by test_engine.py's CUDA-only tests. Addresses Felix's review note (same change going on #7/#8/#9/#10). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Aligns with the existing convention used by test_engine.py's CUDA-only tests. Addresses Felix's review note (same change going on #6/#7/#8/#10). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

BelixRogner · 2026-05-18T18:58:53Z

Quick nudge — the chronic Python - latest versions failure that was hitting every open PR was fixed in #15 (merged earlier today) and the propagated update-branch on this PR confirms CI is otherwise green. Only thing keeping this one from merging now is the ruff/cpplint issue on test_dual.py (and on cuda_rank_objective.cpp for #14):

PR [cuda] enforce max_depth on CUDA tree learner #7: PT006 — parametrize("max_depth,num_leaves", …) wants a tuple.
PR [cuda] fix illegal-memory-access crash in weighted L1 / quantile training #8: I001 — imports unsorted.
PR [cuda] make BitonicArgSort_1024 / _2048 stable on tied values #10: I001 + ruff format reformat.

Single pre-commit run --all-files && git commit --amend --no-edit && git push --force-with-lease should clear all three (and #14 if you do those together).

Let me know when pushed and I'll merge.

Aligns with the existing convention used by test_engine.py's CUDA-only tests. Addresses Felix's review note (same change going on #6/#8/#9/#10). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

maxwbuckley and others added 2 commits May 10, 2026 02:09

This was referenced May 10, 2026

[cuda] fix unweighted percentile formula (L1 & quantile leaf renewal) #6

Merged

[cuda] enforce max_depth on CUDA tree learner #7

Merged

BelixRogner mentioned this pull request May 10, 2026

[cuda] make BitonicArgSort_1024 / _2048 stable on tied values #10

Open

2 tasks

BelixRogner added 2 commits May 10, 2026 14:33

Merge branch 'master' into cuda/fix-weighted-percentile-crash

eec8192

Merge branch 'master' into cuda/fix-weighted-percentile-crash

c14667b

Merge branch 'master' into cuda/fix-weighted-percentile-crash

3880e4b

BelixRogner merged commit 59fb22d into BelixRogner:master May 19, 2026
58 of 60 checks passed

This was referenced May 19, 2026

[cuda] tolerance-based gain tie-break in best-split reductions #13

Open

[cuda] make LambdaRank gradient kernel bit-deterministic when deterministic=true #14

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[cuda] fix illegal-memory-access crash in weighted L1 / quantile training#8

[cuda] fix illegal-memory-access crash in weighted L1 / quantile training#8
BelixRogner merged 6 commits into
BelixRogner:masterfrom
maxwbuckley:cuda/fix-weighted-percentile-crash

maxwbuckley commented May 10, 2026

Uh oh!

BelixRogner commented May 10, 2026

Uh oh!

maxwbuckley commented May 10, 2026

Uh oh!

BelixRogner commented May 18, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

maxwbuckley commented May 10, 2026

Summary

Root cause

Fix

Test plan

Why this is a real bug, not FP-precision drift

Uh oh!

BelixRogner commented May 10, 2026

Uh oh!

maxwbuckley commented May 10, 2026

Uh oh!

BelixRogner commented May 18, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants