Skip to content

[cuda] fix illegal-memory-access crash in weighted L1 / quantile training#8

Merged
BelixRogner merged 6 commits into
BelixRogner:masterfrom
maxwbuckley:cuda/fix-weighted-percentile-crash
May 19, 2026
Merged

[cuda] fix illegal-memory-access crash in weighted L1 / quantile training#8
BelixRogner merged 6 commits into
BelixRogner:masterfrom
maxwbuckley:cuda/fix-weighted-percentile-crash

Conversation

@maxwbuckley
Copy link
Copy Markdown

Summary

Weighted L1 (objective=regression_l1) and weighted quantile training crash CUDA with "illegal memory access" for any dataset with n ≥ ~100 samples:

[LightGBM] [Fatal] [CUDA] an illegal memory access was encountered
.../cuda_regression_objective.cu 225

Reproducer:

import numpy as np, lightgbm as lgb
n = 100
X = np.random.randn(n, 3); y = np.random.randn(n); w = np.random.rand(n)
ds = lgb.Dataset(X, label=y, weight=w)
lgb.train({"objective": "regression_l1", "device_type": "cuda",
           "num_leaves": 4, "min_data_in_leaf": 1}, ds, num_boost_round=1)

Root cause

In include/LightGBM/cuda/cuda_algorithms.hpp, ShuffleSortedPrefixSumDevice reads:

__shared__ REDUCE_VAL_T shared_buffer[WARPSIZE];   // 32 entries
...
thread_sum = ShufflePrefixSumExclusive<REDUCE_VAL_T>(thread_sum, shared_buffer);
const REDUCE_VAL_T thread_base = shared_buffer[threadIdx.x];   // ← OOB

shared_buffer has only WARPSIZE = 32 entries, but the renewal kernels launch with blockDim.x = 256 (GET_GRADIENTS_BLOCK_SIZE_REGRESSION / 4). For threadIdx.x in [32, 256), the read is out-of-bounds. The garbage value then propagates to bad indices in downstream global-memory accesses, eventually triggering "illegal memory access".

The function also had a second related bug: the inner loop out_values[index] = thread_base + in_values[...] does not cumulate within the per-thread chunk. It is correct only when num_data_per_thread == 1. Both bugs are fixed by the same change.

Fix

Use the per-thread exclusive prefix sum already returned by ShufflePrefixSumExclusive (matching the existing correct usage in GlobalMemoryPrefixSum at line 183) and cumulate inclusively across the chunk:

   thread_sum = ShufflePrefixSumExclusive<REDUCE_VAL_T>(thread_sum, shared_buffer);
-  const REDUCE_VAL_T thread_base = shared_buffer[threadIdx.x];
+  REDUCE_VAL_T running = thread_sum;
   for (INDEX_T index = start; index < end; ++index) {
-    out_values[index] = thread_base + static_cast<REDUCE_VAL_T>(in_values[sorted_indices[index]]);
+    running += static_cast<REDUCE_VAL_T>(in_values[sorted_indices[index]]);
+    out_values[index] = running;
   }

Test plan

  • Verified on RTX 5090 / CUDA 13.2: weighted L1 and weighted quantile train successfully on n in {100, 200, 500, 1000}. Predictions match CPU within typical L1/quantile FP-precision range.
  • Added regression test test_cuda_weighted_percentile_renewal_does_not_crash in test_dual.py (gated on LIGHTGBM_TEST_CUDA=1), parametrized over both objectives and 4 dataset sizes — all 8 pass with the fix.
  • Test with the unweighted percentile path also exercising ShuffleSortedPrefixSumDevice (PR [cuda] fix unweighted percentile formula (L1 & quantile leaf renewal) #6 only fixed the unweighted PercentileDevice — this PR fixes the weighted-only ShuffleSortedPrefixSumDevice it depends on).

Why this is a real bug, not FP-precision drift

This is a crash, not a numerical-precision difference. The "expected behavior" framing in lightgbm-org#6055 does not apply — anyone who tries to train weighted L1 or weighted quantile on CUDA gets a hard error.

🤖 Generated with Claude Code

maxwbuckley and others added 2 commits May 10, 2026 02:09
The function had two related bugs:

1. shared_buffer is declared __shared__ REDUCE_VAL_T shared_buffer[WARPSIZE]
   (32 entries), but the line `const REDUCE_VAL_T thread_base =
   shared_buffer[threadIdx.x]` reads at threadIdx.x in [0, blockDim.x).
   When blockDim.x > WARPSIZE (e.g. 256 for the L1/quantile renewal
   kernels), threadIdx.x in [WARPSIZE, blockDim.x) reads out-of-bounds
   shared memory.

2. The loop body `out_values[index] = thread_base + in_values[...]`
   does not cumulate within the per-thread chunk. It is correct only
   when num_data_per_thread == 1.

Together these manifest as an "illegal memory access" crash on weighted
L1 / weighted quantile training with n >= ~100 samples. Symptom:

    [LightGBM] [Fatal] [CUDA] an illegal memory access was encountered
    .../cuda_regression_objective.cu 225 (SynchronizeCUDADevice after
    RenewTreeOutputCUDAKernel_RegressionL1<USE_WEIGHT=true>)

Fix: use the per-thread exclusive prefix sum already returned by
ShufflePrefixSumExclusive (matching the existing correct usage in
GlobalMemoryPrefixSum at line 183), and cumulate inclusively across
the chunk.

Verified: weighted L1 and weighted quantile now train successfully on
n in {100, 200, 500, 1000} on RTX 5090 / CUDA 13.2. Predictions match
CPU within the typical L1/quantile FP-precision range.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Verifies CUDA weighted L1 / weighted quantile training does not raise
"illegal memory access" for n in {100, 200, 500, 1000}. Without the
prior fix, these all crashed in ShuffleSortedPrefixSumDevice.

Gated on LIGHTGBM_TEST_CUDA=1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@BelixRogner
Copy link
Copy Markdown
Owner

Thanks Max, and Claude Code. Solid diagnosis — the OOB shared_buffer[threadIdx.x] read at threadIdx.x ≥ WARPSIZE, plus the subtle second bug that the per-thread chunk wasn't cumulating, both fixed by the same change with exactly the right model: thread_sum is already the exclusive prefix from ShufflePrefixSumExclusive, accumulate inclusively across the chunk, matches the existing correct usage in GlobalMemoryPrefixSum:183. The cross-reference to the working analog is the kind of detail that makes review easy.

One small ask: env var convention — align LIGHTGBM_TEST_CUDA to the existing TASK=cuda pattern (same note as on #6, #7, #9, #10).

Merging once that's in.

Aligns with the existing convention used by test_engine.py's CUDA-only
tests (getenv("TASK", "") != "cuda"). Addresses Felix's review note on
PR lightgbm-org#8 (and the matching note on lightgbm-org#6, lightgbm-org#7, lightgbm-org#9, lightgbm-org#10).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@maxwbuckley
Copy link
Copy Markdown
Author

Switched to TASK=cuda to match the test_engine.py convention — pushed as b920a8af. Same change going on #6, #7, #9, #10 next.

BelixRogner pushed a commit that referenced this pull request May 18, 2026
Aligns with the existing convention used by test_engine.py's CUDA-only
tests. Addresses Felix's review note (same change going on #7/#8/#9/#10).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
BelixRogner pushed a commit that referenced this pull request May 18, 2026
Aligns with the existing convention used by test_engine.py's CUDA-only
tests. Addresses Felix's review note (same change going on #6/#7/#8/#10).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@BelixRogner
Copy link
Copy Markdown
Owner

Quick nudge — the chronic Python - latest versions failure that was hitting every open PR was fixed in #15 (merged earlier today) and the propagated update-branch on this PR confirms CI is otherwise green. Only thing keeping this one from merging now is the ruff/cpplint issue on test_dual.py (and on cuda_rank_objective.cpp for #14):

Single pre-commit run --all-files && git commit --amend --no-edit && git push --force-with-lease should clear all three (and #14 if you do those together).

Let me know when pushed and I'll merge.

@BelixRogner BelixRogner merged commit 59fb22d into BelixRogner:master May 19, 2026
58 of 60 checks passed
BelixRogner pushed a commit that referenced this pull request May 19, 2026
Aligns with the existing convention used by test_engine.py's CUDA-only
tests. Addresses Felix's review note (same change going on #6/#8/#9/#10).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants