[cuda] fix illegal-memory-access crash in weighted L1 / quantile training#8
Conversation
The function had two related bugs:
1. shared_buffer is declared __shared__ REDUCE_VAL_T shared_buffer[WARPSIZE]
(32 entries), but the line `const REDUCE_VAL_T thread_base =
shared_buffer[threadIdx.x]` reads at threadIdx.x in [0, blockDim.x).
When blockDim.x > WARPSIZE (e.g. 256 for the L1/quantile renewal
kernels), threadIdx.x in [WARPSIZE, blockDim.x) reads out-of-bounds
shared memory.
2. The loop body `out_values[index] = thread_base + in_values[...]`
does not cumulate within the per-thread chunk. It is correct only
when num_data_per_thread == 1.
Together these manifest as an "illegal memory access" crash on weighted
L1 / weighted quantile training with n >= ~100 samples. Symptom:
[LightGBM] [Fatal] [CUDA] an illegal memory access was encountered
.../cuda_regression_objective.cu 225 (SynchronizeCUDADevice after
RenewTreeOutputCUDAKernel_RegressionL1<USE_WEIGHT=true>)
Fix: use the per-thread exclusive prefix sum already returned by
ShufflePrefixSumExclusive (matching the existing correct usage in
GlobalMemoryPrefixSum at line 183), and cumulate inclusively across
the chunk.
Verified: weighted L1 and weighted quantile now train successfully on
n in {100, 200, 500, 1000} on RTX 5090 / CUDA 13.2. Predictions match
CPU within the typical L1/quantile FP-precision range.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Verifies CUDA weighted L1 / weighted quantile training does not raise
"illegal memory access" for n in {100, 200, 500, 1000}. Without the
prior fix, these all crashed in ShuffleSortedPrefixSumDevice.
Gated on LIGHTGBM_TEST_CUDA=1.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Thanks Max, and Claude Code. Solid diagnosis — the OOB One small ask: env var convention — align Merging once that's in. |
Aligns with the existing convention used by test_engine.py's CUDA-only
tests (getenv("TASK", "") != "cuda"). Addresses Felix's review note on
PR lightgbm-org#8 (and the matching note on lightgbm-org#6, lightgbm-org#7, lightgbm-org#9, lightgbm-org#10).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Quick nudge — the chronic
Single Let me know when pushed and I'll merge. |
Summary
Weighted L1 (
objective=regression_l1) and weighted quantile training crash CUDA with "illegal memory access" for any dataset with n ≥ ~100 samples:Reproducer:
Root cause
In
include/LightGBM/cuda/cuda_algorithms.hpp,ShuffleSortedPrefixSumDevicereads:shared_bufferhas onlyWARPSIZE = 32entries, but the renewal kernels launch withblockDim.x = 256(GET_GRADIENTS_BLOCK_SIZE_REGRESSION / 4). ForthreadIdx.xin[32, 256), the read is out-of-bounds. The garbage value then propagates to bad indices in downstream global-memory accesses, eventually triggering "illegal memory access".The function also had a second related bug: the inner loop
out_values[index] = thread_base + in_values[...]does not cumulate within the per-thread chunk. It is correct only whennum_data_per_thread == 1. Both bugs are fixed by the same change.Fix
Use the per-thread exclusive prefix sum already returned by
ShufflePrefixSumExclusive(matching the existing correct usage inGlobalMemoryPrefixSumat line 183) and cumulate inclusively across the chunk:Test plan
test_cuda_weighted_percentile_renewal_does_not_crashintest_dual.py(gated onLIGHTGBM_TEST_CUDA=1), parametrized over both objectives and 4 dataset sizes — all 8 pass with the fix.ShuffleSortedPrefixSumDevice(PR [cuda] fix unweighted percentile formula (L1 & quantile leaf renewal) #6 only fixed the unweighted PercentileDevice — this PR fixes the weighted-onlyShuffleSortedPrefixSumDeviceit depends on).Why this is a real bug, not FP-precision drift
This is a crash, not a numerical-precision difference. The "expected behavior" framing in lightgbm-org#6055 does not apply — anyone who tries to train weighted L1 or weighted quantile on CUDA gets a hard error.
🤖 Generated with Claude Code