[cuda] make BitonicArgSort_1024 / _2048 stable on tied values#10
[cuda] make BitonicArgSort_1024 / _2048 stable on tied values#10maxwbuckley wants to merge 8 commits into
Conversation
The two device-side bitonic sorts used the comparator (scores[a] > scores[b]) == ascending which evaluates to true when scores[a] == scores[b] and ascending=false (because false == false), causing ties to be swapped during a descending sort. The output index permutation for tied values then depended on the network structure rather than the input order. CPU code paths use std::stable_sort, which never swaps equal elements. Aligning CUDA's behavior closes the LambdaRank round-1 divergence where all scores are identically zero and the resulting sort permutation determines which document pairs accumulate gradient first. Symptom on a 10-query, 20-items-per-query dataset (round 1, all scores == 0): before: CPU/CUDA max|Δ| raw_score = 0.29 after: CPU/CUDA max|Δ| raw_score = 0.14 (The remaining round-1 divergence is FP-precision in atomicAdd_block order across pairs, which is documented "expected" by upstream maintainers in lightgbm-org#6055 and is unaffected by this change.) Fix: replace `(a > b) == ascending` with the strict-direction form ASCENDING ? (a > b) : (a < b) in both BitonicArgSort_1024 and the equivalent block in BitonicArgSort_2048's outer ascending/descending loops. Verified with the CPU/CUDA parity sweep: no regression on any of the previously-clean cases (categorical kernel still uses BitonicArgSort_1024 in the many-vs-many split path; that case still matches at FP epsilon). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Verifies that CUDA LambdaRank's round-1 predictions stay within ~0.2 of CPU on a small synthetic dataset where all initial scores are zero. Without the BitonicArgSort tie-stability fix in the prior commit, this case diverged by ~0.29; with the fix it drops to ~0.14 (FP-precision residual from pair-gradient atomicAdd ordering, expected per lightgbm-org#6055). The 0.2 threshold catches the bitonic-sort regression while tolerating the FP-precision residual. Gated on LIGHTGBM_TEST_CUDA=1. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Thanks Max, and Claude Code. The motivation is right — But I want to flag what I think is a regression in the The two halves of the diff disagree on which variable to consult. In Look at the upstream template <..., bool ASCENDING>
__device__ ... BitonicArgSort_1024(...) {
for (outer_depth ...) {
...
const bool ascending = ASCENDING ? (outer_segment_index % 2 == 0) : (outer_segment_index % 2 > 0);
// ^^^^^^^^^^^^^^^^^^^^^^^ per-pass local, NOT the template parameter
for (inner_depth ...) {
if ((... > ...) == ascending) { swap }
// ^^^^^^^^^ uses the per-pass localThat's textbook bitonic sort: outer phases alternate direction so the inner merge can pull a bitonic sequence into a monotone one. The comparator has to consult per-pass PR #10's const bool need_swap = ASCENDING // ← uppercase, template param
? (scores[...] > scores[...])
: (scores[...] < scores[...]);I believe that breaks the bitonic merge for non-tied inputs. The reason your lambdarank test passes despite the breakage is exactly the same reason the original bug only surfaced there: round-1 scores are all zero, so the strict comparator returns The
Suggested fix for const bool need_swap = ascending // lowercase, the per-pass local
? (scores[indices[threadIdx.x]] > scores[indices[index_to_compare]])
: (scores[indices[threadIdx.x]] < scores[indices[index_to_compare]]);A test with non-trivial input — e.g. a categorical-split sort with all-distinct, well-separated scores, asserting CPU and CUDA agree on the resulting permutation — would also catch this case that lambdarank round-1 can't. Happy to be wrong here if you can show me an input that exercises Same env var convention note as #6/#7/#8/#9 — align to |
…v var; cover non-tied input Felix's review on lightgbm-org#10 caught a regression in the BitonicArgSort_1024 half of the original tie-stability fix: I had switched the comparator to read the template parameter `ASCENDING` (the global sort direction) instead of the per-pass local `ascending` (which alternates by outer segment index for the bitonic merge). The all-tied LambdaRank round-1 test passed either way because strict comparison returns false for equal scores -- so the broken comparator never made a wrong swap on that input -- but for non-tied inputs the bitonic merge stops pulling a bitonic sequence into a monotone one and the sort silently produces wrong orderings. The _2048 half of the diff was already correct (uses per-pass `ascending`), so this is symmetric: switch _1024 back to `ascending` to match. Also adds test_cuda_bitonic_argsort_1024_with_distinct_scores_matches_cpu to test_dual.py: trains a regression on a single categorical feature with distinct, well-separated per-category sums, which exercises BitonicArgSort_1024 over non-tied scores via the categorical split-finder. Asserts CPU and CUDA agree on predictions after one boosting round; the wrong-comparator case would diverge. Aligns the env var gate to TASK=cuda to match test_engine.py's existing convention (same change requested on lightgbm-org#6/lightgbm-org#7/lightgbm-org#8/lightgbm-org#9). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Good catch — you're right. Pushed
|
|
Quick nudge — the chronic
Single Let me know when pushed and I'll merge. |
…-tie-stability # Conflicts: # tests/python_package_test/test_dual.py
|
Quick rebase nudge — #7 and #8 just landed on master and touched files this PR also modifies ( Ready to merge as soon as the conflict's resolved. |
Summary
BitonicArgSort_1024andBitonicArgSort_2048ininclude/LightGBM/cuda/cuda_algorithms.hppused the comparatorFor descending sort (
ascending == false) and equal scoresa == b, this evaluates tofalse == false → true— i.e., ties get swapped. The output index permutation for tied elements then depends on the bitonic network structure rather than input order. CPU code usesstd::stable_sort, which never swaps ties.Symptom
LambdaRank round-1 has all scores = 0. The pair-iteration gradient computation depends on the sorted index order. With non-stable sort, CUDA's pair assignments diverge from CPU's, producing different gradients on the very first round.
The remaining 0.14 is FP-precision in
atomicAdd_blockaccumulation order — documented expected behavior per lightgbm-org#6055 — and is unaffected by this change.Fix
Replace the comparator with a strict-direction form so equal elements never trigger a swap:
Same change applied to the analogous block in
BitonicArgSort_2048.Other call sites
BitonicArgSort_1024is also used by the categorical split-finder kernel (cuda_best_split_finder.cu:640) for many-vs-many splits. The CPU/CUDA parity sweep was re-run after this fix and the categorical case (and every other previously-clean case) still matches at FP epsilon — see verification below.Test plan
test_cuda_lambdarank_round1_matches_cpu_within_fp_driftintests/python_package_test/test_dual.py(gated onLIGHTGBM_TEST_CUDA=1). Asserts max|Δ| < 0.2 — strict enough to catch the regression, loose enough for the FP-precision residual.scratch/cpu_cuda_parity.pysweep on RTX 5090 / CUDA 13.2 — no regression on any case (categorical, regression, binary, multiclass, etc. all still match at FP epsilon where they did before).🤖 Generated with Claude Code