Skip to content

[cuda] make BitonicArgSort_1024 / _2048 stable on tied values#10

Open
maxwbuckley wants to merge 8 commits into
BelixRogner:masterfrom
maxwbuckley:cuda/bitonic-argsort-tie-stability
Open

[cuda] make BitonicArgSort_1024 / _2048 stable on tied values#10
maxwbuckley wants to merge 8 commits into
BelixRogner:masterfrom
maxwbuckley:cuda/bitonic-argsort-tie-stability

Conversation

@maxwbuckley
Copy link
Copy Markdown

Summary

BitonicArgSort_1024 and BitonicArgSort_2048 in include/LightGBM/cuda/cuda_algorithms.hpp used the comparator

if ((scores[indices[a]] > scores[indices[b]]) == ascending) { /* swap */ }

For descending sort (ascending == false) and equal scores a == b, this evaluates to false == false → true — i.e., ties get swapped. The output index permutation for tied elements then depends on the bitonic network structure rather than input order. CPU code uses std::stable_sort, which never swaps ties.

Symptom

LambdaRank round-1 has all scores = 0. The pair-iteration gradient computation depends on the sorted index order. With non-stable sort, CUDA's pair assignments diverge from CPU's, producing different gradients on the very first round.

Test: 10 queries × 20 items, lambdarank
Before this fix:  CPU/CUDA max|Δ| raw_score round 1 = 0.29
After this fix:   CPU/CUDA max|Δ| raw_score round 1 = 0.14

The remaining 0.14 is FP-precision in atomicAdd_block accumulation order — documented expected behavior per lightgbm-org#6055 — and is unaffected by this change.

Fix

Replace the comparator with a strict-direction form so equal elements never trigger a swap:

-          if ((scores[indices[threadIdx.x]] > scores[indices[index_to_compare]]) == ascending) {
+          const bool need_swap = ASCENDING
+              ? (scores[indices[threadIdx.x]] > scores[indices[index_to_compare]])
+              : (scores[indices[threadIdx.x]] < scores[indices[index_to_compare]]);
+          if (need_swap) {
             ...
           }

Same change applied to the analogous block in BitonicArgSort_2048.

Other call sites

BitonicArgSort_1024 is also used by the categorical split-finder kernel (cuda_best_split_finder.cu:640) for many-vs-many splits. The CPU/CUDA parity sweep was re-run after this fix and the categorical case (and every other previously-clean case) still matches at FP epsilon — see verification below.

Test plan

  • Added test_cuda_lambdarank_round1_matches_cpu_within_fp_drift in tests/python_package_test/test_dual.py (gated on LIGHTGBM_TEST_CUDA=1). Asserts max|Δ| < 0.2 — strict enough to catch the regression, loose enough for the FP-precision residual.
  • Re-ran the full scratch/cpu_cuda_parity.py sweep on RTX 5090 / CUDA 13.2 — no regression on any case (categorical, regression, binary, multiclass, etc. all still match at FP epsilon where they did before).

🤖 Generated with Claude Code

maxwbuckley and others added 2 commits May 10, 2026 06:45
The two device-side bitonic sorts used the comparator
  (scores[a] > scores[b]) == ascending
which evaluates to true when scores[a] == scores[b] and ascending=false
(because false == false), causing ties to be swapped during a
descending sort. The output index permutation for tied values then
depended on the network structure rather than the input order.

CPU code paths use std::stable_sort, which never swaps equal elements.
Aligning CUDA's behavior closes the LambdaRank round-1 divergence
where all scores are identically zero and the resulting sort
permutation determines which document pairs accumulate gradient first.

Symptom on a 10-query, 20-items-per-query dataset (round 1, all scores
== 0):
  before:  CPU/CUDA max|Δ| raw_score = 0.29
  after:   CPU/CUDA max|Δ| raw_score = 0.14

(The remaining round-1 divergence is FP-precision in atomicAdd_block
order across pairs, which is documented "expected" by upstream
maintainers in lightgbm-org#6055 and is unaffected by this
change.)

Fix: replace `(a > b) == ascending` with the strict-direction form
  ASCENDING ? (a > b) : (a < b)
in both BitonicArgSort_1024 and the equivalent block in
BitonicArgSort_2048's outer ascending/descending loops.

Verified with the CPU/CUDA parity sweep: no regression on any of the
previously-clean cases (categorical kernel still uses BitonicArgSort_1024
in the many-vs-many split path; that case still matches at FP epsilon).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Verifies that CUDA LambdaRank's round-1 predictions stay within ~0.2 of
CPU on a small synthetic dataset where all initial scores are zero.

Without the BitonicArgSort tie-stability fix in the prior commit, this
case diverged by ~0.29; with the fix it drops to ~0.14 (FP-precision
residual from pair-gradient atomicAdd ordering, expected per
lightgbm-org#6055).

The 0.2 threshold catches the bitonic-sort regression while tolerating
the FP-precision residual.

Gated on LIGHTGBM_TEST_CUDA=1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@BelixRogner
Copy link
Copy Markdown
Owner

Thanks Max, and Claude Code. The motivation is right — (a > b) == ascending swapping ties when ascending == false is a real bug, and lambdarank round-1 with all-zero scores is the perfect case to surface it.

But I want to flag what I think is a regression in the BitonicArgSort_1024 half of the fix before merging.

The two halves of the diff disagree on which variable to consult. In _2048, the comparator (correctly) reads lowercase ascending — the per-pass local. In _1024, the comparator was switched to uppercase ASCENDING — the template parameter. They should be the same variable.

Look at the upstream _1024 body around line 240:

template <..., bool ASCENDING>
__device__ ... BitonicArgSort_1024(...) {
  for (outer_depth ...) {
    ...
    const bool ascending = ASCENDING ? (outer_segment_index % 2 == 0) : (outer_segment_index % 2 > 0);
    //          ^^^^^^^^^^^^^^^^^^^^^^^ per-pass local, NOT the template parameter
    for (inner_depth ...) {
      if ((... > ...) == ascending) { swap }
      //                  ^^^^^^^^^ uses the per-pass local

That's textbook bitonic sort: outer phases alternate direction so the inner merge can pull a bitonic sequence into a monotone one. The comparator has to consult per-pass ascending, not the global ASCENDING.

PR #10's _1024 fix replaces the comparator with:

const bool need_swap = ASCENDING  // ← uppercase, template param
    ? (scores[...] > scores[...])
    : (scores[...] < scores[...]);

I believe that breaks the bitonic merge for non-tied inputs. The reason your lambdarank test passes despite the breakage is exactly the same reason the original bug only surfaced there: round-1 scores are all zero, so the strict comparator returns false regardless of direction — no swaps, no exposure of the sort being wrong.

The _2048 change in the same diff uses lowercase ascending and looks correct.

BitonicArgSort_1024 is also called by the categorical split-finder at cuda_best_split_finder.cu:640 with real, distinct scores. If _1024's comparator no longer alternates direction by pass, that sort should produce wrong orderings on inputs that aren't all-equal. Your parity sweep showed no regression on the categorical case — I'd guess the call happens to use ASCENDING == true and the relevant per-pass ascending is also true on the inputs you tested, i.e. they coincide by luck on small inputs.

Suggested fix for _1024 — symmetric with what you have in _2048:

const bool need_swap = ascending  // lowercase, the per-pass local
    ? (scores[indices[threadIdx.x]] > scores[indices[index_to_compare]])
    : (scores[indices[threadIdx.x]] < scores[indices[index_to_compare]]);

A test with non-trivial input — e.g. a categorical-split sort with all-distinct, well-separated scores, asserting CPU and CUDA agree on the resulting permutation — would also catch this case that lambdarank round-1 can't.

Happy to be wrong here if you can show me an input that exercises _1024's alternating-direction passes and validates the output is correct with the ASCENDING-only comparator. On read though I think the per-pass ascending is what we need.

Same env var convention note as #6/#7/#8/#9 — align to TASK=cuda while you're updating.

…v var; cover non-tied input

Felix's review on lightgbm-org#10 caught a regression in the BitonicArgSort_1024
half of the original tie-stability fix: I had switched the comparator to
read the template parameter `ASCENDING` (the global sort direction)
instead of the per-pass local `ascending` (which alternates by outer
segment index for the bitonic merge). The all-tied LambdaRank round-1
test passed either way because strict comparison returns false for
equal scores -- so the broken comparator never made a wrong swap on
that input -- but for non-tied inputs the bitonic merge stops pulling
a bitonic sequence into a monotone one and the sort silently produces
wrong orderings.

The _2048 half of the diff was already correct (uses per-pass
`ascending`), so this is symmetric: switch _1024 back to `ascending`
to match.

Also adds test_cuda_bitonic_argsort_1024_with_distinct_scores_matches_cpu
to test_dual.py: trains a regression on a single categorical feature
with distinct, well-separated per-category sums, which exercises
BitonicArgSort_1024 over non-tied scores via the categorical
split-finder. Asserts CPU and CUDA agree on predictions after one
boosting round; the wrong-comparator case would diverge.

Aligns the env var gate to TASK=cuda to match test_engine.py's
existing convention (same change requested on lightgbm-org#6/lightgbm-org#7/lightgbm-org#8/lightgbm-org#9).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@maxwbuckley
Copy link
Copy Markdown
Author

Good catch — you're right. Pushed f47e0499 with three changes:

  1. BitonicArgSort_1024 comparator now reads the per-pass ascending local, not the template ASCENDING. The fix is now symmetric with _2048. Your read of why the lambdarank test passed despite the breakage was exactly correct: round-1 all-zero scores meant the strict comparator returned false either way, so the broken comparator never made a wrong swap on that input.

  2. Added test_cuda_bitonic_argsort_1024_with_distinct_scores_matches_cpu in test_dual.py — trains a regression on a single categorical feature with distinct, well-separated per-category gradient/hessian sums, which exercises BitonicArgSort_1024 over non-tied scores via the categorical split-finder at cuda_best_split_finder.cu:640. Asserts CPU and CUDA agree on predictions after one round; the wrong-comparator case would diverge.

  3. Env var aligned to TASK=cuda.

BelixRogner pushed a commit that referenced this pull request May 18, 2026
Aligns with the existing convention used by test_engine.py's CUDA-only
tests. Addresses Felix's review note (same change going on #7/#8/#9/#10).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
BelixRogner pushed a commit that referenced this pull request May 18, 2026
Aligns with the existing convention used by test_engine.py's CUDA-only
tests. Addresses Felix's review note (same change going on #6/#7/#8/#10).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@BelixRogner
Copy link
Copy Markdown
Owner

Quick nudge — the chronic Python - latest versions failure that was hitting every open PR was fixed in #15 (merged earlier today) and the propagated update-branch on this PR confirms CI is otherwise green. Only thing keeping this one from merging now is the ruff/cpplint issue on test_dual.py (and on cuda_rank_objective.cpp for #14):

Single pre-commit run --all-files && git commit --amend --no-edit && git push --force-with-lease should clear all three (and #14 if you do those together).

Let me know when pushed and I'll merge.

…-tie-stability

# Conflicts:
#	tests/python_package_test/test_dual.py
BelixRogner pushed a commit that referenced this pull request May 19, 2026
Aligns with the existing convention used by test_engine.py's CUDA-only
tests (getenv("TASK", "") != "cuda"). Addresses Felix's review note on
PR #8 (and the matching note on #6, #7, #9, #10).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
BelixRogner pushed a commit that referenced this pull request May 19, 2026
Aligns with the existing convention used by test_engine.py's CUDA-only
tests. Addresses Felix's review note (same change going on #6/#8/#9/#10).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@BelixRogner
Copy link
Copy Markdown
Owner

Quick rebase nudge — #7 and #8 just landed on master and touched files this PR also modifies (cuda_algorithms.hpp / cuda_best_split_finder.cu / etc.), so this branch now shows a merge conflict on GitHub. One more git merge master && git push round should clear it; CI is otherwise green (the apparent failures yesterday were all environmental — dask socket flakes, a cancelled job rolling up, and a Boost-headers wheel-build issue, none of which were touching the actual PR content).

Ready to merge as soon as the conflict's resolved.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants