[cuda] enforce max_depth on CUDA tree learner by maxwbuckley · Pull Request #7 · BelixRogner/ExaBoost

maxwbuckley · 2026-05-09T23:44:56Z

Summary

Two related bugs caused the CUDA tree learner to silently ignore the max_depth parameter:

CUDABestSplitFinder::FindBestSplitsForLeaf had no max_depth check. CPU's SerialTreeLearner::BeforeFindBestSplit (src/treelearner/serial_tree_learner.cpp:347-355) invalidates a leaf's gain when its depth has reached config_->max_depth, but the CUDA path never did the equivalent.
CUDATree::Split / SplitCategorical updated the GPU-side cuda_leaf_depth_ (via the split kernel) but never updated the host-side leaf_depth_ vector. So tree->leaf_depth(idx) always returned 0 on CUDA. Without fixing (2), even adding the check at (1) would have done nothing.

Symptom

With max_depth=2, varying num_leaves:

`num_leaves`	CPU (correct)	CUDA before fix
4	depth=2, leaves=4	depth=2, leaves=4
7	depth=2, leaves=4	depth=3, leaves=7
15	depth=2, leaves=4	depth=5, leaves=15
31	depth=2, leaves=4	depth=7, leaves=31

After the fix, CUDA caps at the requested depth (2) for every num_leaves.

Why this is a real bug, not FP-precision drift

This is a missing feature on CUDA, not the kind of "numerical precision differences caused by different degrees of parallelism" that issue lightgbm-org#6055 discusses. CUDA is silently ignoring a documented regularization parameter — any user who set max_depth on CUDA was getting models 2-7x deeper than requested, with substantially weaker regularization.

Fix

src/io/cuda/cuda_tree.cpp — mirror the host-side leaf_depth_ update in CUDATree::Split and CUDATree::SplitCategorical (matching CPU Tree::Split in include/LightGBM/tree.h:578-580).
src/treelearner/cuda/cuda_best_split_finder.{hpp,cpp} — plumb two new flags smaller_leaf_below_max_depth and larger_leaf_below_max_depth into FindBestSplitsForLeaf and AND them into the is_*_leaf_valid checks.
src/treelearner/cuda/cuda_single_gpu_tree_learner.cpp — compute the flags at the call sites: config_->max_depth <= 0 || tree->leaf_depth(idx) < config_->max_depth.

Test plan

Verified locally on RTX 5090 / CUDA 13.2: CUDA now caps trees at the requested max_depth for num_leaves from 4 to 31.
CPU/CUDA parity sweep — reg_max_depth case (which used max_depth=3 with num_leaves=7) now matches CPU at FP epsilon, down from max|Δ|=0.25 raw_score.
Existing tests on CUDA (not yet run by author).

🤖 Generated with Claude Code

Two related bugs caused CUDA to ignore the `max_depth` parameter: 1. CUDABestSplitFinder::FindBestSplitsForLeaf had no max_depth check. CPU's SerialTreeLearner::BeforeFindBestSplit invalidates a leaf's gain when its depth has reached config_->max_depth, but the CUDA path never did the equivalent. 2. CUDATree::Split / SplitCategorical updated the GPU-side cuda_leaf_depth_ via the launch kernel but never updated the host-side leaf_depth_ vector, so tree->leaf_depth(idx) always returned 0 on CUDA. Without (2), even adding the check at (1) would have done nothing. Symptom (max_depth=2, varying num_leaves): num_leaves= 4: cpu depth=2 leaves=4 | cuda depth=2 leaves=4 num_leaves= 7: cpu depth=2 leaves=4 | cuda depth=3 leaves=7 num_leaves=15: cpu depth=2 leaves=4 | cuda depth=5 leaves=15 num_leaves=31: cpu depth=2 leaves=4 | cuda depth=7 leaves=31 After fix, CUDA caps at the requested depth (2) for every num_leaves. Fix: * Mirror the host-side leaf_depth_ update in CUDATree::Split and CUDATree::SplitCategorical (matching CPU Tree::Split's behavior in include/LightGBM/tree.h). * Plumb a `smaller_leaf_below_max_depth` / `larger_leaf_below_max_depth` flag pair into FindBestSplitsForLeaf and AND them into the is_*_leaf_valid checks. The caller in cuda_single_gpu_tree_learner.cpp computes them as `config_->max_depth <= 0 || tree->leaf_depth(idx) < config_->max_depth`. Verified with the cpu/cuda parity sweep: reg_max_depth case (which used max_depth=3 with num_leaves=7) now matches CPU at FP epsilon, down from max|Δ|=0.25 raw_score. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Regression coverage for the prior commit (CUDA tree learner now enforces max_depth). Two parametrized tests, gated on LIGHTGBM_TEST_CUDA=1: - test_cuda_respects_max_depth: across (max_depth, num_leaves) combinations from {1,2,3,5} x {2,4,7,31}, asserts CUDA tree depth is at most max_depth and matches CPU depth + leaf count exactly. - test_cuda_max_depth_matches_cpu_predictions: end-to-end check that 5 boosting rounds with max_depth=3 produce CPU/CUDA predictions matching at FP epsilon. Without the fix, this diverged by max|Δ|=0.47. Verified: with the prior commit reverted, 5 of 9 cases fail (those where num_leaves > 2^max_depth, i.e. where the bug actually triggered). With the fix applied, all 9 pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

BelixRogner · 2026-05-10T09:39:13Z

Thanks Max, and Claude Code. The symptom table (max_depth=2 producing depth-7 trees) is the kind of regression that should never have shipped — good catch. Fix is well-scoped: mirror CPU's Tree::Split for host-side leaf_depth_, plumb two flags, AND them into the existing validity check. Eight parametrized depth-cap tests plus an end-to-end raw-score equality check is solid.

Two small things before merge:

Env var convention — same as on PRs [cuda] fix unweighted percentile formula (L1 & quantile leaf renewal) #6, [cuda] fix illegal-memory-access crash in weighted L1 / quantile training #8, [cuda] fix unweighted percentile formula in PercentileGlobalKernel (init scores) #9, [cuda] make BitonicArgSort_1024 / _2048 stable on tied values #10: align LIGHTGBM_TEST_CUDA to the existing getenv("TASK", "") != "cuda" pattern.
Existing CUDA test suite — your test plan checkbox for "Existing tests on CUDA" is unchecked. Since max_depth enforcement touches every FindBestSplitsForLeaf call, can you run the suite under TASK=cuda and confirm green before merge? Environmental skips like graphviz / the device_type=cpu-baked-in test you flagged on PR [cuda] honor min_data_per_group in categorical split kernels #4 are fine to enumerate and ignore.

Aligns with the existing convention used by test_engine.py's CUDA-only tests. Addresses Felix's review note (same change going on lightgbm-org#6/lightgbm-org#8/lightgbm-org#9/lightgbm-org#10). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

maxwbuckley · 2026-05-10T10:03:24Z

Both addressed.

1. Env var aligned to TASK=cuda — pushed as eaa55986.

2. Full Python test suite on CUDA — built the PR branch on RTX 5090 + CUDA 13.2 (sm_120) and ran:

TASK=cuda pytest tests/python_package_test/ --ignore=tests/python_package_test/test_dask.py
5 failed, 1111 passed, 21 skipped, 8 xfailed, 4 xpassed in 66.25s

The 5 failures are all environmental, none caused by this PR — same set as on #4:

test_engine.py::test_all_expected_params_are_written_out_to_model_text — asserts the trained model text contains [device_type: cuda] when TASK=cuda, but the test trains with no device_type param and a CUDA build still defaults to device_type=cpu. Test body is byte-identical to upstream master, so this fails the same way against origin/master — unrelated to this PR.
test_plotting.py × 4 — graphviz.backend.execute.ExecutableNotFound: failed to execute PosixPath('dot'). The dot binary isn't installed on this machine; apt install graphviz away.

The new max_depth tests all pass:

tests/python_package_test/test_dual.py::test_cuda_respects_max_depth[1-2] PASSED
tests/python_package_test/test_dual.py::test_cuda_respects_max_depth[1-7] PASSED
tests/python_package_test/test_dual.py::test_cuda_respects_max_depth[2-4] PASSED
tests/python_package_test/test_dual.py::test_cuda_respects_max_depth[2-7] PASSED
tests/python_package_test/test_dual.py::test_cuda_respects_max_depth[2-31] PASSED
tests/python_package_test/test_dual.py::test_cuda_respects_max_depth[3-7] PASSED
tests/python_package_test/test_dual.py::test_cuda_respects_max_depth[3-31] PASSED
tests/python_package_test/test_dual.py::test_cuda_respects_max_depth[5-31] PASSED
tests/python_package_test/test_dual.py::test_cuda_max_depth_matches_cpu_predictions PASSED
9 passed, 1 skipped in 0.55s

Aligns with the existing convention used by test_engine.py's CUDA-only tests. Addresses Felix's review note (same change going on #7/#8/#9/#10). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Aligns with the existing convention used by test_engine.py's CUDA-only tests. Addresses Felix's review note (same change going on #6/#7/#8/#10). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

BelixRogner · 2026-05-18T18:58:52Z

Quick nudge — the chronic Python - latest versions failure that was hitting every open PR was fixed in #15 (merged earlier today) and the propagated update-branch on this PR confirms CI is otherwise green. Only thing keeping this one from merging now is the ruff/cpplint issue on test_dual.py (and on cuda_rank_objective.cpp for #14):

PR [cuda] enforce max_depth on CUDA tree learner #7: PT006 — parametrize("max_depth,num_leaves", …) wants a tuple.
PR [cuda] fix illegal-memory-access crash in weighted L1 / quantile training #8: I001 — imports unsorted.
PR [cuda] make BitonicArgSort_1024 / _2048 stable on tied values #10: I001 + ruff format reformat.

Single pre-commit run --all-files && git commit --amend --no-edit && git push --force-with-lease should clear all three (and #14 if you do those together).

Let me know when pushed and I'll merge.

…PT006 fix Squashes four local iterations: drop the prediction-parity test and num_leaves parity assertion (keep only the two depth assertions), drop redundant objective=regression (default value), use tuple for parametrize argnames (ruff PT006), and shrink fixture to n=64 / 4 features / min_data_in_leaf=1 — cuts runtime ~6x. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

# Conflicts: # tests/python_package_test/test_dual.py

Aligns with the existing convention used by test_engine.py's CUDA-only tests (getenv("TASK", "") != "cuda"). Addresses Felix's review note on PR #8 (and the matching note on #6, #7, #9, #10). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

maxwbuckley and others added 2 commits May 10, 2026 01:44

BelixRogner mentioned this pull request May 10, 2026

[cuda] fix unweighted percentile formula (L1 & quantile leaf renewal) #6

Merged

4 tasks

This was referenced May 10, 2026

[cuda] fix illegal-memory-access crash in weighted L1 / quantile training #8

Merged

[cuda] make BitonicArgSort_1024 / _2048 stable on tied values #10

Open

BelixRogner and others added 3 commits May 10, 2026 14:33

Merge branch 'master' into cuda/max-depth-enforcement

f526da8

lint: ruff format

2ab1f9c

Merge branch 'master' into cuda/max-depth-enforcement

f784670

maxwbuckley and others added 2 commits May 18, 2026 22:22

Merge remote-tracking branch 'belix/master' into manage/max-depth-fix

46ec2b8

# Conflicts: # tests/python_package_test/test_dual.py

BelixRogner merged commit 7d7dd48 into BelixRogner:master May 19, 2026
58 of 60 checks passed

This was referenced May 19, 2026

[cuda] tolerance-based gain tie-break in best-split reductions #13

Open

[cuda] make LambdaRank gradient kernel bit-deterministic when deterministic=true #14

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[cuda] enforce max_depth on CUDA tree learner#7

[cuda] enforce max_depth on CUDA tree learner#7
BelixRogner merged 8 commits into
BelixRogner:masterfrom
maxwbuckley:cuda/max-depth-enforcement

maxwbuckley commented May 9, 2026

Uh oh!

BelixRogner commented May 10, 2026

Uh oh!

maxwbuckley commented May 10, 2026

Uh oh!

BelixRogner commented May 18, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

maxwbuckley commented May 9, 2026

Summary

Symptom

Why this is a real bug, not FP-precision drift

Fix

Test plan

Uh oh!

BelixRogner commented May 10, 2026

Uh oh!

maxwbuckley commented May 10, 2026

Uh oh!

BelixRogner commented May 18, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants