Skip to content

[cuda] enforce max_depth on CUDA tree learner#7

Merged
BelixRogner merged 8 commits into
BelixRogner:masterfrom
maxwbuckley:cuda/max-depth-enforcement
May 19, 2026
Merged

[cuda] enforce max_depth on CUDA tree learner#7
BelixRogner merged 8 commits into
BelixRogner:masterfrom
maxwbuckley:cuda/max-depth-enforcement

Conversation

@maxwbuckley
Copy link
Copy Markdown

Summary

Two related bugs caused the CUDA tree learner to silently ignore the max_depth parameter:

  1. CUDABestSplitFinder::FindBestSplitsForLeaf had no max_depth check. CPU's SerialTreeLearner::BeforeFindBestSplit (src/treelearner/serial_tree_learner.cpp:347-355) invalidates a leaf's gain when its depth has reached config_->max_depth, but the CUDA path never did the equivalent.

  2. CUDATree::Split / SplitCategorical updated the GPU-side cuda_leaf_depth_ (via the split kernel) but never updated the host-side leaf_depth_ vector. So tree->leaf_depth(idx) always returned 0 on CUDA. Without fixing (2), even adding the check at (1) would have done nothing.

Symptom

With max_depth=2, varying num_leaves:

num_leaves CPU (correct) CUDA before fix
4 depth=2, leaves=4 depth=2, leaves=4
7 depth=2, leaves=4 depth=3, leaves=7
15 depth=2, leaves=4 depth=5, leaves=15
31 depth=2, leaves=4 depth=7, leaves=31

After the fix, CUDA caps at the requested depth (2) for every num_leaves.

Why this is a real bug, not FP-precision drift

This is a missing feature on CUDA, not the kind of "numerical precision differences caused by different degrees of parallelism" that issue lightgbm-org#6055 discusses. CUDA is silently ignoring a documented regularization parameter — any user who set max_depth on CUDA was getting models 2-7x deeper than requested, with substantially weaker regularization.

Fix

  • src/io/cuda/cuda_tree.cpp — mirror the host-side leaf_depth_ update in CUDATree::Split and CUDATree::SplitCategorical (matching CPU Tree::Split in include/LightGBM/tree.h:578-580).
  • src/treelearner/cuda/cuda_best_split_finder.{hpp,cpp} — plumb two new flags smaller_leaf_below_max_depth and larger_leaf_below_max_depth into FindBestSplitsForLeaf and AND them into the is_*_leaf_valid checks.
  • src/treelearner/cuda/cuda_single_gpu_tree_learner.cpp — compute the flags at the call sites: config_->max_depth <= 0 || tree->leaf_depth(idx) < config_->max_depth.

Test plan

  • Verified locally on RTX 5090 / CUDA 13.2: CUDA now caps trees at the requested max_depth for num_leaves from 4 to 31.
  • CPU/CUDA parity sweep — reg_max_depth case (which used max_depth=3 with num_leaves=7) now matches CPU at FP epsilon, down from max|Δ|=0.25 raw_score.
  • Existing tests on CUDA (not yet run by author).

🤖 Generated with Claude Code

maxwbuckley and others added 2 commits May 10, 2026 01:44
Two related bugs caused CUDA to ignore the `max_depth` parameter:

1. CUDABestSplitFinder::FindBestSplitsForLeaf had no max_depth check.
   CPU's SerialTreeLearner::BeforeFindBestSplit invalidates a leaf's
   gain when its depth has reached config_->max_depth, but the CUDA
   path never did the equivalent.

2. CUDATree::Split / SplitCategorical updated the GPU-side
   cuda_leaf_depth_ via the launch kernel but never updated the
   host-side leaf_depth_ vector, so tree->leaf_depth(idx) always
   returned 0 on CUDA. Without (2), even adding the check at (1)
   would have done nothing.

Symptom (max_depth=2, varying num_leaves):

  num_leaves= 4: cpu depth=2 leaves=4 | cuda depth=2 leaves=4
  num_leaves= 7: cpu depth=2 leaves=4 | cuda depth=3 leaves=7
  num_leaves=15: cpu depth=2 leaves=4 | cuda depth=5 leaves=15
  num_leaves=31: cpu depth=2 leaves=4 | cuda depth=7 leaves=31

After fix, CUDA caps at the requested depth (2) for every num_leaves.

Fix:

* Mirror the host-side leaf_depth_ update in CUDATree::Split and
  CUDATree::SplitCategorical (matching CPU Tree::Split's behavior in
  include/LightGBM/tree.h).
* Plumb a `smaller_leaf_below_max_depth` / `larger_leaf_below_max_depth`
  flag pair into FindBestSplitsForLeaf and AND them into the
  is_*_leaf_valid checks. The caller in
  cuda_single_gpu_tree_learner.cpp computes them as
  `config_->max_depth <= 0 || tree->leaf_depth(idx) < config_->max_depth`.

Verified with the cpu/cuda parity sweep: reg_max_depth case (which used
max_depth=3 with num_leaves=7) now matches CPU at FP epsilon, down from
max|Δ|=0.25 raw_score.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Regression coverage for the prior commit (CUDA tree learner now
enforces max_depth). Two parametrized tests, gated on
LIGHTGBM_TEST_CUDA=1:

- test_cuda_respects_max_depth: across (max_depth, num_leaves)
  combinations from {1,2,3,5} x {2,4,7,31}, asserts CUDA tree depth
  is at most max_depth and matches CPU depth + leaf count exactly.
- test_cuda_max_depth_matches_cpu_predictions: end-to-end check
  that 5 boosting rounds with max_depth=3 produce CPU/CUDA
  predictions matching at FP epsilon. Without the fix, this
  diverged by max|Δ|=0.47.

Verified: with the prior commit reverted, 5 of 9 cases fail
(those where num_leaves > 2^max_depth, i.e. where the bug actually
triggered). With the fix applied, all 9 pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@BelixRogner
Copy link
Copy Markdown
Owner

Thanks Max, and Claude Code. The symptom table (max_depth=2 producing depth-7 trees) is the kind of regression that should never have shipped — good catch. Fix is well-scoped: mirror CPU's Tree::Split for host-side leaf_depth_, plumb two flags, AND them into the existing validity check. Eight parametrized depth-cap tests plus an end-to-end raw-score equality check is solid.

Two small things before merge:

  1. Env var convention — same as on PRs [cuda] fix unweighted percentile formula (L1 & quantile leaf renewal) #6, [cuda] fix illegal-memory-access crash in weighted L1 / quantile training #8, [cuda] fix unweighted percentile formula in PercentileGlobalKernel (init scores) #9, [cuda] make BitonicArgSort_1024 / _2048 stable on tied values #10: align LIGHTGBM_TEST_CUDA to the existing getenv("TASK", "") != "cuda" pattern.
  2. Existing CUDA test suite — your test plan checkbox for "Existing tests on CUDA" is unchecked. Since max_depth enforcement touches every FindBestSplitsForLeaf call, can you run the suite under TASK=cuda and confirm green before merge? Environmental skips like graphviz / the device_type=cpu-baked-in test you flagged on PR [cuda] honor min_data_per_group in categorical split kernels #4 are fine to enumerate and ignore.

Aligns with the existing convention used by test_engine.py's CUDA-only
tests. Addresses Felix's review note (same change going on lightgbm-org#6/lightgbm-org#8/lightgbm-org#9/lightgbm-org#10).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@maxwbuckley
Copy link
Copy Markdown
Author

Both addressed.

1. Env var aligned to TASK=cuda — pushed as eaa55986.

2. Full Python test suite on CUDA — built the PR branch on RTX 5090 + CUDA 13.2 (sm_120) and ran:

TASK=cuda pytest tests/python_package_test/ --ignore=tests/python_package_test/test_dask.py
5 failed, 1111 passed, 21 skipped, 8 xfailed, 4 xpassed in 66.25s

The 5 failures are all environmental, none caused by this PR — same set as on #4:

  • test_engine.py::test_all_expected_params_are_written_out_to_model_text — asserts the trained model text contains [device_type: cuda] when TASK=cuda, but the test trains with no device_type param and a CUDA build still defaults to device_type=cpu. Test body is byte-identical to upstream master, so this fails the same way against origin/master — unrelated to this PR.
  • test_plotting.py × 4 — graphviz.backend.execute.ExecutableNotFound: failed to execute PosixPath('dot'). The dot binary isn't installed on this machine; apt install graphviz away.

The new max_depth tests all pass:

tests/python_package_test/test_dual.py::test_cuda_respects_max_depth[1-2] PASSED
tests/python_package_test/test_dual.py::test_cuda_respects_max_depth[1-7] PASSED
tests/python_package_test/test_dual.py::test_cuda_respects_max_depth[2-4] PASSED
tests/python_package_test/test_dual.py::test_cuda_respects_max_depth[2-7] PASSED
tests/python_package_test/test_dual.py::test_cuda_respects_max_depth[2-31] PASSED
tests/python_package_test/test_dual.py::test_cuda_respects_max_depth[3-7] PASSED
tests/python_package_test/test_dual.py::test_cuda_respects_max_depth[3-31] PASSED
tests/python_package_test/test_dual.py::test_cuda_respects_max_depth[5-31] PASSED
tests/python_package_test/test_dual.py::test_cuda_max_depth_matches_cpu_predictions PASSED
9 passed, 1 skipped in 0.55s

BelixRogner pushed a commit that referenced this pull request May 18, 2026
Aligns with the existing convention used by test_engine.py's CUDA-only
tests. Addresses Felix's review note (same change going on #7/#8/#9/#10).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
BelixRogner pushed a commit that referenced this pull request May 18, 2026
Aligns with the existing convention used by test_engine.py's CUDA-only
tests. Addresses Felix's review note (same change going on #6/#7/#8/#10).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@BelixRogner
Copy link
Copy Markdown
Owner

Quick nudge — the chronic Python - latest versions failure that was hitting every open PR was fixed in #15 (merged earlier today) and the propagated update-branch on this PR confirms CI is otherwise green. Only thing keeping this one from merging now is the ruff/cpplint issue on test_dual.py (and on cuda_rank_objective.cpp for #14):

Single pre-commit run --all-files && git commit --amend --no-edit && git push --force-with-lease should clear all three (and #14 if you do those together).

Let me know when pushed and I'll merge.

maxwbuckley and others added 2 commits May 18, 2026 22:22
…PT006 fix

Squashes four local iterations: drop the prediction-parity test and
num_leaves parity assertion (keep only the two depth assertions), drop
redundant objective=regression (default value), use tuple for
parametrize argnames (ruff PT006), and shrink fixture to n=64 / 4
features / min_data_in_leaf=1 — cuts runtime ~6x.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Conflicts:
#	tests/python_package_test/test_dual.py
BelixRogner pushed a commit that referenced this pull request May 19, 2026
Aligns with the existing convention used by test_engine.py's CUDA-only
tests (getenv("TASK", "") != "cuda"). Addresses Felix's review note on
PR #8 (and the matching note on #6, #7, #9, #10).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@BelixRogner BelixRogner merged commit 7d7dd48 into BelixRogner:master May 19, 2026
58 of 60 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants