[cuda] enforce max_depth on CUDA tree learner#7
Conversation
Two related bugs caused CUDA to ignore the `max_depth` parameter: 1. CUDABestSplitFinder::FindBestSplitsForLeaf had no max_depth check. CPU's SerialTreeLearner::BeforeFindBestSplit invalidates a leaf's gain when its depth has reached config_->max_depth, but the CUDA path never did the equivalent. 2. CUDATree::Split / SplitCategorical updated the GPU-side cuda_leaf_depth_ via the launch kernel but never updated the host-side leaf_depth_ vector, so tree->leaf_depth(idx) always returned 0 on CUDA. Without (2), even adding the check at (1) would have done nothing. Symptom (max_depth=2, varying num_leaves): num_leaves= 4: cpu depth=2 leaves=4 | cuda depth=2 leaves=4 num_leaves= 7: cpu depth=2 leaves=4 | cuda depth=3 leaves=7 num_leaves=15: cpu depth=2 leaves=4 | cuda depth=5 leaves=15 num_leaves=31: cpu depth=2 leaves=4 | cuda depth=7 leaves=31 After fix, CUDA caps at the requested depth (2) for every num_leaves. Fix: * Mirror the host-side leaf_depth_ update in CUDATree::Split and CUDATree::SplitCategorical (matching CPU Tree::Split's behavior in include/LightGBM/tree.h). * Plumb a `smaller_leaf_below_max_depth` / `larger_leaf_below_max_depth` flag pair into FindBestSplitsForLeaf and AND them into the is_*_leaf_valid checks. The caller in cuda_single_gpu_tree_learner.cpp computes them as `config_->max_depth <= 0 || tree->leaf_depth(idx) < config_->max_depth`. Verified with the cpu/cuda parity sweep: reg_max_depth case (which used max_depth=3 with num_leaves=7) now matches CPU at FP epsilon, down from max|Δ|=0.25 raw_score. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Regression coverage for the prior commit (CUDA tree learner now
enforces max_depth). Two parametrized tests, gated on
LIGHTGBM_TEST_CUDA=1:
- test_cuda_respects_max_depth: across (max_depth, num_leaves)
combinations from {1,2,3,5} x {2,4,7,31}, asserts CUDA tree depth
is at most max_depth and matches CPU depth + leaf count exactly.
- test_cuda_max_depth_matches_cpu_predictions: end-to-end check
that 5 boosting rounds with max_depth=3 produce CPU/CUDA
predictions matching at FP epsilon. Without the fix, this
diverged by max|Δ|=0.47.
Verified: with the prior commit reverted, 5 of 9 cases fail
(those where num_leaves > 2^max_depth, i.e. where the bug actually
triggered). With the fix applied, all 9 pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Thanks Max, and Claude Code. The symptom table ( Two small things before merge:
|
Aligns with the existing convention used by test_engine.py's CUDA-only tests. Addresses Felix's review note (same change going on lightgbm-org#6/lightgbm-org#8/lightgbm-org#9/lightgbm-org#10). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Both addressed. 1. Env var aligned to 2. Full Python test suite on CUDA — built the PR branch on RTX 5090 + CUDA 13.2 (sm_120) and ran: The 5 failures are all environmental, none caused by this PR — same set as on #4:
The new max_depth tests all pass: |
|
Quick nudge — the chronic
Single Let me know when pushed and I'll merge. |
…PT006 fix Squashes four local iterations: drop the prediction-parity test and num_leaves parity assertion (keep only the two depth assertions), drop redundant objective=regression (default value), use tuple for parametrize argnames (ruff PT006), and shrink fixture to n=64 / 4 features / min_data_in_leaf=1 — cuts runtime ~6x. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Conflicts: # tests/python_package_test/test_dual.py
Summary
Two related bugs caused the CUDA tree learner to silently ignore the
max_depthparameter:CUDABestSplitFinder::FindBestSplitsForLeafhad no max_depth check. CPU'sSerialTreeLearner::BeforeFindBestSplit(src/treelearner/serial_tree_learner.cpp:347-355) invalidates a leaf's gain when its depth has reachedconfig_->max_depth, but the CUDA path never did the equivalent.CUDATree::Split/SplitCategoricalupdated the GPU-sidecuda_leaf_depth_(via the split kernel) but never updated the host-sideleaf_depth_vector. Sotree->leaf_depth(idx)always returned 0 on CUDA. Without fixing (2), even adding the check at (1) would have done nothing.Symptom
With
max_depth=2, varyingnum_leaves:num_leavesAfter the fix, CUDA caps at the requested depth (2) for every
num_leaves.Why this is a real bug, not FP-precision drift
This is a missing feature on CUDA, not the kind of "numerical precision differences caused by different degrees of parallelism" that issue lightgbm-org#6055 discusses. CUDA is silently ignoring a documented regularization parameter — any user who set
max_depthon CUDA was getting models 2-7x deeper than requested, with substantially weaker regularization.Fix
src/io/cuda/cuda_tree.cpp— mirror the host-sideleaf_depth_update inCUDATree::SplitandCUDATree::SplitCategorical(matching CPUTree::Splitininclude/LightGBM/tree.h:578-580).src/treelearner/cuda/cuda_best_split_finder.{hpp,cpp}— plumb two new flagssmaller_leaf_below_max_depthandlarger_leaf_below_max_depthintoFindBestSplitsForLeafand AND them into theis_*_leaf_validchecks.src/treelearner/cuda/cuda_single_gpu_tree_learner.cpp— compute the flags at the call sites:config_->max_depth <= 0 || tree->leaf_depth(idx) < config_->max_depth.Test plan
max_depthfornum_leavesfrom 4 to 31.reg_max_depthcase (which usedmax_depth=3withnum_leaves=7) now matches CPU at FP epsilon, down frommax|Δ|=0.25raw_score.🤖 Generated with Claude Code