[XGBoost] Gamma testing #6039
Open
Jyothirmaikottu wants to merge 27 commits into
Jyothirmaikottu
commented
May 8, 2026
 def test_gpu_single_instance(self, image_uri, role):
-    hp = {**BASE_HP, "tree_method": "gpu_hist"}
+    hp = {**BASE_HP, "tree_method": "hist"}
XGBoost 3.2.0 removed gpu_hist as a tree method entirely. Tree method and device selection are now decoupled: hist is the algorithm, and device picks where it runs.
Jyothirmaikottu
commented
May 11, 2026
 def gpu_trained_model(image_uri, role):
     """Train a GPU model once for GPU e2e tests."""
-    hp = {**E2E_HP, "tree_method": "gpu_hist"}
+    hp = {**E2E_HP, "tree_method": "hist"}
XGBoost 3.2.0 removed gpu_hist as a tree method entirely. Tree method and device selection are now decoupled: hist is the algorithm, and device picks where it runs.
Same Flask conflict fix as PR workflow — sagemaker-containers pins flask==1.1.1 but we need Flask==3.1.3.
The _wait_healthy() method only caught ConnectionError, so a ReadTimeout on the first /ping poll escaped the retry loop and failed the test immediately instead of retrying for 120s.
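A minimal sketch of the retry behaviour described above, not the actual `_wait_healthy()` from this PR. The function name `wait_healthy`, the injected `ping` callable, and the `retry_on`, `sleep`, and `clock` parameters are all hypothetical; the point is only that every "server not ready yet" exception has to be listed, or the first poll can abort the whole wait.

```python
import time


def wait_healthy(ping, timeout_s=120.0, interval_s=2.0,
                 retry_on=(ConnectionError, TimeoutError),
                 sleep=time.sleep, clock=time.monotonic):
    """Poll ping() until it returns True or timeout_s elapses.

    Exceptions listed in retry_on count as "not ready yet" and are retried.
    Any other exception propagates immediately.
    """
    deadline = clock() + timeout_s
    while True:
        try:
            if ping():
                return True
        except retry_on:
            # Server still starting: a failed poll, not a failed test.
            pass
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

If the poll is made with `requests`, the tuple would need to be `(requests.exceptions.ConnectionError, requests.exceptions.ReadTimeout)`, since those classes do not derive from the built-in `ConnectionError` and `TimeoutError` used as defaults here.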
XGBoost 3.2.0 removed the 'gpu_hist' tree method. GPU training now
uses 'hist' with 'device': 'cuda'. Valid tree methods are:
{'approx', 'auto', 'exact', 'hist'}.
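As a rough illustration of the migration this commit message describes, the sketch below rewrites a legacy hyperparameter dict. The helper name `migrate_tree_method` is hypothetical and not part of XGBoost or this repository; the set of valid tree methods is copied from the message above. It applies to code that passes parameters to XGBoost directly; later commits in this PR drop 'device' again for algorithm-mode jobs.

```python
# Valid tree methods as listed in the commit message above.
VALID_TREE_METHODS = {"approx", "auto", "exact", "hist"}


def migrate_tree_method(hp):
    """Return a copy of hp with the legacy 'gpu_hist' spelling replaced.

    'gpu_hist' becomes tree_method='hist' plus device='cuda'. An explicit
    'device' already present in hp is left untouched.
    """
    out = dict(hp)
    if out.get("tree_method") == "gpu_hist":
        out["tree_method"] = "hist"
        out.setdefault("device", "cuda")
    method = out.get("tree_method", "auto")
    if method not in VALID_TREE_METHODS:
        raise ValueError(f"unsupported tree_method: {method!r}")
    return out
```

Returning a copy keeps shared base dicts such as BASE_HP or E2E_HP unmodified between tests.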
sagemaker_containers runs 'pip install .' without --no-build-isolation, so pip tries to fetch setuptools from PyPI, which fails under network isolation. This is a container-level issue, not a test bug.
nvidia/cuda:12.9.1-base only includes driver stubs. XGBoost GPU needs libcudart.so from the runtime image.
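One quick way to check for this inside a built image is to ask the dynamic linker whether it can locate the CUDA runtime. The sketch below is an assumption about how such a check could look, not something taken from this PR; `cuda_runtime_visible` is a hypothetical name, and `ctypes.util.find_library` searches differently per platform, so a negative result is a hint rather than proof.

```python
import ctypes.util


def cuda_runtime_visible(find_library=ctypes.util.find_library):
    """Return True if the linker search can locate libcudart.

    find_library is injectable so the check can be exercised without a GPU
    image; by default it uses the standard-library lookup.
    """
    return find_library("cudart") is not None


if __name__ == "__main__":
    print("libcudart found:", cuda_runtime_visible())
```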
Bumps cache-bust to pick up sagemaker-xgboost-container fix that adds 'device' to the algorithm_toolkit hyperparameter whitelist. Without this, GPU training jobs fail with 'Extraneous hyperparameter found: device'.
- Pipe mode intentionally unsupported (MLIO removed, SageMaker deprecated it)
- Sparse protobuf fails with scipy 1.15 vstack on zero-feature records
…distributed
- Remove 'device': 'cuda' from all e2e tests: algorithm mode rejects unknown HPs; container auto-detects GPU via SM_NUM_GPUS
- Mark pipe mode tests as xfail (MLIO removed, pipe mode unsupported)
- Mark container distributed tests as xfail (Rabit protocol changed)
- Remove csv-pipe from benchmark parametrize
- Fix generate_models workflow to use xgboost==3.2.0
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove 'device: cuda' from all algorithm-mode GPU e2e tests (container rejects it as extraneous HP; GPU auto-detected via SM_NUM_GPUS)
- Remove csv-pipe from benchmark parametrize (pipe mode removed)
- Dockerfile: use prebuilt wheel from CI artifact instead of cloning repo every build. Fallback to clone from XGBOOST_CONTAINER_BRANCH for local builds.
- PR/release workflows: add build-wheel job that clones the container repo, builds the wheel, and passes it to Docker build via GitHub Actions artifacts.
- Add XGBOOST_CONTAINER_BRANCH env for branch testing.
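The container's actual GPU detection code is not shown in this PR, so the following is only a sketch of what "auto-detected via SM_NUM_GPUS" could mean in practice. Both helper names are hypothetical; the only thing taken from the commit messages is that the SM_NUM_GPUS environment variable drives the decision and that 'device' must not be sent as a hyperparameter in algorithm mode.

```python
import os


def detect_device(environ=os.environ):
    """Pick 'cuda' when SM_NUM_GPUS reports at least one GPU, else 'cpu'."""
    try:
        num_gpus = int(environ.get("SM_NUM_GPUS", "0"))
    except ValueError:
        # Treat a malformed value as "no GPUs" rather than crashing.
        num_gpus = 0
    return "cuda" if num_gpus > 0 else "cpu"


def strip_device(hp):
    """Return a copy of hp without 'device', for algorithm-mode jobs."""
    return {k: v for k, v in hp.items() if k != "device"}
```

With helpers like these, a test would submit `strip_device(hp)` to the training job and leave device selection to the container.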
…n AL2023 repo yet)
Purpose
Test Plan
Test Result
Toggle if you are merging into master Branch
By default, docker image builds and tests are disabled. Two ways to run builds and tests:
How to use the helper utility for updating dlc_developer_config.toml
Assuming your remote is called origin (you can find out more with git remote -v)...
python src/prepare_dlc_dev_environment.py -b </path/to/buildspec.yml> -cp origin
python src/prepare_dlc_dev_environment.py -b </path/to/buildspec.yml> -t sanity_tests -cp origin
python src/prepare_dlc_dev_environment.py -rcp origin
NOTE: If you are creating a PR for a new framework version, please ensure success of the local, standard, rc, and efa sagemaker tests by updating the dlc_developer_config.toml file:
sagemaker_remote_tests = true
sagemaker_efa_tests = true
sagemaker_rc_tests = true
sagemaker_local_tests = true
How to use PR description
Use the code block below to uncomment commands and run the PR CodeBuild jobs. There are two commands available:
# /buildspec <buildspec_path>
# /buildspec pytorch/training/buildspec.yml
# /tests <test_list>
# /tests sanity security ec2
sanity, security, ec2, ecs, eks, sagemaker, sagemaker-local.
Toggle if you are merging into main Branch
PR Checklist
pre-commit run --all-files locally before creating this PR. (Read DEVELOPMENT.md for details).