
[XGBoost] Gamma testing #6039

Open

Jyothirmaikottu wants to merge 27 commits into main from xgboost-promote-preprod-to-gamma

Conversation

@Jyothirmaikottu
Contributor

Purpose

Test Plan

Test Result


Toggle if you are merging into master Branch

By default, docker image builds and tests are disabled. Two ways to run builds and tests:

  1. Using dlc_developer_config.toml
  2. Using this PR description (currently only supported for PyTorch, TensorFlow, vllm, and base images)
How to use the helper utility for updating dlc_developer_config.toml

Assuming your remote is called origin (you can find out more with git remote -v)...

  • Run default builds and tests for a particular buildspec - also commits and pushes changes to remote; Example:

python src/prepare_dlc_dev_environment.py -b </path/to/buildspec.yml> -cp origin

  • Enable specific tests for a buildspec or set of buildspecs - also commits and pushes changes to remote; Example:

python src/prepare_dlc_dev_environment.py -b </path/to/buildspec.yml> -t sanity_tests -cp origin

  • Restore TOML file when ready to merge

python src/prepare_dlc_dev_environment.py -rcp origin

NOTE: If you are creating a PR for a new framework version, please ensure success of the local, standard, rc, and efa sagemaker tests by updating the dlc_developer_config.toml file:

  • sagemaker_remote_tests = true
  • sagemaker_efa_tests = true
  • sagemaker_rc_tests = true
  • sagemaker_local_tests = true
How to use PR description

Use the code block below to uncomment commands and run the PR CodeBuild jobs. There are two commands available:
  • # /buildspec <buildspec_path>
    • e.g.: # /buildspec pytorch/training/buildspec.yml
    • If this line is commented out, dlc_developer_config.toml will be used.
  • # /tests <test_list>
    • e.g.: # /tests sanity security ec2
    • If this line is commented out, it will run the default set of tests (same as the defaults in dlc_developer_config.toml): sanity, security, ec2, ecs, eks, sagemaker, sagemaker-local.
# /buildspec <buildspec_path>
# /tests <test_list>
Toggle if you are merging into main Branch

PR Checklist

  • [ ] I ran pre-commit run --all-files locally before creating this PR. (Read DEVELOPMENT.md for details.)

@Jyothirmaikottu Jyothirmaikottu force-pushed the xgboost-promote-preprod-to-gamma branch 3 times, most recently from 9085206 to 151218a Compare May 8, 2026 17:24

  def test_gpu_single_instance(self, image_uri, role):
-     hp = {**BASE_HP, "tree_method": "gpu_hist"}
+     hp = {**BASE_HP, "tree_method": "hist"}
Contributor Author


XGBoost 3.2.0 removed gpu_hist as a tree method entirely. The tree method and device selection are now decoupled. hist is the algorithm, device picks where it runs.
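The decoupling can be sketched with a small migration helper. This is not code from the PR, just an illustration of the parameter rewrite it performs: the old `gpu_hist` alias becomes `tree_method="hist"` plus `device="cuda"`.

```python
def migrate_tree_method(hp: dict) -> dict:
    """Rewrite pre-3.x GPU hyperparameters for the new XGBoost semantics.

    'gpu_hist' was shorthand for running the 'hist' algorithm on a GPU;
    newer releases split that into tree_method='hist' plus device='cuda'.
    """
    hp = dict(hp)  # don't mutate the caller's dict
    if hp.get("tree_method") == "gpu_hist":
        hp["tree_method"] = "hist"
        hp["device"] = "cuda"
    return hp
```

A dict like `{"tree_method": "gpu_hist", "eta": 0.1}` becomes `{"tree_method": "hist", "device": "cuda", "eta": 0.1}`; anything already using `hist` passes through unchanged.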

@Jyothirmaikottu Jyothirmaikottu force-pushed the xgboost-promote-preprod-to-gamma branch 2 times, most recently from c5ebbe1 to 848ed49 Compare May 11, 2026 20:42
  def gpu_trained_model(image_uri, role):
      """Train a GPU model once for GPU e2e tests."""
-     hp = {**E2E_HP, "tree_method": "gpu_hist"}
+     hp = {**E2E_HP, "tree_method": "hist"}
Contributor Author


XGBoost 3.2.0 removed gpu_hist as a tree method entirely. The tree method and device selection are now decoupled. hist is the algorithm, device picks where it runs.

Jyothirmaikottu and others added 22 commits May 11, 2026 22:02
Same Flask conflict fix as PR workflow — sagemaker-containers pins
flask==1.1.1 but we need Flask==3.1.3.
The _wait_healthy() method only caught ConnectionError, so a
ReadTimeout on the first /ping poll escaped the retry loop and
failed the test immediately instead of retrying for 120s.
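The fix described above amounts to broadening the exception tuple in the poll loop. A minimal, dependency-free sketch (with `requests`, the tuple would be `(requests.exceptions.ConnectionError, requests.exceptions.ReadTimeout)`; the built-in `ConnectionError` and `TimeoutError` stand in for them here, and `ping` is an injected callable rather than a real HTTP call):

```python
import time

# Exceptions worth retrying while the endpoint warms up.
RETRYABLE = (ConnectionError, TimeoutError)

def wait_healthy(ping, timeout_s=120, interval_s=1.0, sleep=time.sleep):
    """Poll `ping` until it returns True or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            if ping():
                return True
        except RETRYABLE:
            pass  # keep polling: refused connections AND slow first responses
        sleep(interval_s)
    return False
```

Catching only `ConnectionError` here reproduces the bug: the first slow `/ping` response escapes the loop instead of being retried for the full window.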
XGBoost 3.2.0 removed the 'gpu_hist' tree method. GPU training now
uses 'hist' with 'device': 'cuda'. Valid tree methods are:
{'approx', 'auto', 'exact', 'hist'}.
sagemaker_containers runs 'pip install .' without --no-build-isolation,
so pip tries to fetch setuptools from PyPI which fails under network
isolation. This is a container-level issue, not a test bug.
nvidia/cuda:12.9.1-base only includes driver stubs. XGBoost GPU
needs libcudart.so from the runtime image.
Bumps cache-bust to pick up sagemaker-xgboost-container fix that adds
'device' to the algorithm_toolkit hyperparameter whitelist. Without this,
GPU training jobs fail with 'Extraneous hyperparameter found: device'.
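The failure mode is straightforward whitelist validation. The names below are hypothetical (the real container keeps its whitelist in sagemaker-xgboost-container); the sketch only shows why an unlisted key like `device` aborts the job until it is added:

```python
# Hypothetical whitelist; the fix referenced above adds "device" to the
# real container's equivalent list.
ALGORITHM_HP_WHITELIST = {"eta", "max_depth", "num_round", "tree_method", "device"}

def validate_hyperparameters(hp: dict) -> None:
    """Reject any hyperparameter the algorithm container does not know."""
    extraneous = set(hp) - ALGORITHM_HP_WHITELIST
    if extraneous:
        raise ValueError(
            "Extraneous hyperparameter found: " + ", ".join(sorted(extraneous))
        )
```

Before the whitelist fix, `{"device": "cuda"}` would land in `extraneous` and raise exactly this kind of error.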
- Pipe mode intentionally unsupported (MLIO removed, SageMaker deprecated it)
- Sparse protobuf fails with scipy 1.15 vstack on zero-feature records
…distributed

- Remove 'device': 'cuda' from all e2e tests — algorithm mode rejects
  unknown HPs; container auto-detects GPU via SM_NUM_GPUS
- Mark pipe mode tests as xfail (MLIO removed, pipe mode unsupported)
- Mark container distributed tests as xfail (Rabit protocol changed)
- Remove csv-pipe from benchmark parametrize
- Fix generate_models workflow to use xgboost==3.2.0
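The GPU auto-detection mentioned in the first bullet can be sketched as a one-liner. Assumption: SageMaker exposes the GPU count to the container via the `SM_NUM_GPUS` environment variable (as the bullet states), so algorithm mode can pick the device itself instead of accepting a user-supplied `device` hyperparameter:

```python
import os

def resolve_device() -> str:
    """Pick the XGBoost device from the container environment.

    Sketch only: reads SM_NUM_GPUS (set by SageMaker in training
    containers) rather than trusting a 'device' hyperparameter.
    """
    return "cuda" if int(os.environ.get("SM_NUM_GPUS", "0")) > 0 else "cpu"
```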
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove 'device: cuda' from all algorithm-mode GPU e2e tests (container
  rejects it as extraneous HP; GPU auto-detected via SM_NUM_GPUS)
- Remove csv-pipe from benchmark parametrize (pipe mode removed)
- Dockerfile: use prebuilt wheel from CI artifact instead of cloning repo
  every build. Fallback to clone from XGBOOST_CONTAINER_BRANCH for local builds.
- PR/release workflows: add build-wheel job that clones the container repo,
  builds the wheel, and passes it to Docker build via GitHub Actions artifacts.
- Add XGBOOST_CONTAINER_BRANCH env for branch testing.
@Jyothirmaikottu Jyothirmaikottu force-pushed the xgboost-promote-preprod-to-gamma branch from bcdc8f5 to 6b601f3 Compare May 11, 2026 22:02
