ci: runtime_env tests exhaust /tmp on CPU runners due to Ray venv cloning #1775

@suiyoubi

Problem

Tests that use per-stage runtime_env (pip/uv package overrides) cannot run on GitHub-hosted CPU runners because they exhaust /tmp disk space.

Root cause: For each unique runtime_env spec, Ray clones the entire .venv into /tmp via shutil.copytree (see ray/_private/runtime_env/_clonevirtualenv.py). Because the venv includes NVIDIA CUDA libraries (libcudnn_engines_precompiled.so.9, libnccl.so.2, etc.), each clone is 700 MB or more. With 3 unique specs the test suite needs at least ~2.1 GB (3 × 700 MB) in /tmp, which exhausts the available space on CPU runners.

Observed error:
```
shutil.Error: [Errno 28] No space left on device:
libcudnn_engines_precompiled.so.9 →
/tmp/pytest-of-runner/pytest-0/ray0/session_.../runtime_resources/uv/.../virtualenv/...
```
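
To check the disk arithmetic from the root cause on a given runner, a small diagnostic along these lines may help. It is a sketch only: the helper names `dir_size_bytes` and `clones_fit` are made up for illustration, the `.venv` location in the usage comment is an assumption, and symlinks are counted as zero bytes on the assumption that the clone preserves them rather than copying their targets.

```python
import os
import shutil
from pathlib import Path


def dir_size_bytes(root: Path) -> int:
    """Sum the sizes of regular files under root (symlinks count as zero)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def clones_fit(venv: Path, n_specs: int, tmp_dir: Path) -> bool:
    """Return True if n_specs full copies of venv fit in tmp_dir's free space."""
    needed = dir_size_bytes(venv) * n_specs
    free = shutil.disk_usage(tmp_dir).free
    return needed <= free


# Hypothetical usage on a runner (paths are assumptions):
#   clones_fit(Path(".venv"), 3, Path("/tmp"))
```

Running this once against /tmp and once against the workspace partition would show whether moving the temp dir is enough on its own.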

Workaround: The affected tests (tests/pipelines/test_per_stage_runtime_env.py, tests/pipelines/test_runtime_env_advanced.py) are currently marked @pytest.mark.gpu so they run only on GPU runners, which have more available disk. See PR #1623.

Fix options

  1. Point Ray's temp dir to the workspace — pass --basetemp=$GITHUB_WORKSPACE/pytest-tmp to pytest so tmp_path_factory.mktemp("ray") (used as ray start --temp-dir) lands on the larger /home/runner/work partition instead of /tmp.

  2. Exclude NVIDIA libs from the venv clone — Ray's _clonevirtualenv.py uses shutil.copytree(..., ignore=shutil.ignore_patterns("*.pyc")). Patching this (or the virtualenv_utils.py caller) to also ignore nvidia/ packages would eliminate the bulk of the clone size.

  3. Use a CPU-only venv for CI — exclude nvidia-* packages from the install. Bigger change but permanently avoids the issue.
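
As a rough illustration of option 2, the ignore rule could look like the standalone sketch below. This is not Ray's code: `ignore_pyc_and_nvidia` and `clone_venv_without_nvidia` are hypothetical names, and the sketch assumes the CUDA libraries live in a top-level `nvidia` directory under `site-packages`. A clone made this way cannot load those libraries, so it would only suit tests that never touch CUDA.

```python
import shutil
from pathlib import Path


def ignore_pyc_and_nvidia(directory: str, names: list[str]) -> set[str]:
    """copytree ignore callback: skip *.pyc files and site-packages/nvidia."""
    # Reuse the stock pattern matcher for *.pyc files.
    ignored = set(shutil.ignore_patterns("*.pyc")(directory, names))
    # Only drop the `nvidia` directory that sits directly in site-packages
    # (assumed layout); same-named directories elsewhere are left alone.
    if Path(directory).name == "site-packages" and "nvidia" in names:
        ignored.add("nvidia")
    return ignored


def clone_venv_without_nvidia(src: Path, dst: Path) -> None:
    """Copy a venv tree, leaving out bytecode and the NVIDIA library directory."""
    shutil.copytree(src, dst, symlinks=True, ignore=ignore_pyc_and_nvidia)
```

The `nvidia_*.dist-info` metadata directories are deliberately kept, so package listings in the clone would still report the NVIDIA wheels as installed even though their files are missing.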

Option 1 is the least invasive and can be done entirely in .github/workflows/cicd-main.yml.
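
For reference, option 1 amounts to adding one flag to the pytest invocation. The sketch below builds that command in Python; `pytest_command` is a hypothetical helper, and the same `--basetemp` flag can be written directly into the workflow step instead. Note that pytest clears the `--basetemp` directory at the start of each run, so it should point at a dedicated directory such as `pytest-tmp`.

```python
import os
import sys
from pathlib import Path


def pytest_command(test_paths, env=None):
    """Build a pytest command line that keeps temp dirs inside the workspace.

    When GITHUB_WORKSPACE is set, --basetemp points at <workspace>/pytest-tmp,
    so directories created by tmp_path_factory land on that partition.
    """
    env = os.environ if env is None else env
    cmd = [sys.executable, "-m", "pytest"]
    workspace = env.get("GITHUB_WORKSPACE")
    if workspace:
        cmd.append(f"--basetemp={Path(workspace) / 'pytest-tmp'}")
    cmd.extend(test_paths)
    return cmd


# Hypothetical usage:
#   import subprocess
#   subprocess.run(
#       pytest_command(["tests/pipelines/test_per_stage_runtime_env.py"]),
#       check=True,
#   )
```

Outside GitHub Actions the variable is unset and the command falls back to pytest's default temp location, so local runs are unaffected.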

Acceptance criteria

  • tests/pipelines/test_per_stage_runtime_env.py and tests/pipelines/test_runtime_env_advanced.py pass in CPU CI
  • @pytest.mark.gpu markers removed from those files
