CI: add standalone ATOM integration workflow #428
Conversation
Add a dedicated FlyDSL workflow to validate ATOM against the latest atom-dev image with aiter reinstalled from main and the current FlyDSL commit. This keeps downstream accuracy coverage separate from the existing FlyDSL CI while exercising the same DeepSeek-R1 and GPT-OSS models used by aiter CI.
Pull request overview
This PR adds a dedicated GitHub Actions workflow to run ATOM downstream accuracy coverage inside rocm/atom-dev:latest, plus a helper script to launch/stop the ATOM OpenAI-compatible server and run lm_eval against it.
Changes:
- Add `scripts/run_atom_downstream.sh` to launch/warm up/stop the ATOM server and run a GSM8K accuracy check via `lm_eval`.
- Add `.github/workflows/flydsl-atom-integration.yaml` to run ATOM accuracy on two models, cache an MLIR install tarball, and upload logs/results as artifacts.
- Reinstall `aiter` from `main` and install FlyDSL from the checked-out PR commit inside the container before running validation.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| `scripts/run_atom_downstream.sh` | Adds server lifecycle + accuracy evaluation script used by the workflow. |
| `.github/workflows/flydsl-atom-integration.yaml` | New standalone CI workflow to run ATOM accuracy coverage in a ROCm container, with caching + artifacts. |
```bash
  bash scripts/build_llvm.sh
  ls -la /workspace/llvm-project/build-flydsl/mlir_install/lib/cmake/mlir
"
docker cp "${CONTAINER_NAME}:/workspace/llvm-project/build-flydsl/mlir_install.tgz" ./mlir_install.tgz
```
This `docker cp` path assumes `scripts/build_llvm.sh` produced `/workspace/llvm-project/build-flydsl/mlir_install.tgz`, but the script's default output is `/workspace/llvm-project/mlir_install.tgz` (and the build dir is `buildmlir`). With the current paths, the tarball copy (and therefore caching) will fail.
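One way to make the copy robust is to package the tarball from whichever install location actually exists before the `docker cp`. A minimal sketch, assuming the two candidate paths mentioned in this thread (verify them against the real `build_llvm.sh` before relying on this):

```bash
# Hedged sketch: package the MLIR install from whichever of the two candidate
# locations build_llvm.sh actually produced, at the path `docker cp` expects.
# Both paths are taken from this review thread, not verified script defaults.
for d in /workspace/llvm-project/mlir_install \
         /workspace/llvm-project/build-flydsl/mlir_install; do
  if [ -d "${d}/lib/cmake/mlir" ]; then
    mkdir -p /workspace/llvm-project/build-flydsl
    tar -czf /workspace/llvm-project/build-flydsl/mlir_install.tgz \
      -C "$(dirname "${d}")" mlir_install
    break
  fi
done
```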
```bash
echo '=== Installing FlyDSL from the checked out PR commit ==='
pip uninstall -y flydsl || true
cd /workspace/flydsl-test
```
`pip install -e .` will invoke `flir/build.sh` (via `setup.py`) if the embedded MLIR runtime is missing, and `flir/build.sh` fails unless `MLIR_PATH` is set or `${REPO_ROOT}/llvm-project/mlir_install` exists. In this workflow, LLVM/MLIR is built/restored under `/workspace/llvm-project/...`, so you likely need to export `MLIR_PATH` (or build into the repo-local `flydsl-test/llvm-project/`) before installing FlyDSL.
Suggested change:

```diff
 cd /workspace/flydsl-test
+if [ -d /workspace/llvm-project/mlir_install ]; then
+  export MLIR_PATH=/workspace/llvm-project/mlir_install
+elif [ -d /workspace/flydsl-test/llvm-project/mlir_install ]; then
+  export MLIR_PATH=/workspace/flydsl-test/llvm-project/mlir_install
+else
+  echo 'ERROR: MLIR install not found. Expected /workspace/llvm-project/mlir_install or /workspace/flydsl-test/llvm-project/mlir_install.' >&2
+  exit 1
+fi
+echo "Using MLIR_PATH=${MLIR_PATH}"
```
```bash
if curl -sf "${ATOM_BASE_URL}/health" -o /dev/null; then
  echo "ATOM server HTTP endpoint is ready."
  server_up=true
  break
fi

echo "Waiting for ATOM server... (${i}/${max_retries})"
sleep "${retry_interval}"
```
`curl` in the readiness loop has no connect/overall timeout. If the TCP connect hangs (or the localhost DNS/network stack is wedged), the script can block indefinitely inside `wait_for_server` rather than retrying. Add `--connect-timeout`/`--max-time` so each attempt fails fast and the retry logic works as intended.
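A minimal sketch of the probe with per-attempt timeouts, reusing the variable names from the script above (`ATOM_BASE_URL`, `max_retries`, `retry_interval`):

```bash
# Hedged sketch: --connect-timeout bounds the TCP connect, --max-time bounds
# the whole request, so a hung attempt fails fast and the loop keeps retrying.
server_up=false
for i in $(seq 1 "${max_retries}"); do
  if curl -sf --connect-timeout 5 --max-time 10 "${ATOM_BASE_URL}/health" -o /dev/null; then
    echo "ATOM server HTTP endpoint is ready."
    server_up=true
    break
  fi
  echo "Waiting for ATOM server... (${i}/${max_retries})"
  sleep "${retry_interval}"
done
```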
```bash
lm_eval \
  --model local-completions \
  --model_args model="${MODEL_PATH}",base_url="${ATOM_BASE_URL}/v1/completions",num_concurrent=65,max_retries=3,tokenized_requests=False,trust_remote_code=True \
```
`trust_remote_code=True` allows executing arbitrary code from the model repository during evaluation, which is a supply-chain risk in CI. If this is required for these models, consider pinning model revisions (or a vetted local snapshot) and limiting `trust_remote_code` to only the cases that need it.
Suggested change:

```diff
-  --model_args model="${MODEL_PATH}",base_url="${ATOM_BASE_URL}/v1/completions",num_concurrent=65,max_retries=3,tokenized_requests=False,trust_remote_code=True \
+  --model_args model="${MODEL_PATH}",base_url="${ATOM_BASE_URL}/v1/completions",num_concurrent=65,max_retries=3,tokenized_requests=False \
```
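If `trust_remote_code` does turn out to be necessary, a pinned local snapshot is the safer variant. A hedged sketch: the revision SHA and local directory below are placeholders, not values from this PR, and `huggingface-cli download` must be available in the image:

```bash
# Hedged sketch: download a vetted snapshot at an explicit revision, then
# point lm_eval's MODEL_PATH at the local copy instead of the hub repo.
huggingface-cli download deepseek-ai/DeepSeek-R1-0528 \
  --revision "<pinned-commit-sha>" \
  --local-dir /models/DeepSeek-R1-0528
MODEL_PATH=/models/DeepSeek-R1-0528
```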
```yaml
uses: actions/cache@v4
with:
  path: mlir_install.tgz
  key: flydsl-mlir-install-${{ hashFiles('flydsl-test/cmake/llvm-hash.txt', 'flydsl-test/scripts/build_llvm.sh', 'flydsl-test/.github/workflows/flydsl-atom-integration.yaml') }}
```
The cache key hashes `flydsl-test/cmake/llvm-hash.txt`, but this repo doesn't have a `cmake/` directory, so `hashFiles` will treat the file as missing. This makes the key misleading and may reduce cache invalidation when LLVM-related inputs change; consider removing it or hashing an existing file that captures the intended LLVM/MLIR version input.
Suggested change:

```diff
-  key: flydsl-mlir-install-${{ hashFiles('flydsl-test/cmake/llvm-hash.txt', 'flydsl-test/scripts/build_llvm.sh', 'flydsl-test/.github/workflows/flydsl-atom-integration.yaml') }}
+  key: flydsl-mlir-install-${{ hashFiles('flydsl-test/scripts/build_llvm.sh', 'flydsl-test/.github/workflows/flydsl-atom-integration.yaml') }}
```
```yaml
        -e HF_TOKEN="${secrets.HF_TOKEN:-}" \
        --name "${CONTAINER_NAME}" \
        "${{ env.ATOM_BASE_IMAGE }}"
    env:
      GITHUB_WORKSPACE: ${{ github.workspace }}
      MODEL_ENV_VARS: ${{ matrix.env_vars }}
```
The `HF_TOKEN` value here uses bash-style parameter expansion with a dotted variable name (`secrets.HF_TOKEN`), which will cause a bad-substitution error under `set -euo pipefail` and prevent the container from starting. Pass the secret using GitHub Actions expression syntax (e.g. via `env:` plus `-e HF_TOKEN=...`) instead of `${secrets.HF_TOKEN:-}`.
Suggested change:

```diff
-        -e HF_TOKEN="${secrets.HF_TOKEN:-}" \
+        -e HF_TOKEN="${HF_TOKEN:-}" \
         --name "${CONTAINER_NAME}" \
         "${{ env.ATOM_BASE_IMAGE }}"
     env:
       GITHUB_WORKSPACE: ${{ github.workspace }}
       MODEL_ENV_VARS: ${{ matrix.env_vars }}
+      HF_TOKEN: ${{ secrets.HF_TOKEN }}
```
```bash
rm -rf /workspace/llvm-project/build-flydsl/mlir_install
mkdir -p /workspace/llvm-project/build-flydsl
tar -xzf /tmp/mlir_install.tgz -C /workspace/llvm-project/build-flydsl
ls -la /workspace/llvm-project/build-flydsl/mlir_install/lib/cmake/mlir
```
The cached MLIR tarball is extracted into `/workspace/llvm-project/build-flydsl/mlir_install`, but the repo's `scripts/build_llvm.sh` defaults to installing under `/workspace/llvm-project/mlir_install` (and `flir/build.sh` expects `MLIR_PATH` to point at a directory containing `lib/cmake/mlir`). As written, a cache hit will restore MLIR into a location that subsequent build/install steps won't find.
Suggested change:

```diff
-rm -rf /workspace/llvm-project/build-flydsl/mlir_install
-mkdir -p /workspace/llvm-project/build-flydsl
-tar -xzf /tmp/mlir_install.tgz -C /workspace/llvm-project/build-flydsl
-ls -la /workspace/llvm-project/build-flydsl/mlir_install/lib/cmake/mlir
+rm -rf /workspace/llvm-project/mlir_install /workspace/llvm-project/build-flydsl/mlir_install
+mkdir -p /workspace/llvm-project /workspace/llvm-project/build-flydsl
+tar -xzf /tmp/mlir_install.tgz -C /workspace/llvm-project
+ln -sfn /workspace/llvm-project/mlir_install /workspace/llvm-project/build-flydsl/mlir_install
+ls -la /workspace/llvm-project/mlir_install/lib/cmake/mlir
```
Switch the FlyDSL ATOM integration workflow to check out ROCm/ATOM and run the upstream `atom_test.sh` directly. This removes the local wrapper script while keeping the downstream validation flow aligned with ATOM main.
Remove the apt install step from the FlyDSL ATOM integration workflow because the current atom-dev image already includes the required system tools and has broken ROCm apt dependencies. Keep lightweight tool checks so the workflow still fails early if the image contents drift.
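Such a tool check might look like the following sketch; the exact tool list is an assumption based on what this workflow invokes, not a confirmed list from the commit:

```bash
# Hedged sketch of the lightweight tool checks mentioned above: fail early
# with a clear message if the image no longer ships a required binary.
for tool in curl tar git python3 pip; do
  command -v "${tool}" >/dev/null 2>&1 || { echo "Missing required tool: ${tool}" >&2; exit 1; }
done
```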
Convert the FlyDSL ATOM workflow into a nightly validation job that resolves the latest promoted FlyDSL wheel from S3 and installs it into atom-dev before running accuracy checks. This removes the in-workflow FlyDSL build path and aligns the test with the published nightly artifact that downstream users consume.
Simplify the nightly ATOM validation workflow to use the FlyDSL version already bundled in the atom-dev image. This removes the S3 wheel download path so the nightly check validates the container image as published.
Replace the inline heredoc in the bundled FlyDSL validation step with direct docker exec python commands. This avoids shell parsing issues while keeping the preinstalled FlyDSL sanity check in the nightly workflow.
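A heredoc-free sanity check could look like this sketch. The import name `flydsl` matches the pip package used elsewhere in this thread; the version-attribute lookup is an assumption:

```bash
# Hedged sketch: probe the preinstalled FlyDSL via one-line python commands
# instead of a heredoc, so the shell never has to parse a multi-line script.
docker exec "${CONTAINER_NAME}" python -c 'import flydsl; print(flydsl.__file__)'
docker exec "${CONTAINER_NAME}" python -c 'import flydsl; print(getattr(flydsl, "__version__", "unknown"))'
```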
Update the ATOM nightly workflow to follow the latest successful FlyDSL CI run on main when selecting the wheel under test. Try the promoted S3 wheel for that run first and fall back to the run artifact when the S3 copy is unavailable.
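The selection logic described above might be sketched as follows; the workflow file name, artifact name, and S3 prefix are hypothetical placeholders, not the real locations:

```bash
# Hedged sketch: resolve the latest successful FlyDSL CI run on main, try the
# promoted S3 wheel for that run, then fall back to the run artifact.
RUN_ID=$(gh run list --workflow flydsl-ci.yaml --branch main --status success \
  --limit 1 --json databaseId --jq '.[0].databaseId')
mkdir -p /tmp/wheel
if ! aws s3 cp "s3://example-flydsl-artifacts/promoted/${RUN_ID}/" /tmp/wheel/ --recursive; then
  # S3 copy unavailable: download the wheel artifact attached to the run.
  gh run download "${RUN_ID}" --name flydsl-wheel --dir /tmp/wheel
fi
pip install /tmp/wheel/*.whl
```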
Summary
- Run the integration job inside `rocm/atom-dev:latest`
- Reinstall `aiter` from `main` and then install the current FlyDSL commit before running downstream validation
- Evaluate `DeepSeek-R1-0528` and `gpt-oss-120b`, with logs and accuracy artifacts uploaded for workflow debugging

Test plan
- Validated `flydsl-atom-integration.yaml` with `yaml.safe_load`
- `bash -n scripts/run_atom_downstream.sh`