CI: add standalone ATOM integration workflow #428
Conversation
Add a dedicated FlyDSL workflow to validate ATOM against the latest atom-dev image with aiter reinstalled from main and the current FlyDSL commit. This keeps downstream accuracy coverage separate from the existing FlyDSL CI while exercising the same DeepSeek-R1 and GPT-OSS models used by aiter CI.
Pull request overview
This PR adds a dedicated GitHub Actions workflow to run ATOM downstream accuracy coverage inside rocm/atom-dev:latest, plus a helper script to launch/stop the ATOM OpenAI-compatible server and run lm_eval against it.
Changes:
- Add `scripts/run_atom_downstream.sh` to launch/warm up/stop the ATOM server and run a GSM8K accuracy check via `lm_eval`.
- Add `.github/workflows/flydsl-atom-integration.yaml` to run ATOM accuracy on two models, cache an MLIR install tarball, and upload logs/results as artifacts.
- Reinstall `aiter` from `main` and install FlyDSL from the checked-out PR commit inside the container before running validation.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| `scripts/run_atom_downstream.sh` | Adds server lifecycle + accuracy evaluation script used by the workflow. |
| `.github/workflows/flydsl-atom-integration.yaml` | New standalone CI workflow to run ATOM accuracy coverage in a ROCm container, with caching + artifacts. |
```bash
  bash scripts/build_llvm.sh
  ls -la /workspace/llvm-project/build-flydsl/mlir_install/lib/cmake/mlir
"
docker cp "${CONTAINER_NAME}:/workspace/llvm-project/build-flydsl/mlir_install.tgz" ./mlir_install.tgz
```
This `docker cp` path assumes `scripts/build_llvm.sh` produced `/workspace/llvm-project/build-flydsl/mlir_install.tgz`, but the script's default output is `/workspace/llvm-project/mlir_install.tgz` (and the build dir is `buildmlir`). With the current paths, the tarball copy (and therefore caching) will fail.
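One way to make the copy robust is to package the tarball from whichever install location actually exists before the `docker cp`. A minimal sketch, assuming the two candidate paths mentioned in this thread (verify them against the real `build_llvm.sh` before relying on this):

```bash
# Hedged sketch: package the MLIR install from whichever of the two candidate
# locations build_llvm.sh actually produced, at the path `docker cp` expects.
# Both paths are taken from this review thread, not verified script defaults.
for d in /workspace/llvm-project/mlir_install \
         /workspace/llvm-project/build-flydsl/mlir_install; do
  if [ -d "${d}/lib/cmake/mlir" ]; then
    mkdir -p /workspace/llvm-project/build-flydsl
    tar -czf /workspace/llvm-project/build-flydsl/mlir_install.tgz \
      -C "$(dirname "${d}")" mlir_install
    break
  fi
done
```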
```bash
echo '=== Installing FlyDSL from the checked out PR commit ==='
pip uninstall -y flydsl || true
cd /workspace/flydsl-test
```
`pip install -e .` will invoke `flir/build.sh` (via `setup.py`) if the embedded MLIR runtime is missing, and `flir/build.sh` fails unless `MLIR_PATH` is set or `${REPO_ROOT}/llvm-project/mlir_install` exists. In this workflow, LLVM/MLIR is built/restored under `/workspace/llvm-project/...`, so you likely need to export `MLIR_PATH` (or build into the repo-local `flydsl-test/llvm-project/`) before installing FlyDSL.
Suggested change:

```diff
 cd /workspace/flydsl-test
+if [ -d /workspace/llvm-project/mlir_install ]; then
+  export MLIR_PATH=/workspace/llvm-project/mlir_install
+elif [ -d /workspace/flydsl-test/llvm-project/mlir_install ]; then
+  export MLIR_PATH=/workspace/flydsl-test/llvm-project/mlir_install
+else
+  echo 'ERROR: MLIR install not found. Expected /workspace/llvm-project/mlir_install or /workspace/flydsl-test/llvm-project/mlir_install.' >&2
+  exit 1
+fi
+echo "Using MLIR_PATH=${MLIR_PATH}"
```
```bash
if curl -sf "${ATOM_BASE_URL}/health" -o /dev/null; then
  echo "ATOM server HTTP endpoint is ready."
  server_up=true
  break
fi

echo "Waiting for ATOM server... (${i}/${max_retries})"
sleep "${retry_interval}"
```
`curl` in the readiness loop has no connect/overall timeout. If the TCP connect hangs (or the localhost DNS/network stack is wedged), the script can block indefinitely inside `wait_for_server` rather than retrying. Add `--connect-timeout`/`--max-time` so each attempt fails fast and the retry logic works as intended.
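A minimal sketch of the probe with per-attempt timeouts, reusing the variable names from the script above (`ATOM_BASE_URL`, `max_retries`, `retry_interval`):

```bash
# Hedged sketch: --connect-timeout bounds the TCP connect, --max-time bounds
# the whole request, so a hung attempt fails fast and the loop keeps retrying.
server_up=false
for i in $(seq 1 "${max_retries}"); do
  if curl -sf --connect-timeout 5 --max-time 10 "${ATOM_BASE_URL}/health" -o /dev/null; then
    echo "ATOM server HTTP endpoint is ready."
    server_up=true
    break
  fi
  echo "Waiting for ATOM server... (${i}/${max_retries})"
  sleep "${retry_interval}"
done
```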
```bash
lm_eval \
  --model local-completions \
  --model_args model="${MODEL_PATH}",base_url="${ATOM_BASE_URL}/v1/completions",num_concurrent=65,max_retries=3,tokenized_requests=False,trust_remote_code=True \
```
`trust_remote_code=True` allows executing arbitrary code from the model repository during evaluation, which is a supply-chain risk in CI. If this is required for these models, consider pinning model revisions (or a vetted local snapshot) and limiting `trust_remote_code` to only the cases that need it.
Suggested change:

```diff
-  --model_args model="${MODEL_PATH}",base_url="${ATOM_BASE_URL}/v1/completions",num_concurrent=65,max_retries=3,tokenized_requests=False,trust_remote_code=True \
+  --model_args model="${MODEL_PATH}",base_url="${ATOM_BASE_URL}/v1/completions",num_concurrent=65,max_retries=3,tokenized_requests=False \
```
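If `trust_remote_code` does turn out to be necessary, a pinned local snapshot is the safer variant. A hedged sketch: the revision SHA and local directory below are placeholders, not values from this PR, and `huggingface-cli download` must be available in the image:

```bash
# Hedged sketch: download a vetted snapshot at an explicit revision, then
# point lm_eval's MODEL_PATH at the local copy instead of the hub repo.
huggingface-cli download deepseek-ai/DeepSeek-R1-0528 \
  --revision "<pinned-commit-sha>" \
  --local-dir /models/DeepSeek-R1-0528
MODEL_PATH=/models/DeepSeek-R1-0528
```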
```yaml
uses: actions/cache@v4
with:
  path: mlir_install.tgz
  key: flydsl-mlir-install-${{ hashFiles('flydsl-test/cmake/llvm-hash.txt', 'flydsl-test/scripts/build_llvm.sh', 'flydsl-test/.github/workflows/flydsl-atom-integration.yaml') }}
```
The cache key hashes `flydsl-test/cmake/llvm-hash.txt`, but this repo doesn't have a `cmake/` directory, so `hashFiles` will treat the file as missing. This makes the key misleading and may reduce cache invalidation when LLVM-related inputs change; consider removing it or hashing an existing file that captures the intended LLVM/MLIR version input.
Suggested change:

```diff
-  key: flydsl-mlir-install-${{ hashFiles('flydsl-test/cmake/llvm-hash.txt', 'flydsl-test/scripts/build_llvm.sh', 'flydsl-test/.github/workflows/flydsl-atom-integration.yaml') }}
+  key: flydsl-mlir-install-${{ hashFiles('flydsl-test/scripts/build_llvm.sh', 'flydsl-test/.github/workflows/flydsl-atom-integration.yaml') }}
```
```yaml
        -e HF_TOKEN="${secrets.HF_TOKEN:-}" \
        --name "${CONTAINER_NAME}" \
        "${{ env.ATOM_BASE_IMAGE }}"
    env:
      GITHUB_WORKSPACE: ${{ github.workspace }}
      MODEL_ENV_VARS: ${{ matrix.env_vars }}
```
The `HF_TOKEN` value here uses bash-style parameter expansion with a dotted variable name (`secrets.HF_TOKEN`), which will cause a bad-substitution error under `set -euo pipefail` and prevent the container from starting. Pass the secret using GitHub Actions expression syntax (e.g. via `env:` plus `-e HF_TOKEN=...`) instead of `${secrets.HF_TOKEN:-}`.
Suggested change:

```diff
-        -e HF_TOKEN="${secrets.HF_TOKEN:-}" \
+        -e HF_TOKEN="${HF_TOKEN:-}" \
         --name "${CONTAINER_NAME}" \
         "${{ env.ATOM_BASE_IMAGE }}"
     env:
       GITHUB_WORKSPACE: ${{ github.workspace }}
       MODEL_ENV_VARS: ${{ matrix.env_vars }}
+      HF_TOKEN: ${{ secrets.HF_TOKEN }}
```
```bash
rm -rf /workspace/llvm-project/build-flydsl/mlir_install
mkdir -p /workspace/llvm-project/build-flydsl
tar -xzf /tmp/mlir_install.tgz -C /workspace/llvm-project/build-flydsl
ls -la /workspace/llvm-project/build-flydsl/mlir_install/lib/cmake/mlir
```
The cached MLIR tarball is extracted into `/workspace/llvm-project/build-flydsl/mlir_install`, but the repo's `scripts/build_llvm.sh` defaults to installing under `/workspace/llvm-project/mlir_install` (and `flir/build.sh` expects `MLIR_PATH` to point at a directory containing `lib/cmake/mlir`). As written, a cache hit will restore MLIR into a location that subsequent build/install steps won't find.
Suggested change:

```diff
-rm -rf /workspace/llvm-project/build-flydsl/mlir_install
-mkdir -p /workspace/llvm-project/build-flydsl
-tar -xzf /tmp/mlir_install.tgz -C /workspace/llvm-project/build-flydsl
-ls -la /workspace/llvm-project/build-flydsl/mlir_install/lib/cmake/mlir
+rm -rf /workspace/llvm-project/mlir_install /workspace/llvm-project/build-flydsl/mlir_install
+mkdir -p /workspace/llvm-project /workspace/llvm-project/build-flydsl
+tar -xzf /tmp/mlir_install.tgz -C /workspace/llvm-project
+ln -sfn /workspace/llvm-project/mlir_install /workspace/llvm-project/build-flydsl/mlir_install
+ls -la /workspace/llvm-project/mlir_install/lib/cmake/mlir
```
Switch the FlyDSL ATOM integration workflow to check out ROCm/ATOM and run the upstream `atom_test.sh` directly. This removes the local wrapper script while keeping the downstream validation flow aligned with ATOM main.
Remove the apt install step from the FlyDSL ATOM integration workflow because the current atom-dev image already includes the required system tools and has broken ROCm apt dependencies. Keep lightweight tool checks so the workflow still fails early if the image contents drift.
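Such a tool check might look like the following sketch; the exact tool list is an assumption based on what this workflow invokes, not a confirmed list from the commit:

```bash
# Hedged sketch of the lightweight tool checks mentioned above: fail early
# with a clear message if the image no longer ships a required binary.
for tool in curl tar git python3 pip; do
  command -v "${tool}" >/dev/null 2>&1 || { echo "Missing required tool: ${tool}" >&2; exit 1; }
done
```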
Convert the FlyDSL ATOM workflow into a nightly validation job that resolves the latest promoted FlyDSL wheel from S3 and installs it into atom-dev before running accuracy checks. This removes the in-workflow FlyDSL build path and aligns the test with the published nightly artifact that downstream users consume.
Simplify the nightly ATOM validation workflow to use the FlyDSL version already bundled in the atom-dev image. This removes the S3 wheel download path so the nightly check validates the container image as published.
Replace the inline heredoc in the bundled FlyDSL validation step with direct docker exec python commands. This avoids shell parsing issues while keeping the preinstalled FlyDSL sanity check in the nightly workflow.
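A heredoc-free sanity check could look like this sketch. The import name `flydsl` matches the pip package used elsewhere in this thread; the version-attribute lookup is an assumption:

```bash
# Hedged sketch: probe the preinstalled FlyDSL via one-line python commands
# instead of a heredoc, so the shell never has to parse a multi-line script.
docker exec "${CONTAINER_NAME}" python -c 'import flydsl; print(flydsl.__file__)'
docker exec "${CONTAINER_NAME}" python -c 'import flydsl; print(getattr(flydsl, "__version__", "unknown"))'
```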
Update the ATOM nightly workflow to follow the latest successful FlyDSL CI run on main when selecting the wheel under test. Try the promoted S3 wheel for that run first and fall back to the run artifact when the S3 copy is unavailable.
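The selection logic described above might be sketched as follows; the workflow file name, artifact name, and S3 prefix are hypothetical placeholders, not the real locations:

```bash
# Hedged sketch: resolve the latest successful FlyDSL CI run on main, try the
# promoted S3 wheel for that run, then fall back to the run artifact.
RUN_ID=$(gh run list --workflow flydsl-ci.yaml --branch main --status success \
  --limit 1 --json databaseId --jq '.[0].databaseId')
mkdir -p /tmp/wheel
if ! aws s3 cp "s3://example-flydsl-artifacts/promoted/${RUN_ID}/" /tmp/wheel/ --recursive; then
  # S3 copy unavailable: download the wheel artifact attached to the run.
  gh run download "${RUN_ID}" --name flydsl-wheel --dir /tmp/wheel
fi
pip install /tmp/wheel/*.whl
```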
Summary
- Run the integration job inside `rocm/atom-dev:latest`
- Reinstall `aiter` from `main` and then install the current FlyDSL commit before running downstream validation
- Evaluate `DeepSeek-R1-0528` and `gpt-oss-120b`, with logs and accuracy artifacts uploaded for workflow debugging

Test plan
- Validated `flydsl-atom-integration.yaml` with `yaml.safe_load`
- `bash -n scripts/run_atom_downstream.sh`