CI: add standalone ATOM integration workflow#428

Open
gyohuangxin wants to merge 8 commits into main from ci/flydsl-atom-integration-main

Conversation

@gyohuangxin
Member

Summary

  • Add a standalone FlyDSL workflow that runs ATOM integration coverage in rocm/atom-dev:latest
  • Reinstall aiter from main, then install the current FlyDSL commit before running downstream validation
  • Run ATOM accuracy coverage for DeepSeek-R1-0528 and gpt-oss-120b, uploading logs and accuracy artifacts for workflow debugging

Test plan

  • Parse flydsl-atom-integration.yaml with yaml.safe_load
  • Run bash -n scripts/run_atom_downstream.sh
  • Check IDE lints for the new workflow and script
  • Run the new GitHub Actions workflow on the PR
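The two local validation steps can be sketched as follows. The filenames come from this PR; the demo below syntax-checks a stand-in script under `/tmp` so it is self-contained, but in the PR the target is `scripts/run_atom_downstream.sh`:

```shell
# Syntax-check a shell script without executing it, mirroring the
# "bash -n scripts/run_atom_downstream.sh" step from the test plan.
# /tmp/check_demo.sh is an illustrative stand-in for the real script.
cat > /tmp/check_demo.sh <<'EOF'
set -euo pipefail
echo "ok"
EOF
bash -n /tmp/check_demo.sh && echo "syntax OK"
```

The YAML check is analogous: `python -c "import yaml; yaml.safe_load(open('.github/workflows/flydsl-atom-integration.yaml'))"` fails loudly on malformed workflow syntax before the workflow ever runs.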

Add a dedicated FlyDSL workflow to validate ATOM against the latest atom-dev image with aiter reinstalled from main and the current FlyDSL commit. This keeps downstream accuracy coverage separate from the existing FlyDSL CI while exercising the same DeepSeek-R1 and GPT-OSS models used by aiter CI.
Copilot AI review requested due to automatic review settings April 22, 2026 09:35
Comment thread .github/workflows/flydsl-atom-integration.yaml Fixed
Contributor

Copilot AI left a comment


Pull request overview

This PR adds a dedicated GitHub Actions workflow to run ATOM downstream accuracy coverage inside rocm/atom-dev:latest, plus a helper script to launch/stop the ATOM OpenAI-compatible server and run lm_eval against it.

Changes:

  • Add scripts/run_atom_downstream.sh to launch/warm up/stop the ATOM server and run a GSM8K accuracy check via lm_eval.
  • Add .github/workflows/flydsl-atom-integration.yaml to run ATOM accuracy on two models, cache an MLIR install tarball, and upload logs/results as artifacts.
  • Reinstall aiter from main and install FlyDSL from the checked-out PR commit inside the container before running validation.
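The launch/stop lifecycle the helper script implements follows a common pattern: start the server in the background, record its PID, install an EXIT trap so the server is killed even if the evaluation fails, then stop it explicitly on success. A minimal sketch (names are illustrative, not the script's actual code; `sleep` stands in for the ATOM server process):

```shell
# Illustrative server-lifecycle skeleton: background process + PID +
# EXIT trap, so cleanup happens on both success and failure paths.
SERVER_LOG=/tmp/atom_server_demo.log
sleep 30 > "${SERVER_LOG}" 2>&1 &      # stand-in for the ATOM server
SERVER_PID=$!
trap 'kill "${SERVER_PID}" 2>/dev/null || true' EXIT

# ... readiness wait and lm_eval would run here ...

kill "${SERVER_PID}"
wait "${SERVER_PID}" 2>/dev/null || true
echo "server stopped"
```

The trap is the important part: without it, a failed `lm_eval` run would leave the server process holding the GPU for the next CI job.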

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| `scripts/run_atom_downstream.sh` | Adds server lifecycle + accuracy evaluation script used by the workflow. |
| `.github/workflows/flydsl-atom-integration.yaml` | New standalone CI workflow to run ATOM accuracy coverage in a ROCm container, with caching + artifacts. |


Comment on lines +139 to +143
```shell
bash scripts/build_llvm.sh
ls -la /workspace/llvm-project/build-flydsl/mlir_install/lib/cmake/mlir
"
docker cp "${CONTAINER_NAME}:/workspace/llvm-project/build-flydsl/mlir_install.tgz" ./mlir_install.tgz
```

Copilot AI Apr 22, 2026


This docker cp path assumes scripts/build_llvm.sh produced /workspace/llvm-project/build-flydsl/mlir_install.tgz, but the script's default output is /workspace/llvm-project/mlir_install.tgz (and the build dir is buildmlir). With the current paths, the tarball copy (and therefore caching) will fail.
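One defensive option, if both path conventions may exist across script revisions, is to probe the candidate locations and copy whichever tarball is actually present. The snippet below demonstrates that selection logic against a temporary directory (both candidate paths are the assumptions from this review comment, and the `docker cp` itself is omitted so the demo is self-contained):

```shell
# Demo of picking whichever output path build_llvm.sh produced.
# In CI, the chosen path would feed the subsequent "docker cp".
root=$(mktemp -d)
mkdir -p "${root}/llvm-project"
touch "${root}/llvm-project/mlir_install.tgz"   # simulate the script's default output

TARBALL=""
for p in "${root}/llvm-project/build-flydsl/mlir_install.tgz" \
         "${root}/llvm-project/mlir_install.tgz"; do
  if [ -f "${p}" ]; then
    TARBALL="${p}"
    break
  fi
done
echo "found: ${TARBALL#"${root}"}"   # → found: /llvm-project/mlir_install.tgz
```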

Copilot uses AI. Check for mistakes.

```shell
echo '=== Installing FlyDSL from the checked out PR commit ==='
pip uninstall -y flydsl || true
cd /workspace/flydsl-test
```

Copilot AI Apr 22, 2026


pip install -e . will invoke flir/build.sh (via setup.py) if the embedded MLIR runtime is missing, and flir/build.sh fails unless MLIR_PATH is set or ${REPO_ROOT}/llvm-project/mlir_install exists. In this workflow, LLVM/MLIR is built/restored under /workspace/llvm-project/..., so you likely need to export MLIR_PATH (or build into the repo-local flydsl-test/llvm-project/) before installing FlyDSL.

Suggested change:

```diff
 cd /workspace/flydsl-test
+if [ -d /workspace/llvm-project/mlir_install ]; then
+  export MLIR_PATH=/workspace/llvm-project/mlir_install
+elif [ -d /workspace/flydsl-test/llvm-project/mlir_install ]; then
+  export MLIR_PATH=/workspace/flydsl-test/llvm-project/mlir_install
+else
+  echo 'ERROR: MLIR install not found. Expected /workspace/llvm-project/mlir_install or /workspace/flydsl-test/llvm-project/mlir_install.' >&2
+  exit 1
+fi
+echo "Using MLIR_PATH=${MLIR_PATH}"
```

Comment thread scripts/run_atom_downstream.sh Outdated
Comment on lines +33 to +40
```shell
if curl -sf "${ATOM_BASE_URL}/health" -o /dev/null; then
  echo "ATOM server HTTP endpoint is ready."
  server_up=true
  break
fi

echo "Waiting for ATOM server... (${i}/${max_retries})"
sleep "${retry_interval}"
```

Copilot AI Apr 22, 2026


curl in the readiness loop has no connect/overall timeout. If the TCP connect hangs (or localhost DNS/stack is wedged), the script can block indefinitely inside wait_for_server rather than retrying. Add --connect-timeout/--max-time so each attempt fails fast and the retry logic works as intended.
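A bounded version of the readiness loop might look like the sketch below. `--connect-timeout` and `--max-time` are standard curl flags; the loop structure mirrors the excerpt above but is not the script's actual code, and the URL here deliberately points at an unreachable port so the demo terminates quickly with `server_up=false`:

```shell
# Readiness probe with per-attempt timeouts: each curl attempt gives up
# after 5s to connect and 15s overall, so the retry loop always advances.
ATOM_BASE_URL="http://127.0.0.1:9"   # discard port: nothing listening, fails fast
max_retries=3
retry_interval=1
server_up=false
for i in $(seq 1 "${max_retries}"); do
  if curl -sf --connect-timeout 5 --max-time 15 "${ATOM_BASE_URL}/health" -o /dev/null; then
    server_up=true
    break
  fi
  echo "Waiting for ATOM server... (${i}/${max_retries})"
  sleep "${retry_interval}"
done
echo "server_up=${server_up}"
```

With the timeouts in place, a wedged TCP connect costs at most one attempt rather than hanging the whole job until the runner's timeout fires.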

Comment thread scripts/run_atom_downstream.sh Outdated

```shell
lm_eval \
  --model local-completions \
  --model_args model="${MODEL_PATH}",base_url="${ATOM_BASE_URL}/v1/completions",num_concurrent=65,max_retries=3,tokenized_requests=False,trust_remote_code=True \
```

Copilot AI Apr 22, 2026


trust_remote_code=True allows executing arbitrary code from the model repository during evaluation, which is a supply-chain risk in CI. If this is required for these models, consider pinning model revisions (or a vetted local snapshot) and limiting trust_remote_code to only the cases that need it.

Suggested change:

```diff
-  --model_args model="${MODEL_PATH}",base_url="${ATOM_BASE_URL}/v1/completions",num_concurrent=65,max_retries=3,tokenized_requests=False,trust_remote_code=True \
+  --model_args model="${MODEL_PATH}",base_url="${ATOM_BASE_URL}/v1/completions",num_concurrent=65,max_retries=3,tokenized_requests=False \
```

```yaml
uses: actions/cache@v4
with:
  path: mlir_install.tgz
  key: flydsl-mlir-install-${{ hashFiles('flydsl-test/cmake/llvm-hash.txt', 'flydsl-test/scripts/build_llvm.sh', 'flydsl-test/.github/workflows/flydsl-atom-integration.yaml') }}
```

Copilot AI Apr 22, 2026


The cache key hashes flydsl-test/cmake/llvm-hash.txt, but this repo doesn't have a cmake/ directory (so hashFiles will treat it as missing). This makes the key misleading and may reduce cache invalidation when LLVM-related inputs change; consider removing it or hashing an existing file that captures the intended LLVM/MLIR version input.

Suggested change:

```diff
-key: flydsl-mlir-install-${{ hashFiles('flydsl-test/cmake/llvm-hash.txt', 'flydsl-test/scripts/build_llvm.sh', 'flydsl-test/.github/workflows/flydsl-atom-integration.yaml') }}
+key: flydsl-mlir-install-${{ hashFiles('flydsl-test/scripts/build_llvm.sh', 'flydsl-test/.github/workflows/flydsl-atom-integration.yaml') }}
```

Comment on lines +99 to +104
```yaml
        -e HF_TOKEN="${secrets.HF_TOKEN:-}" \
        --name "${CONTAINER_NAME}" \
        "${{ env.ATOM_BASE_IMAGE }}"
      env:
        GITHUB_WORKSPACE: ${{ github.workspace }}
        MODEL_ENV_VARS: ${{ matrix.env_vars }}
```

Copilot AI Apr 22, 2026


The HF_TOKEN value here uses bash-style parameter expansion with a dotted variable name (secrets.HF_TOKEN), which will cause a bad substitution error under set -euo pipefail and prevent the container from starting. Pass the secret using GitHub Actions expression syntax (e.g. via env: + -e HF_TOKEN=...) instead of ${secrets.HF_TOKEN:-}.

Suggested change:

```diff
--e HF_TOKEN="${secrets.HF_TOKEN:-}" \
+-e HF_TOKEN="${HF_TOKEN:-}" \
 --name "${CONTAINER_NAME}" \
 "${{ env.ATOM_BASE_IMAGE }}"
 env:
   GITHUB_WORKSPACE: ${{ github.workspace }}
   MODEL_ENV_VARS: ${{ matrix.env_vars }}
+  HF_TOKEN: ${{ secrets.HF_TOKEN }}
```

Comment on lines +126 to +129
```shell
rm -rf /workspace/llvm-project/build-flydsl/mlir_install
mkdir -p /workspace/llvm-project/build-flydsl
tar -xzf /tmp/mlir_install.tgz -C /workspace/llvm-project/build-flydsl
ls -la /workspace/llvm-project/build-flydsl/mlir_install/lib/cmake/mlir
```

Copilot AI Apr 22, 2026


The cached MLIR tarball is extracted into /workspace/llvm-project/build-flydsl/mlir_install, but the repo's scripts/build_llvm.sh defaults to installing under /workspace/llvm-project/mlir_install (and flir/build.sh expects MLIR_PATH to point at a dir containing lib/cmake/mlir). As written, a cache hit will restore MLIR into a location that subsequent build/install steps won't find.

Suggested change:

```diff
-rm -rf /workspace/llvm-project/build-flydsl/mlir_install
-mkdir -p /workspace/llvm-project/build-flydsl
-tar -xzf /tmp/mlir_install.tgz -C /workspace/llvm-project/build-flydsl
-ls -la /workspace/llvm-project/build-flydsl/mlir_install/lib/cmake/mlir
+rm -rf /workspace/llvm-project/mlir_install /workspace/llvm-project/build-flydsl/mlir_install
+mkdir -p /workspace/llvm-project /workspace/llvm-project/build-flydsl
+tar -xzf /tmp/mlir_install.tgz -C /workspace/llvm-project
+ln -sfn /workspace/llvm-project/mlir_install /workspace/llvm-project/build-flydsl/mlir_install
+ls -la /workspace/llvm-project/mlir_install/lib/cmake/mlir
```

Switch the FlyDSL ATOM integration workflow to checkout ROCm/ATOM and run the upstream atom_test.sh directly. This removes the local wrapper script while keeping the downstream validation flow aligned with ATOM main.
Remove the apt install step from the FlyDSL ATOM integration workflow because the current atom-dev image already includes the required system tools and has broken ROCm apt dependencies. Keep lightweight tool checks so the workflow still fails early if the image contents drift.
Convert the FlyDSL ATOM workflow into a nightly validation job that resolves the latest promoted FlyDSL wheel from S3 and installs it into atom-dev before running accuracy checks. This removes the in-workflow FlyDSL build path and aligns the test with the published nightly artifact that downstream users consume.
Simplify the nightly ATOM validation workflow to use the FlyDSL version already bundled in the atom-dev image. This removes the S3 wheel download path so the nightly check validates the container image as published.
Replace the inline heredoc in the bundled FlyDSL validation step with direct docker exec python commands. This avoids shell parsing issues while keeping the preinstalled FlyDSL sanity check in the nightly workflow.
Update the ATOM nightly workflow to follow the latest successful FlyDSL CI run on main when selecting the wheel under test. Try the promoted S3 wheel for that run first and fall back to the run artifact when the S3 copy is unavailable.