Merged

Changes from all commits (29 commits)
3346bc9
implement Vector type, remove vector dialect in kernels and use soffs…
coderfeli Apr 10, 2026
9a84cc9
[FIX] Support AOT cross-compilation with COMPILE_ONLY cache save (#383)
coderfeli Apr 12, 2026
ab2c1cb
[Docs] Add MI355X/gfx1201 and MI450/gfx1250 to platform docs (#384)
coderfeli Apr 12, 2026
b1688aa
update to v0.1.3 (#385)
coderfeli Apr 12, 2026
bf6a8d0
Pr/a16wi4 group (#370)
yadaish Apr 13, 2026
c5d83df
[FEAT] Add get_leaves op and support convert dyn_tuple to py_tuple (#…
sjfeng1999 Apr 13, 2026
c5c4f6f
remove ci massive llvm log (#394)
coderfeli Apr 13, 2026
e4ab0a9
Change result type of elem_less/equal to i1 (#392)
sjfeng1999 Apr 13, 2026
c8dbdaf
[MLIR] enhance get_scalar op, only requires dyn_leaf_cnt = 1 (#398)
sjfeng1999 Apr 14, 2026
74449c4
support gptoss gemm shape by padding (#397)
aoli26 Apr 14, 2026
6cb26c4
Add MakeFragmentLayoutLikeOp, improve robustness of prim funcs (#399)
sjfeng1999 Apr 14, 2026
fd9ca4f
support split-k algo for moe_gemm_2stage (#390)
yadaish Apr 14, 2026
f65e930
Gfx1250 moe (#402)
XingerZhu Apr 16, 2026
6e635c6
Add CI testcases and benchmark for allreduce (#387)
yanboshao Apr 16, 2026
a28e150
[OPT] Add pass: convert-atom-call-to-ssa-form (#407)
sjfeng1999 Apr 16, 2026
2e71f65
[docs] add frontend semantic restrictions for MLIR kernel authoring (…
zhiding512 Apr 16, 2026
f9120a8
Detach requires_grad tensors before FlyDSL DLPack export (#409)
zhiding512 Apr 16, 2026
68f5725
[OPT] Add pass promote-regmem-to-vectorssa (#410)
sjfeng1999 Apr 17, 2026
8d3456c
fix version err after uninstall (#413)
coderfeli Apr 19, 2026
23f59ab
Add fused epilogue support to preshuffle GEMM: bias + ReLU/SiLU/GeLU …
andyluo7 Apr 20, 2026
bdcfc1e
[Fix] eliminate llvm unsupported type (Float8/4) before llvm conversi…
sjfeng1999 Apr 20, 2026
c1ea698
[ROCDL] Add CDNA4_MFMAScaleType (#417)
sjfeng1999 Apr 21, 2026
898c8d3
Adjust allreduce CI benchmark thresholds and replace error column wit…
yanboshao Apr 21, 2026
0139940
[Agent] New skill: add-target-atom-op (#423)
sjfeng1999 Apr 21, 2026
3f7b6b5
Port v2 gemm to main (#422)
coderfeli Apr 22, 2026
ff6df13
[OPT] Add fly-int-swizzle-simplify pass (#427)
sjfeng1999 Apr 23, 2026
25510d1
Implement MLA decode fwd nh128 a8w8 kernel with FlyDSL. (#403)
ruanjm Apr 23, 2026
5191746
[Perf] Port mixed_moe kernel optimizations for stage1/stage2 (#388)
lalala-sh Apr 23, 2026
19d6f6e
improve fused_rope kernel (#416)
amd-weisun Apr 23, 2026
472 changes: 472 additions & 0 deletions .claude/skills/add-target-atom-op/SKILL.md

Large diffs are not rendered by default.

82 changes: 70 additions & 12 deletions .claude/skills/flydsl-kernel-authoring/SKILL.md
@@ -224,6 +224,39 @@ for i in range(runtime_value):
...
```

### Frontend Semantic Restrictions
When writing or reviewing `@flyc.kernel` / `@flyc.jit` code, proactively avoid the following patterns: they can conflict with MLIR construction even though they are valid plain Python.

1. **Do not define values inside `if/else` and use them later outside the branch.** Keep a single explicit definition path.
```python
if cond:
dst = a
else:
dst = b
use(dst) # avoid this pattern
```
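
When both branch values can be computed unconditionally, one rewrite that keeps a single definition path is `arith.select` (covered later in this skill); `cond`, `a`, `b`, and `use` are illustrative names:
```python
# Single definition path: both candidates are materialized, then one
# is selected, so `dst` has exactly one definition site.
dst = arith.select(cond, a, b)
use(dst)
```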

2. **Do not mutate captured outer variables inside nested helper functions.** Read-only closure capture is acceptable, but writes should go through explicit parameters and return values.
```python
def kernel():
acc = fx.Float32(0.0)

def helper(acc):
acc = acc + fx.Float32(1.0)
return acc

acc = helper(acc)
```

3. **Avoid early `return`, and do not place `return` / `yield` inside `if/else` branches.** Prefer a single explicit exit so the frontend can determine result types.
```python
if cond:
out = v0
else:
out = v1
return out
```

### scf.for with Loop-Carried Values (Software Pipelining)

Use `init=` on `range()` to create an `scf.for` with explicit SSA phi nodes for loop-carried state. This is required for software pipelining (prefetch patterns) where data must flow across iterations.
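
A minimal sketch of the shape (how the loop body receives and rebinds the carried value is an assumption here; the full example that follows shows the exact protocol):

```python
# `acc` is threaded through the scf.for as an SSA phi node.
acc = Vec.filled(vec_width, 0.0, Float32)        # initial loop-carried value
for i, acc in range(0, K, BLOCK_K, init=[acc]):  # assumed unpacking of (iv, carried)
    tile = load_tile(i)  # illustrative prefetch helper
    acc = acc + tile     # rebound value is carried into the next iteration
```
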
@@ -300,27 +333,52 @@ result = arith.select(cond, true_val, false_val)
is_less = arith.cmpf(a, b, predicate="olt") # ordered less-than
```

### Internal Types: Vector and Numeric (PREFERRED)

Use FlyDSL's internal typed system instead of raw MLIR ops. The `Vector` class wraps `vector<NxTy>` with operator overloading and type-safe methods.

```python
from flydsl._mlir.ir import VectorType
from flydsl.expr.typing import Vector as Vec, Float32, Float16, BFloat16

# Wrap raw vector values
acc = Vec(frag_C.load())  # vector<Nxf32> → Vector with * / + operators

# Indexing (replaces vector.extract)
val = acc[idx]  # returns Float32 scalar

# Bitcast (replaces vector.bitcast)
v_f32 = Vec(raw_vec).bitcast(Float32)  # vector<Nxi32> → vector<Nxf32>

# Type conversion (replaces arith.trunc_f / arith.ext_f)
bf16_val = f32_val.to(BFloat16)  # f32 → bf16

# Arithmetic — use Python operators, not arith.mulf/addf
result = (val * scale_a) * scale_b  # auto-dispatches to mulf

# Splat constant vector
zeros = Vec.filled(N, 0.0, Float32)

# Raw VectorType splat and raw-op style remain available when needed
vec_type = VectorType.get([vec_width], fx.T.f32())
scale_vec = arith.constant_vector(2.0, vec_type)  # splat constant vector (all 2.0)
vA = fx.memref_load_vec(rA)     # load vec from register
vC = arith.mulf(vA, scale_vec)  # element-wise scale

# Index cast — use fx.Int32 instead of arith.index_cast
idx = fx.Int32(gpu.block_id("x") * tile_m)
```

**Prefer internal types over raw ops:**
| Raw MLIR op | Internal type equivalent |
|-------------|------------------------|
| `vector.extract(v, static_position=[i], ...)` | `Vec(v)[i]` |
| `vector.bitcast(target_ty, v)` | `Vec(v).bitcast(Float32)` |
| `arith.trunc_f(ty, v)` | `v.to(BFloat16)` |
| `arith.mulf(a, b)` | `a * b` |
| `arith.addf(a, b)` | `a + b` |
| `arith.index_cast(T.i32, v)` | `fx.Int32(v)` |

Still use `vector.from_elements` for building vectors from individual scalars (no `Vector` equivalent yet); `arith.constant_vector` also remains available when a splat is needed as a raw `VectorType` value rather than a `Vec`.
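
To make the table concrete, here is the same two-step epilogue written both ways; `a` and `b` are illustrative scalars, and `T.bf16` is assumed by analogy with `T.f32` / `T.i32`:

```python
# Raw MLIR ops
s = arith.mulf(a, b)
out = arith.trunc_f(T.bf16, s)

# Internal types
s = a * b             # dispatches to arith.mulf
out = s.to(BFloat16)  # replaces arith.trunc_f
```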

### Arith Ops Availability Table
| Operation | Function | Works on Vectors | Notes |
|-----------|----------|-----------------|-------|
| Add | `a + b` or `arith.addf(a, b)` | Yes | |
| Multiply | `a * b` or `arith.mulf(a, b)` | Yes | |
| Negate | `arith.negf(a)` | Yes | |
| Max | `arith.maximumf(a, b)` | Yes | Good for ReLU |
| Compare | `arith.cmpf(a, b, pred)` | Yes | Returns i1/vec<i1> |
33 changes: 27 additions & 6 deletions .claude/skills/port-to-layout-api/SKILL.md
@@ -31,7 +31,7 @@ Read the kernel and classify each buffer_load/buffer_store:
| Pattern | Layout API Port | Example |
|---------|----------------|---------|
| Contiguous vec load along innermost dim | `make_buffer_tensor` + `BufferCopy128b` | Load 8xf16 from row |
| Scalar load (vec_width=1) | `make_buffer_tensor` + `BufferCopy32b`/`BufferCopy16b` | Scale/metadata loads |
| Scattered store (non-contiguous layout) | Keep as `buffer_ops.buffer_store` | Non-flash value_cache |
| Contiguous vec store along innermost dim | `make_buffer_tensor` + `BufferCopy` | Store 8xf16 to output |
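
For orientation, the first row of this table looks roughly as follows, sketched by analogy with the scalar pattern in Step 6 below; the copy-atom width, the layout arguments, and `T.f16` are assumptions to adapt per kernel:

```python
buf = fx.rocdl.make_buffer_tensor(tensor, max_size=True)
copy_atom_v = fx.make_copy_atom(fx.rocdl.BufferCopy128b(), 128)  # 8 x f16 = 128 bits
vec_reg_ty = fx.MemRefType.get(T.f16, fx.LayoutType.get(8, 1), fx.AddressSpace.Register)
vec_reg_lay = fx.make_layout(8, 1)
div = fx.logical_divide(buf, fx.make_layout(8, 1))

r = fx.memref_alloca(vec_reg_ty, vec_reg_lay)  # 8xf16 register fragment
fx.copy_atom_call(copy_atom_v, fx.slice(div, (None, fx.Int32(row))), r)  # `row` is illustrative
vals = Vec(fx.memref_load_vec(r))  # vector<8xf16> ready for compute
```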

@@ -138,12 +138,33 @@ if is_valid:
_store_vec(val, out_div, idx)
```

### Step 6: Scalar Loads via Layout API

Scalar loads (vec_width=1) also work through the layout API:

```python
buf = fx.rocdl.make_buffer_tensor(tensor, max_size=True)
copy_atom_s = fx.make_copy_atom(fx.rocdl.BufferCopy32b(), 32) # f32 scalar
scalar_reg_ty = fx.MemRefType.get(T.f32, fx.LayoutType.get(1, 1), fx.AddressSpace.Register)
scalar_reg_lay = fx.make_layout(1, 1)
div = fx.logical_divide(buf, fx.make_layout(1, 1))

def load_scalar(index):
r = fx.memref_alloca(scalar_reg_ty, scalar_reg_lay)
fx.copy_atom_call(copy_atom_s, fx.slice(div, (None, fx.Int32(index))), r)
return Vec(fx.memref_load_vec(r))[0] # extract scalar from vector<1xf32>
```

Scalar stores work the same way (reverse src/dst):
```python
def store_scalar(index, val):
r = fx.memref_alloca(scalar_reg_ty, scalar_reg_lay)
fx.memref_store_vec(Vec.filled(1, val, Float32), r)
fx.copy_atom_call(copy_atom_s, r, fx.slice(div, (None, fx.Int32(index))))
```
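
Hypothetical usage of the two helpers, reading a per-block scale and writing an adjusted value back (`block_id` and `gain` are illustrative):

```python
scale = load_scalar(block_id)         # Float32 pulled through the layout API
store_scalar(block_id, scale * gain)  # scalar math via operator overloading
```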

Keep `buffer_ops` only for:
- Scattered stores where elements are truly non-contiguous in memory

### Step 7: Remove Dead Code

179 changes: 178 additions & 1 deletion .github/workflows/flydsl.yaml
@@ -9,6 +9,11 @@ on:
- main
workflow_dispatch:

permissions:
contents: read
actions: read
pull-requests: read

concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true
@@ -19,10 +24,18 @@ env:
GITHUB_COMMIT_SHA: ${{ github.event.pull_request.head.sha || github.event.head_commit.id }}

jobs:
# ---------------------------------------------------------------------------
# Single-GPU tests: kernels, unit, examples, MLIR FileCheck, benchmarks.
# Runs on 1-GPU and Navi runners only.
# ---------------------------------------------------------------------------
test:
strategy:
matrix:
runners: [
'linux-flydsl-mi325-1',
'linux-flydsl-mi355-1',
'linux-flydsl-navi-2',
]
fail-fast: false
runs-on: ${{ matrix.runners }}
steps:
@@ -169,3 +182,167 @@
run: |
docker stop flydsl_test
docker rm flydsl_test

# ---------------------------------------------------------------------------
# Multi-GPU allreduce tests: ONLY for 8-GPU runners.
# Runs on BOTH linux-flydsl-mi325-8 AND linux-flydsl-mi355-8 independently.
# fail-fast: false ensures both runners always complete even if one fails.
# ---------------------------------------------------------------------------
multi-gpu:
needs: test
name: Multi-GPU AllReduce Tests (${{ matrix.runners }})
timeout-minutes: 120
strategy:
matrix:
runners: [
'linux-flydsl-mi325-8',
'linux-flydsl-mi355-8',
]
fail-fast: false
runs-on: ${{ matrix.runners }}
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
repository: ${{ env.GITHUB_REPO_NAME }}
ref: ${{ env.GITHUB_COMMIT_SHA }}
path: flydsl-test

- name: Start CI container
run: |
echo "Clean up containers..."
docker ps -aq -f name=flydsl_test | xargs -r docker stop | xargs -r docker rm || true

echo "Start CI container..."
if [ -f "/etc/podinfo/gha-render-devices" ]; then
DEVICE_FLAG=$(cat /etc/podinfo/gha-render-devices)
else
DEVICE_FLAG="--device /dev/dri"
fi

docker run -dt --network=host --user root --device=/dev/kfd $DEVICE_FLAG \
-v "${GITHUB_WORKSPACE:-$PWD}/flydsl-test:/flydsl-test" \
--ipc=host --group-add video \
--shm-size 16g \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
-w /flydsl-test \
--name flydsl_test \
${{ env.DOCKER_IMAGE }}
env:
GITHUB_WORKSPACE: ${{ github.workspace }}

- name: Install dependencies
run: |
docker exec flydsl_test bash -c "apt-get update && apt-get install -y cmake build-essential patchelf"
docker exec flydsl_test bash -c "python3 -m pip install -U pip setuptools wheel"
docker exec flydsl_test bash -c "python3 -m pip install ninja>=1.11.1"
docker exec flydsl_test bash -c "python3 -m pip install -U 'hypothesis>=6.82.0'"
docker exec flydsl_test bash -c "git config --global --add safe.directory /flydsl-test && cd /flydsl-test && git log"

- name: Restore cached MLIR install tarball (if available)
id: mlir-cache
uses: actions/cache@v4
with:
path: mlir_install.tgz
key: mlir-install-${{ matrix.runners }}-${{ hashFiles('flydsl-test/thirdparty/llvm-hash.txt', 'flydsl-test/scripts/build_llvm.sh', 'flydsl-test/CMakeLists.txt', 'flydsl-test/.github/workflows/flydsl.yaml') }}

- name: Use cached MLIR install tarball (skip LLVM build)
if: steps.mlir-cache.outputs.cache-hit == 'true'
run: |
ls -lh mlir_install.tgz
docker cp mlir_install.tgz flydsl_test:/tmp/mlir_install.tgz
docker exec flydsl_test bash -c "rm -rf /llvm-project/mlir_install && mkdir -p /llvm-project && tar -xzf /tmp/mlir_install.tgz -C /llvm-project"
docker exec flydsl_test bash -c "ls -la /llvm-project/mlir_install/lib/cmake/mlir"

- name: Build LLVM
if: steps.mlir-cache.outputs.cache-hit != 'true'
run: |
set -ex
docker exec flydsl_test bash -c "cd /flydsl-test && bash scripts/build_llvm.sh"
docker exec flydsl_test bash -c "ls -la /llvm-project/mlir_install/lib/cmake/mlir"
docker cp flydsl_test:/llvm-project/mlir_install.tgz ./mlir_install.tgz || true

- name: Build FlyDSL (uses MLIR install prefix)
run: |
docker exec flydsl_test bash -c "export MLIR_PATH=/llvm-project/mlir_install && cd /flydsl-test && python3 -m pip install -e . --use-pep517"

- name: Run multi-GPU allreduce tests
timeout-minutes: 30
run: |
docker exec flydsl_test bash -c "
cd /flydsl-test
python3 -m pytest tests/kernels/test_allreduce.py \
-m multi_gpu -v --no-header --tb=short
"

- name: Run allreduce benchmark (PR)
timeout-minutes: 30
run: |
docker exec flydsl_test bash -c "
cd /flydsl-test
python3 tests/kernels/test_allreduce.py \
--world_size 8 --iters 51 --warmup 5 \
--allreduce_impl flydsl --mode cudagraph \
--shapes '2,7168,fp16;32,8192,fp32;128,8192,fp16;1024,7168,bf16;4096,8192,bf16' \
--output_csv /tmp/bench_pr.csv
"

- name: Build main branch baseline
id: build-main
timeout-minutes: 20
continue-on-error: true
run: |
docker exec flydsl_test bash -c "
cd /flydsl-test
git fetch origin main --depth=1
git worktree add /tmp/flydsl-main origin/main
cd /tmp/flydsl-main
export MLIR_PATH=/llvm-project/mlir_install
python3 -m pip install -e . --use-pep517 2>&1 | tail -5
"

- name: Run allreduce benchmark (main)
id: bench-main
if: steps.build-main.outcome == 'success'
timeout-minutes: 30
continue-on-error: true
run: |
docker exec flydsl_test bash -c "
cp /flydsl-test/tests/kernels/test_allreduce.py \
/tmp/flydsl-main/tests/kernels/test_allreduce.py
cd /tmp/flydsl-main
python3 tests/kernels/test_allreduce.py \
--world_size 8 --iters 51 --warmup 5 \
--allreduce_impl flydsl --mode cudagraph \
--shapes '2,7168,fp16;32,8192,fp32;128,8192,fp16;1024,7168,bf16;4096,8192,bf16' \
--output_csv /tmp/bench_main.csv
"

- name: Check performance regression (PR vs main)
if: steps.bench-main.outcome != 'skipped'
timeout-minutes: 5
run: |
docker exec flydsl_test bash -c "
cd /flydsl-test
python3 tests/kernels/compare_allreduce_benchmark.py \
/tmp/bench_main.csv /tmp/bench_pr.csv
"

- name: Show test logs
if: failure()
run: |
docker exec flydsl_test bash -c 'cd /tmp && tar czf /tmp/logs.tgz *.log 2>/dev/null || echo "no logs"'
docker cp flydsl_test:/tmp/logs.tgz . || true
if [ -f logs.tgz ]; then
tar -xzf logs.tgz || true
cat *.log || true
else
echo "logs.tgz not found; skipping log extraction"
fi

- name: Clean up
if: always()
run: |
docker stop flydsl_test
docker rm flydsl_test
4 changes: 2 additions & 2 deletions .github/workflows/publish-pypi.yaml
@@ -25,10 +25,10 @@ jobs:
id: version
run: |
TAG_VERSION="${GITHUB_REF_NAME#v}"
PACKAGE_VERSION="$(awk -F'"' '/^_BASE_VERSION = / {print $2; exit}' python/flydsl/__init__.py)"
PACKAGE_VERSION="$(awk -F'"' '/^__version__ = / {print $2; exit}' python/flydsl/__init__.py)"

if [ -z "${PACKAGE_VERSION}" ]; then
echo "Failed to find _BASE_VERSION in python/flydsl/__init__.py" >&2
echo "Failed to find __version__ in python/flydsl/__init__.py" >&2
exit 1
fi

12 changes: 8 additions & 4 deletions CLAUDE.md
@@ -1,6 +1,6 @@
# FlyDSL Project Guide

FlyDSL (Flexible Layout Python DSL) — a Python DSL and MLIR-based compiler stack for authoring high-performance GPU kernels with explicit layouts and tiling on AMD GPUs (MI300X/MI350/MI355X/MI450).

## Repository Layout

@@ -59,9 +59,10 @@ FLYDSL_DUMP_IR=1 PYTHONPATH=./ python tests/kernels/test_pa.py # Dump MLIR IR at

| Arch | Chips | Wave size | MMA | Key features |
|---|---|---|---|---|
| **CDNA3** | gfx942 (MI300X) | 64 | MFMA | BufferCopy, preshuffle GEMM |
| **CDNA4** | gfx950 (MI350/MI355X) | 64 | MFMA | MFMA_SCALE, FP4, 160KB LDS |
| **RDNA4** | gfx1201 (Radeon AI PRO R9700) | 32 | WMMA | RDNA-specific GEMM |
| **gfx1250** | MI450 | 32 | WMMA | TDM ops, FP8/FP4 GEMM, multi-stage pipeline |

## Key Conventions & Pitfalls

@@ -73,3 +74,6 @@ FLYDSL_DUMP_IR=1 PYTHONPATH=./ python tests/kernels/test_pa.py # Dump MLIR IR at
- **Layout API vs buffer_ops**: New kernels should use `fx.rocdl.make_buffer_tensor()` + `copy_atom_call` (layout API). Raw `buffer_ops.create_buffer_resource()` is legacy
- **Arch detection**: Use `from flydsl.runtime.device import get_rocm_arch` (see the sketch after this list)
- **`range` vs `range_constexpr`**: Use `range_constexpr` for compile-time unrolled loops; `range(start, stop, step, init=[...])` for `scf.for` with loop-carried values
- **Branch-local defs**: Do not define a value inside `if/else` and then use it after the branch. Hoist the variable or rewrite the logic so later uses see a single explicit definition path.
- **Nested helper captures**: Inside `@flyc.kernel` / `@flyc.jit`, nested helper functions must not mutate captured outer variables. Read-only capture is acceptable, but writes should go through explicit parameters / returns.
- **Single-exit control flow**: Avoid early `return`. Do not place `return` or `yield` inside `if/else` branches; keep a single explicit exit path so MLIR result types stay well-defined.
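
A minimal sketch of arch-gated dispatch (the string format returned by `get_rocm_arch` is an assumption; see the platform table above for the arch-to-feature mapping):

```python
from flydsl.runtime.device import get_rocm_arch

arch = get_rocm_arch()          # assumed to return a gfx string, e.g. "gfx942"
if arch.startswith("gfx95"):    # CDNA4 (MI350/MI355X): MFMA_SCALE, FP4, 160KB LDS
    kernel_variant = "cdna4"
elif arch.startswith("gfx12"):  # RDNA4 (gfx1201) and gfx1250 (MI450): WMMA
    kernel_variant = "wmma"
else:                           # CDNA3 (gfx942, MI300X): MFMA baseline
    kernel_variant = "cdna3"
```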