Merged

Changes from all commits (29 commits)
3346bc9
implement Vector type, remove vector dialect in kernels and use soffs…
coderfeli Apr 10, 2026
9a84cc9
[FIX] Support AOT cross-compilation with COMPILE_ONLY cache save (#383)
coderfeli Apr 12, 2026
ab2c1cb
[Docs] Add MI355X/gfx1201 and MI450/gfx1250 to platform docs (#384)
coderfeli Apr 12, 2026
b1688aa
update to v0.1.3 (#385)
coderfeli Apr 12, 2026
bf6a8d0
Pr/a16wi4 group (#370)
yadaish Apr 13, 2026
c5d83df
[FEAT] Add get_leaves op and support convert dyn_tuple to py_tuple (#…
sjfeng1999 Apr 13, 2026
c5c4f6f
remove ci massive llvm log (#394)
coderfeli Apr 13, 2026
e4ab0a9
Change result type of elem_less/equal to i1 (#392)
sjfeng1999 Apr 13, 2026
c8dbdaf
[MLIR] enhance get_scalar op, only requires dyn_leaf_cnt = 1 (#398)
sjfeng1999 Apr 14, 2026
74449c4
support gptoss gemm shape by padding (#397)
aoli26 Apr 14, 2026
6cb26c4
Add MakeFragmentLayoutLikeOp, improve robustness of prim funcs (#399)
sjfeng1999 Apr 14, 2026
fd9ca4f
support split-k algo for moe_gemm_2stage (#390)
yadaish Apr 14, 2026
f65e930
Gfx1250 moe (#402)
XingerZhu Apr 16, 2026
6e635c6
Add CI testcases and benchmark for allreduce (#387)
yanboshao Apr 16, 2026
a28e150
[OPT] Add pass: convert-atom-call-to-ssa-form (#407)
sjfeng1999 Apr 16, 2026
2e71f65
[docs] add frontend semantic restrictions for MLIR kernel authoring (…
zhiding512 Apr 16, 2026
f9120a8
Detach requires_grad tensors before FlyDSL DLPack export (#409)
zhiding512 Apr 16, 2026
68f5725
[OPT] Add pass promote-regmem-to-vectorssa (#410)
sjfeng1999 Apr 17, 2026
8d3456c
fix version err after uninstall (#413)
coderfeli Apr 19, 2026
23f59ab
Add fused epilogue support to preshuffle GEMM: bias + ReLU/SiLU/GeLU …
andyluo7 Apr 20, 2026
bdcfc1e
[Fix] eliminate llvm unsupported type (Float8/4) before llvm conversi…
sjfeng1999 Apr 20, 2026
c1ea698
[ROCDL] Add CDNA4_MFMAScaleType (#417)
sjfeng1999 Apr 21, 2026
898c8d3
Adjust allreduce CI benchmark thresholds and replace error column wit…
yanboshao Apr 21, 2026
0139940
[Agent] New skill: add-target-atom-op (#423)
sjfeng1999 Apr 21, 2026
3f7b6b5
Port v2 gemm to main (#422)
coderfeli Apr 22, 2026
ff6df13
[OPT] Add fly-int-swizzle-simplify pass (#427)
sjfeng1999 Apr 23, 2026
25510d1
Implement MLA decode fwd nh128 a8w8 kernel with FlyDSL. (#403)
ruanjm Apr 23, 2026
5191746
[Perf] Port mixed_moe kernel optimizations for stage1/stage2 (#388)
lalala-sh Apr 23, 2026
19d6f6e
improve fused_rope kernel (#416)
amd-weisun Apr 23, 2026
472 changes: 472 additions & 0 deletions .claude/skills/add-target-atom-op/SKILL.md

Large diffs are not rendered by default.

82 changes: 70 additions & 12 deletions .claude/skills/flydsl-kernel-authoring/SKILL.md
@@ -224,6 +224,39 @@ for i in range(runtime_value):
...
```

### Frontend Semantic Restrictions
When writing or reviewing `@flyc.kernel` / `@flyc.jit` code, proactively avoid the following patterns: they can conflict with MLIR construction even though they are valid plain Python.

1. **Do not define values inside `if/else` and use them later outside the branch.** Keep a single explicit definition path.
```python
if cond:
dst = a
else:
dst = b
use(dst) # avoid this pattern
```
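
When both branch values can be computed unconditionally, one rewrite that keeps a single definition path is `arith.select` (covered later in this skill); `cond`, `a`, `b`, and `use` are illustrative names:
```python
# Single definition path: both candidates are materialized, then one
# is selected, so `dst` has exactly one definition site.
dst = arith.select(cond, a, b)
use(dst)
```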

2. **Do not mutate captured outer variables inside nested helper functions.** Read-only closure capture is acceptable, but writes should go through explicit parameters and return values.
```python
def kernel():
acc = fx.Float32(0.0)

def helper(acc):
acc = acc + fx.Float32(1.0)
return acc

acc = helper(acc)
```

3. **Avoid early `return`, and do not place `return` / `yield` inside `if/else` branches.** Prefer a single explicit exit so the frontend can determine result types.
```python
if cond:
out = v0
else:
out = v1
return out
```

### scf.for with Loop-Carried Values (Software Pipelining)

Use `init=` on `range()` to create an `scf.for` with explicit SSA phi nodes for loop-carried state. This is required for software pipelining (prefetch patterns) where data must flow across iterations.
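
A minimal sketch of the shape (how the loop body receives and rebinds the carried value is an assumption here; the full example that follows shows the exact protocol):

```python
# `acc` is threaded through the scf.for as an SSA phi node.
acc = Vec.filled(vec_width, 0.0, Float32)        # initial loop-carried value
for i, acc in range(0, K, BLOCK_K, init=[acc]):  # assumed unpacking of (iv, carried)
    tile = load_tile(i)  # illustrative prefetch helper
    acc = acc + tile     # rebound value is carried into the next iteration
```
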
@@ -300,27 +333,52 @@ result = arith.select(cond, true_val, false_val)
is_less = arith.cmpf(a, b, predicate="olt") # ordered less-than
```

### Internal Types: Vector and Numeric (PREFERRED)

Use FlyDSL's internal typed system instead of raw MLIR ops. The `Vector` class wraps `vector<NxTy>` with operator overloading and type-safe methods.

```python
from flydsl._mlir.ir import VectorType
from flydsl.expr.typing import Vector as Vec, Float32, Float16, BFloat16

# Wrap raw vector values
acc = Vec(frag_C.load())  # vector<Nxf32> → Vector with * / + operators

# Indexing (replaces vector.extract)
val = acc[idx]  # returns Float32 scalar

# Bitcast (replaces vector.bitcast)
v_f32 = Vec(raw_vec).bitcast(Float32)  # vector<Nxi32> → vector<Nxf32>

# Type conversion (replaces arith.trunc_f / arith.ext_f)
bf16_val = f32_val.to(BFloat16)  # f32 → bf16

# Arithmetic — use Python operators, not arith.mulf/addf
result = (val * scale_a) * scale_b  # auto-dispatches to mulf

# Splat constant vector
zeros = Vec.filled(N, 0.0, Float32)

# Raw VectorType splat and raw-op style remain available when needed
vec_type = VectorType.get([vec_width], fx.T.f32())
scale_vec = arith.constant_vector(2.0, vec_type)  # splat constant vector (all 2.0)
vA = fx.memref_load_vec(rA)     # load vec from register
vC = arith.mulf(vA, scale_vec)  # element-wise scale

# Index cast — use fx.Int32 instead of arith.index_cast
idx = fx.Int32(gpu.block_id("x") * tile_m)
```

**Prefer internal types over raw ops:**
| Raw MLIR op | Internal type equivalent |
|-------------|------------------------|
| `vector.extract(v, static_position=[i], ...)` | `Vec(v)[i]` |
| `vector.bitcast(target_ty, v)` | `Vec(v).bitcast(Float32)` |
| `arith.trunc_f(ty, v)` | `v.to(BFloat16)` |
| `arith.mulf(a, b)` | `a * b` |
| `arith.addf(a, b)` | `a + b` |
| `arith.index_cast(T.i32, v)` | `fx.Int32(v)` |

Still use `vector.from_elements` for building vectors from individual scalars (no `Vector` equivalent yet); `arith.constant_vector` also remains available when a splat is needed as a raw `VectorType` value rather than a `Vec`.
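
To make the table concrete, here is the same two-step epilogue written both ways; `a` and `b` are illustrative scalars, and `T.bf16` is assumed by analogy with `T.f32` / `T.i32`:

```python
# Raw MLIR ops
s = arith.mulf(a, b)
out = arith.trunc_f(T.bf16, s)

# Internal types
s = a * b             # dispatches to arith.mulf
out = s.to(BFloat16)  # replaces arith.trunc_f
```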

### Arith Ops Availability Table
| Operation | Function | Works on Vectors | Notes |
|-----------|----------|-----------------|-------|
| Add | `a + b` or `arith.addf(a, b)` | Yes | |
| Multiply | `a * b` or `arith.mulf(a, b)` | Yes | |
| Negate | `arith.negf(a)` | Yes | |
| Max | `arith.maximumf(a, b)` | Yes | Good for ReLU |
| Compare | `arith.cmpf(a, b, pred)` | Yes | Returns i1/vec<i1> |
33 changes: 27 additions & 6 deletions .claude/skills/port-to-layout-api/SKILL.md
@@ -31,7 +31,7 @@ Read the kernel and classify each buffer_load/buffer_store:
| Pattern | Layout API Port | Example |
|---------|----------------|---------|
| Contiguous vec load along innermost dim | `make_buffer_tensor` + `BufferCopy128b` | Load 8xf16 from row |
| Scalar load (vec_width=1) | `make_buffer_tensor` + `BufferCopy32b`/`BufferCopy16b` | Scale/metadata loads |
| Scattered store (non-contiguous layout) | Keep as `buffer_ops.buffer_store` | Non-flash value_cache |
| Contiguous vec store along innermost dim | `make_buffer_tensor` + `BufferCopy` | Store 8xf16 to output |
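
For orientation, the first row of this table looks roughly as follows, sketched by analogy with the scalar pattern in Step 6 below; the copy-atom width, the layout arguments, and `T.f16` are assumptions to adapt per kernel:

```python
buf = fx.rocdl.make_buffer_tensor(tensor, max_size=True)
copy_atom_v = fx.make_copy_atom(fx.rocdl.BufferCopy128b(), 128)  # 8 x f16 = 128 bits
vec_reg_ty = fx.MemRefType.get(T.f16, fx.LayoutType.get(8, 1), fx.AddressSpace.Register)
vec_reg_lay = fx.make_layout(8, 1)
div = fx.logical_divide(buf, fx.make_layout(8, 1))

r = fx.memref_alloca(vec_reg_ty, vec_reg_lay)  # 8xf16 register fragment
fx.copy_atom_call(copy_atom_v, fx.slice(div, (None, fx.Int32(row))), r)  # `row` is illustrative
vals = Vec(fx.memref_load_vec(r))  # vector<8xf16> ready for compute
```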

@@ -138,12 +138,33 @@ if is_valid:
_store_vec(val, out_div, idx)
```

### Step 6: Scalar Loads via Layout API

Scalar loads (vec_width=1) also work through the layout API:

```python
buf = fx.rocdl.make_buffer_tensor(tensor, max_size=True)
copy_atom_s = fx.make_copy_atom(fx.rocdl.BufferCopy32b(), 32) # f32 scalar
scalar_reg_ty = fx.MemRefType.get(T.f32, fx.LayoutType.get(1, 1), fx.AddressSpace.Register)
scalar_reg_lay = fx.make_layout(1, 1)
div = fx.logical_divide(buf, fx.make_layout(1, 1))

def load_scalar(index):
r = fx.memref_alloca(scalar_reg_ty, scalar_reg_lay)
fx.copy_atom_call(copy_atom_s, fx.slice(div, (None, fx.Int32(index))), r)
return Vec(fx.memref_load_vec(r))[0] # extract scalar from vector<1xf32>
```

Scalar stores work the same way (reverse src/dst):
```python
def store_scalar(index, val):
r = fx.memref_alloca(scalar_reg_ty, scalar_reg_lay)
fx.memref_store_vec(Vec.filled(1, val, Float32), r)
fx.copy_atom_call(copy_atom_s, r, fx.slice(div, (None, fx.Int32(index))))
```
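
Hypothetical usage of the two helpers, reading a per-block scale and writing an adjusted value back (`block_id` and `gain` are illustrative):

```python
scale = load_scalar(block_id)         # Float32 pulled through the layout API
store_scalar(block_id, scale * gain)  # scalar math via operator overloading
```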

Keep `buffer_ops` only for:
- Scattered stores where elements are truly non-contiguous in memory

### Step 7: Remove Dead Code

179 changes: 178 additions & 1 deletion .github/workflows/flydsl.yaml
@@ -9,6 +9,11 @@ on:
- main
workflow_dispatch:

permissions:
contents: read
actions: read
pull-requests: read

concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true
@@ -19,10 +24,18 @@ env:
GITHUB_COMMIT_SHA: ${{ github.event.pull_request.head.sha || github.event.head_commit.id }}

jobs:
# ---------------------------------------------------------------------------
# Single-GPU tests: kernels, unit, examples, MLIR FileCheck, benchmarks.
# Runs on 1-GPU and Navi runners only.
# ---------------------------------------------------------------------------
test:
strategy:
matrix:
runners: [
'linux-flydsl-mi325-1',
'linux-flydsl-mi355-1',
'linux-flydsl-navi-2',
]
fail-fast: false
runs-on: ${{ matrix.runners }}
steps:
@@ -169,3 +182,167 @@
run: |
docker stop flydsl_test
docker rm flydsl_test

# ---------------------------------------------------------------------------
# Multi-GPU allreduce tests: ONLY for 8-GPU runners.
# Runs on BOTH linux-flydsl-mi325-8 AND linux-flydsl-mi355-8 independently.
# fail-fast: false ensures both runners always complete even if one fails.
# ---------------------------------------------------------------------------
multi-gpu:
needs: test
name: Multi-GPU AllReduce Tests (${{ matrix.runners }})
timeout-minutes: 120
strategy:
matrix:
runners: [
'linux-flydsl-mi325-8',
'linux-flydsl-mi355-8',
]
fail-fast: false
runs-on: ${{ matrix.runners }}
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
repository: ${{ env.GITHUB_REPO_NAME }}
ref: ${{ env.GITHUB_COMMIT_SHA }}
path: flydsl-test

- name: Start CI container
run: |
echo "Clean up containers..."
docker ps -aq -f name=flydsl_test | xargs -r docker stop | xargs -r docker rm || true

echo "Start CI container..."
if [ -f "/etc/podinfo/gha-render-devices" ]; then
DEVICE_FLAG=$(cat /etc/podinfo/gha-render-devices)
else
DEVICE_FLAG="--device /dev/dri"
fi

docker run -dt --network=host --user root --device=/dev/kfd $DEVICE_FLAG \
-v "${GITHUB_WORKSPACE:-$PWD}/flydsl-test:/flydsl-test" \
--ipc=host --group-add video \
--shm-size 16g \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
-w /flydsl-test \
--name flydsl_test \
${{ env.DOCKER_IMAGE }}
env:
GITHUB_WORKSPACE: ${{ github.workspace }}

- name: Install dependencies
run: |
docker exec flydsl_test bash -c "apt-get update && apt-get install -y cmake build-essential patchelf"
docker exec flydsl_test bash -c "python3 -m pip install -U pip setuptools wheel"
docker exec flydsl_test bash -c "python3 -m pip install ninja>=1.11.1"
docker exec flydsl_test bash -c "python3 -m pip install -U 'hypothesis>=6.82.0'"
docker exec flydsl_test bash -c "git config --global --add safe.directory /flydsl-test && cd /flydsl-test && git log"

- name: Restore cached MLIR install tarball (if available)
id: mlir-cache
uses: actions/cache@v4
with:
path: mlir_install.tgz
key: mlir-install-${{ matrix.runners }}-${{ hashFiles('flydsl-test/thirdparty/llvm-hash.txt', 'flydsl-test/scripts/build_llvm.sh', 'flydsl-test/CMakeLists.txt', 'flydsl-test/.github/workflows/flydsl.yaml') }}

- name: Use cached MLIR install tarball (skip LLVM build)
if: steps.mlir-cache.outputs.cache-hit == 'true'
run: |
ls -lh mlir_install.tgz
docker cp mlir_install.tgz flydsl_test:/tmp/mlir_install.tgz
docker exec flydsl_test bash -c "rm -rf /llvm-project/mlir_install && mkdir -p /llvm-project && tar -xzf /tmp/mlir_install.tgz -C /llvm-project"
docker exec flydsl_test bash -c "ls -la /llvm-project/mlir_install/lib/cmake/mlir"

- name: Build LLVM
if: steps.mlir-cache.outputs.cache-hit != 'true'
run: |
set -ex
docker exec flydsl_test bash -c "cd /flydsl-test && bash scripts/build_llvm.sh"
docker exec flydsl_test bash -c "ls -la /llvm-project/mlir_install/lib/cmake/mlir"
docker cp flydsl_test:/llvm-project/mlir_install.tgz ./mlir_install.tgz || true

- name: Build FlyDSL (uses MLIR install prefix)
run: |
docker exec flydsl_test bash -c "export MLIR_PATH=/llvm-project/mlir_install && cd /flydsl-test && python3 -m pip install -e . --use-pep517"

- name: Run multi-GPU allreduce tests
timeout-minutes: 30
run: |
docker exec flydsl_test bash -c "
cd /flydsl-test
python3 -m pytest tests/kernels/test_allreduce.py \
-m multi_gpu -v --no-header --tb=short
"

- name: Run allreduce benchmark (PR)
timeout-minutes: 30
run: |
docker exec flydsl_test bash -c "
cd /flydsl-test
python3 tests/kernels/test_allreduce.py \
--world_size 8 --iters 51 --warmup 5 \
--allreduce_impl flydsl --mode cudagraph \
--shapes '2,7168,fp16;32,8192,fp32;128,8192,fp16;1024,7168,bf16;4096,8192,bf16' \
--output_csv /tmp/bench_pr.csv
"

- name: Build main branch baseline
id: build-main
timeout-minutes: 20
continue-on-error: true
run: |
docker exec flydsl_test bash -c "
cd /flydsl-test
git fetch origin main --depth=1
git worktree add /tmp/flydsl-main origin/main
cd /tmp/flydsl-main
export MLIR_PATH=/llvm-project/mlir_install
python3 -m pip install -e . --use-pep517 2>&1 | tail -5
"

- name: Run allreduce benchmark (main)
id: bench-main
if: steps.build-main.outcome == 'success'
timeout-minutes: 30
continue-on-error: true
run: |
docker exec flydsl_test bash -c "
cp /flydsl-test/tests/kernels/test_allreduce.py \
/tmp/flydsl-main/tests/kernels/test_allreduce.py
cd /tmp/flydsl-main
python3 tests/kernels/test_allreduce.py \
--world_size 8 --iters 51 --warmup 5 \
--allreduce_impl flydsl --mode cudagraph \
--shapes '2,7168,fp16;32,8192,fp32;128,8192,fp16;1024,7168,bf16;4096,8192,bf16' \
--output_csv /tmp/bench_main.csv
"

- name: Check performance regression (PR vs main)
if: steps.bench-main.outcome != 'skipped'
timeout-minutes: 5
run: |
docker exec flydsl_test bash -c "
cd /flydsl-test
python3 tests/kernels/compare_allreduce_benchmark.py \
/tmp/bench_main.csv /tmp/bench_pr.csv
"

- name: Show test logs
if: failure()
run: |
docker exec flydsl_test bash -c 'cd /tmp && tar czf /tmp/logs.tgz *.log 2>/dev/null || echo "no logs"'
docker cp flydsl_test:/tmp/logs.tgz . || true
if [ -f logs.tgz ]; then
tar -xzf logs.tgz || true
cat *.log || true
else
echo "logs.tgz not found; skipping log extraction"
fi

- name: Clean up
if: always()
run: |
docker stop flydsl_test
docker rm flydsl_test
4 changes: 2 additions & 2 deletions .github/workflows/publish-pypi.yaml
@@ -25,10 +25,10 @@ jobs:
id: version
run: |
TAG_VERSION="${GITHUB_REF_NAME#v}"
PACKAGE_VERSION="$(awk -F'"' '/^_BASE_VERSION = / {print $2; exit}' python/flydsl/__init__.py)"
PACKAGE_VERSION="$(awk -F'"' '/^__version__ = / {print $2; exit}' python/flydsl/__init__.py)"

if [ -z "${PACKAGE_VERSION}" ]; then
echo "Failed to find _BASE_VERSION in python/flydsl/__init__.py" >&2
echo "Failed to find __version__ in python/flydsl/__init__.py" >&2
exit 1
fi

12 changes: 8 additions & 4 deletions CLAUDE.md
@@ -1,6 +1,6 @@
# FlyDSL Project Guide

FlyDSL (Flexible Layout Python DSL) — a Python DSL and MLIR-based compiler stack for authoring high-performance GPU kernels with explicit layouts and tiling on AMD GPUs (MI300X/MI350/MI355X/MI450).

## Repository Layout

@@ -59,9 +59,10 @@ FLYDSL_DUMP_IR=1 PYTHONPATH=./ python tests/kernels/test_pa.py # Dump MLIR IR at

| Arch | Chips | Wave size | MMA | Key features |
|---|---|---|---|---|
| **CDNA3** | gfx942 (MI300X) | 64 | MFMA | BufferCopy, preshuffle GEMM |
| **CDNA4** | gfx950 (MI350/MI355X) | 64 | MFMA | MFMA_SCALE, FP4, 160KB LDS |
| **RDNA4** | gfx1201 (Radeon AI PRO R9700) | 32 | WMMA | RDNA-specific GEMM |
| **gfx1250** | MI450 | 32 | WMMA | TDM ops, FP8/FP4 GEMM, multi-stage pipeline |

## Key Conventions & Pitfalls

@@ -73,3 +74,6 @@ FLYDSL_DUMP_IR=1 PYTHONPATH=./ python tests/kernels/test_pa.py # Dump MLIR IR at
- **Layout API vs buffer_ops**: New kernels should use `fx.rocdl.make_buffer_tensor()` + `copy_atom_call` (layout API). Raw `buffer_ops.create_buffer_resource()` is legacy
- **Arch detection**: Use `from flydsl.runtime.device import get_rocm_arch` (see the sketch after this list)
- **`range` vs `range_constexpr`**: Use `range_constexpr` for compile-time unrolled loops; `range(start, stop, step, init=[...])` for `scf.for` with loop-carried values
- **Branch-local defs**: Do not define a value inside `if/else` and then use it after the branch. Hoist the variable or rewrite the logic so later uses see a single explicit definition path.
- **Nested helper captures**: Inside `@flyc.kernel` / `@flyc.jit`, nested helper functions must not mutate captured outer variables. Read-only capture is acceptable, but writes should go through explicit parameters / returns.
- **Single-exit control flow**: Avoid early `return`. Do not place `return` or `yield` inside `if/else` branches; keep a single explicit exit path so MLIR result types stay well-defined.
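
A minimal sketch of arch-gated dispatch (the string format returned by `get_rocm_arch` is an assumption; see the platform table above for the arch-to-feature mapping):

```python
from flydsl.runtime.device import get_rocm_arch

arch = get_rocm_arch()          # assumed to return a gfx string, e.g. "gfx942"
if arch.startswith("gfx95"):    # CDNA4 (MI350/MI355X): MFMA_SCALE, FP4, 160KB LDS
    kernel_variant = "cdna4"
elif arch.startswith("gfx12"):  # RDNA4 (gfx1201) and gfx1250 (MI450): WMMA
    kernel_variant = "wmma"
else:                           # CDNA3 (gfx942, MI300X): MFMA baseline
    kernel_variant = "cdna3"
```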