[VERSION] Init triton_v3.6.x branch from triton_v3.5.x #429

Open
zhzhcookie wants to merge 46 commits into triton_v3.6.x from flagtree_v3.6.x

@zhzhcookie
Collaborator

No description provided.

SabYic and others added 30 commits March 13, 2026 09:24
---------

Co-authored-by: zhengyang <zhengyang@baai.ac.cn>
* [feat] support tle extension (#188)

* Support Hints: shared_memory

* Add a new pass 'ProcessSharedMemoryHintPass' for shared_memory hint

* [feat] : support tle extension
 1. Complete TLE implementation with comprehensive testing
 2. support tle.alloc tle.copy tle.local_load tle.local_store tle.pipeline
 3. xcore support tle pipeline

* [chore/refactor] add tle marker && move tle from language extra to experimental

* [chore] translate Chinese comments to English

---------

Co-authored-by: liuxin <liuxin01@hnu.edu.cn>

* [CI] Add auto code-format

* Fix auto commit

* Apply code-format changes

* Clean code & [CI] Add tle tests

* Fix tle pybind

* [HINTS] [TLE] Fix comments

* [HINTS] Fix LoadOp::build

* Apply code-format changes

* [TEST] Add --only_unit_test to tutorials

* [TEST] [CI/CD] Add --only_unit_test to tutorials and add to workflow

* Apply code-format changes

* [TEST] [CI/CD] Add --only_unit_test to tutorials and add to workflow

* [TEST] [CI/CD] Add --only_unit_test to tutorials and add to workflow

* Move tle mlir test suites to third_party/tle/test

* Apply code-format changes

* [CI] Move import sys

* Apply code-format changes

* Remove TLE test runner implementation

* Apply code-format changes

* [TLE] Add Copyright

* Apply code-format changes

---------

Co-authored-by: huanghaoXcore <huanghao@xcoresigma.com>
Co-authored-by: liuxin <liuxin01@hnu.edu.cn>
Co-authored-by: sunnycase <sunnycase@users.noreply.github.com>
Co-authored-by: zhengyang <zhengyang@baai.ac.cn>
Co-authored-by: zhzhcookie <zhzhcookie@users.noreply.github.com>
* [tle-lite] Add tle.load async lowering for GPU

* Add wgmma optimize for async load

* Add copyright

* Apply code-format changes

* Add python/tutorials/tle/01-sparse-mla.py

* Apply code-format changes

* Fix code format

---------

Co-authored-by: sunnycase <sunnycase@users.noreply.github.com>
---------

Co-authored-by: flagtree-bot <flagtree_ai@163.com>
---------

Co-authored-by: flagtree-bot <flagtree_ai@163.com>
* Setup the framework of edsl

Signed-off-by: Jinjie Liu <jjliu@baai.ac.cn>

* support scf.for loop

Signed-off-by: Jinjie Liu <jjliu@baai.ac.cn>

* support softmax

Signed-off-by: Jinjie Liu <jjliu@baai.ac.cn>

* support matmul

Signed-off-by: Jinjie Liu <jjliu@baai.ac.cn>

* modified:   ../triton/experimental/flagtree/edsl/runtime.py (#216)

* add topk example skeleton

Signed-off-by: Jinjie Liu <jjliu@baai.ac.cn>

* implement topk with edsl

Signed-off-by: Jinjie Liu <jjliu@baai.ac.cn>

* replace while-loop-inside part with edsl

Signed-off-by: Jinjie Liu <jjliu@baai.ac.cn>

* fix CMake Dependence bug (#247)

* [FEAT]enable use wheel as external LLVM (#248)

Resolves #211.

* update setup to support llvm from wheel

* allow multiple attempts in one Python process

* delete redundant code

* delete redundant code _ 2

* delete redundant code_3

* [BACKEND] Solve addresspace issue (#251)

resolve #209

* Reverse order of dimensions for memory descriptor

* Update typing.py

* support llvm.ptr as edsl input type

Signed-off-by: Jinjie Liu <jjliu@baai.ac.cn>

* move the examples to tutorials

Signed-off-by: Jinjie Liu <jjliu@baai.ac.cn>

* put edsl into tle

Signed-off-by: Jinjie Liu <jjliu@baai.ac.cn>

* remove unused extension flag

Signed-off-by: Jinjie Liu <jjliu@baai.ac.cn>

* tag modifications as tle raw

Signed-off-by: Jinjie Liu <jjliu@baai.ac.cn>

* Refactor: Move LLVM wheel logic to setup_helper.py

* Apply code-format changes

---------

Signed-off-by: Jinjie Liu <jjliu@baai.ac.cn>
Co-authored-by: Phoenixtree <sunwenjia04@163.com>
Co-authored-by: starrryz <760668919@qq.com>
Co-authored-by: flagtree-bot <flagtree_ai@163.com>
Co-authored-by: zhzhcookie <zhengyang@baai.ac.cn>
* WIP topk[skip ci]

Signed-off-by: Jinjie Liu <jjliu@baai.ac.cn>

* modified:   python/examples/flagtree/03-matrix-multiplication.py
	modified:   python/src/flagtree_ir.cc

* modified:   python/examples/flagtree/03-matrix-multiplication.py

* Update FlagTree IR and semantic implementation

* Add arg_type_hints attribute and type conversion validation

- Add arg_type_hints attribute to DSLRegionOp to store EDSL function type declarations
- Add comprehensive comments explaining TT IR type -> EDSL func type -> LLVM type conversion
- Add error reporting for unsupported types and mismatched address spaces
- All tests (example 01-03) passing

* Update comments in flagtree_ir.cc

* Current implementation before replacing loop with edsl1

* Add error reporting with parameter names and strict type consistency checks

- Add edsl_param_names attribute to store parameter names
- Print parameter names in error messages for easier debugging
- Add strict type consistency checks before type conversion
- Rename arg_type_hints to edsl_param_types for clarity
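
The strict type-consistency check and name-bearing error messages described in the bullets above can be sketched as follows. This is only an illustration in plain Python, not the code in `flagtree_ir.cc`: `ParamType`, `check_edsl_params`, and the field names are hypothetical, and the real check operates on MLIR types rather than on small records like these.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParamType:
    """Hypothetical stand-in for a declared or inferred parameter type."""
    kind: str               # e.g. "ptr" or "scalar"
    elem: str               # element type, e.g. "f32"
    address_space: int = 0  # only meaningful for pointer-like kinds


def check_edsl_params(names, declared, actual):
    """Raise TypeError naming the first parameter whose types disagree."""
    if not (len(names) == len(declared) == len(actual)):
        raise TypeError(
            f"expected {len(names)} parameters, got {len(declared)} declared "
            f"and {len(actual)} actual types")
    for name, want, got in zip(names, declared, actual):
        # Strict check: kind and element type must match exactly.
        if (want.kind, want.elem) != (got.kind, got.elem):
            raise TypeError(
                f"parameter '{name}': declared {want.kind}<{want.elem}>, "
                f"got {got.kind}<{got.elem}>")
        # Mismatched address spaces are reported separately.
        if want.address_space != got.address_space:
            raise TypeError(
                f"parameter '{name}': address space {got.address_space} "
                f"does not match declared {want.address_space}")


names = ["x_ptr", "n"]
declared = [ParamType("ptr", "f32", 1), ParamType("scalar", "i32")]
actual = [ParamType("ptr", "f32", 3), ParamType("scalar", "i32")]
try:
    check_edsl_params(names, declared, actual)
except TypeError as err:
    print(err)
```

Carrying the parameter names alongside the types is what lets the message point at the offending argument instead of at a positional index.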

---------

Signed-off-by: Jinjie Liu <jjliu@baai.ac.cn>
Co-authored-by: SabYic <sunwenjia04@163.com>
Co-authored-by: lxy <15910307812@163.com>
Signed-off-by: Jinjie Liu <jjliu@baai.ac.cn>
---------

Co-authored-by: zhzhcookie <zhengyang@baai.ac.cn>
* [CI] Fix check docs only

* [BUILD] Add PYTHONPATH when set_llvm_env

* [DOC] Fix doc link

* [BUILD] Fix plugin mode check
Updated the title to reflect the correct name of the extension.
* modified:   python/triton/experimental/tle/raw/mlir/codegen.py
	modified:   python/triton/experimental/tle/raw/mlir/utils.py

 Author:    SabYic <sunwenjia04@163.com>
 Date:      Mon Jan 26 12:42:41 2026 +0000

* Apply code-format changes

* Fix string encoding in global_string method

* modified:   python/triton/experimental/tle/raw/mlir/utils.py

* modified:   python/triton/experimental/tle/raw/mlir/utils.py

---------

Co-authored-by: flagtree-bot <flagtree_ai@163.com>
modified stride computation method for matrix mul
* initial design for vassert func

* implement vassert func in edsl

* Apply code-format changes

* remove anno

* add CI tests && restore work to fix issue #328

* Apply code-format changes

* [CI] export LLVM_SYSPATH to env for test

* utilize vassert in 5-topk && add test

* Apply code-format changes

---------

Co-authored-by: flagtree-bot <flagtree_ai@163.com>
Co-authored-by: Jinjie Liu <jjliu@baai.ac.cn>
Co-authored-by: zhengyang <zhengyang@baai.ac.cn>
* Implement Local Pointers Support in TLE

- Introduced `tle.local_ptr()` and `tl.load/tl.store` for accessing shared/tensor memory through pointer tensors.
- Added `LocalPointersOp` to the TLE dialect for managing local pointers.
- Implemented conversion patterns for `LocalPointersOp` to LLVM.
- Enhanced the TLE README to reflect new local operations and usage examples.
- Created passes for assigning shared encodings to local pointers and inserting barriers between local pointer operations.
- Updated the C++ bridge to expose new builder methods for local pointers in Triton.
- Added tests and documentation for the new features in TLE.

* Apply code-format changes

* Fix tests

* Apply code-format changes

* Redesign tle.local_ptr

* Apply code-format changes

* Update sparse-mla

* Fix build

* Add topk & fft tutorials

* Update CI test

* Restore sparse mla

* Apply code-format changes

* Fix build

* Fix build

* Avoid bar.sync between atomics

* Apply code-format changes

* Update lib/Analysis/Alias.cpp

Co-authored-by: Jinjie Liu <jjliu@baai.ac.cn>

* Update lib/Dialect/TritonGPU/Transforms/Pipeliner/AssignLatencies.cpp

Co-authored-by: Jinjie Liu <jjliu@baai.ac.cn>

* Update third_party/tle/dialect/lib/Transforms/TleInsertLocalPointerBarriers.cpp

Co-authored-by: Jinjie Liu <jjliu@baai.ac.cn>

* Apply code-format changes

* Update

* Improve perf

* Apply code-format changes

* Fix _moe_realistic_shapes

* Vectorize moe_align_block_size_vllm_stage2_kernel

* More vectorize

* Fix

* Apply code-format changes

---------

Co-authored-by: flagtree-bot <flagtree_ai@163.com>
Co-authored-by: Jinjie Liu <jjliu@baai.ac.cn>
* modified:   python/triton/experimental/tle/raw/mlir/codegen.py
	modified:   python/triton/experimental/tle/raw/mlir/utils.py

 Author:    SabYic <sunwenjia04@163.com>
 Date:      Mon Jan 26 12:42:41 2026 +0000

* Apply code-format changes

* Fix string encoding in global_string method

* modified:   python/triton/experimental/tle/raw/mlir/utils.py

* modified:   python/tutorials/03-matrix-multiplication.py
	modified:   python/tutorials/tle/raw/03-matrix-multiplication.py
	modified:   third_party/tle/dialect/lib/Conversion/TleToLLVM/ExtractOpToLLVM.cpp
	modified:   third_party/tle/dialect/lib/Transforms/ConvertArgToMemDesc.cpp

* Apply code-format changes

* modified:   python/tutorials/03-matrix-multiplication.py
	modified:   python/tutorials/tle/raw/03-matrix-multiplication.py

* Apply code-format changes

* modified:   python/tutorials/tle/raw/03-matrix-multiplication.py

* modified:   third_party/tle/dialect/lib/Conversion/TleToLLVM/ExtractOpToLLVM.cpp

* modified:   python/triton/experimental/tle/language/raw/core.py

* modified:   python/triton/experimental/tle/language/raw/core.py
	modified:   third_party/tle/dialect/include/IR/TleOps.td
	modified:   third_party/tle/triton_tle.cc
	modified:   third_party/tle/triton_tle_raw.cc
	modified:   third_party/tle/utils/lib/Protocol.cpp

* modified:   third_party/tle/triton_tle_raw.cc

* modified:   third_party/tle/triton_tle_raw.cc

* modified:   third_party/tle/dialect/lib/Transforms/ConvertArgToMemDesc.cpp
	modified:   third_party/tle/triton_tle_raw.cc

* modified:   third_party/tle/dialect/lib/Transforms/ConvertArgToMemDesc.cpp

* Apply code-format changes

* modified:   third_party/tle/dialect/lib/Transforms/ConvertArgToMemDesc.cpp
	modified:   third_party/tle/triton_tle_raw.cc
	modified:   third_party/tle/utils/lib/Protocol.cpp

	modified:   third_party/tle/dialect/lib/Transforms/ConvertArgToMemDesc.cpp
	modified:   third_party/tle/triton_tle_raw.cc

	modified:   third_party/tle/utils/lib/Protocol.cpp

* modified:   python/tutorials/tle/raw/04-hello-world.py

* Apply code-format changes

* modified:   third_party/tle/triton_tle_raw.cc

* modified:   third_party/tle/triton_tle_raw.cc
	modified:   third_party/tle/utils/lib/Protocol.cpp

* modified:   third_party/tle/triton_tle_raw.cc

* modified:   third_party/tle/utils/lib/Protocol.cpp

* modified:   third_party/tle/dialect/lib/Transforms/ConvertArgToMemDesc.cpp

* modified:   third_party/tle/dialect/lib/Transforms/ConvertArgToMemDesc.cpp

* modified:   third_party/tle/dialect/lib/Transforms/ConvertArgToMemDesc.cpp

* modified:   third_party/tle/triton_tle_raw.cc

* modified:   third_party/tle/dialect/lib/Transforms/ConvertArgToMemDesc.cpp
	modified:   third_party/tle/triton_tle_raw.cc
	modified:   third_party/tle/utils/lib/Protocol.cpp

* modified:   third_party/tle/dialect/lib/Transforms/ConvertArgToMemDesc.cpp

* modified:   third_party/tle/triton_tle_raw.cc

* modified:   third_party/tle/dialect/lib/Transforms/ConvertArgToMemDesc.cpp

* modified:   third_party/tle/triton_tle_raw.cc

* modified:   third_party/tle/triton_tle_raw.cc

* modified:   third_party/tle/triton_tle_raw.cc

* Apply code-format changes

---------

Co-authored-by: flagtree-bot <flagtree_ai@163.com>
* modified:   python/setup_tools/setup_helper.py
	modified:   python/setup_tools/utils/__init__.py
	modified:   python/setup_tools/utils/aipu.py
	modified:   python/setup_tools/utils/tools.py
	modified:   setup.py

* Apply code-format changes

---------

Co-authored-by: flagtree-bot <flagtree_ai@163.com>
Co-authored-by: Jinjie Liu <jjliu@baai.ac.cn>
Co-authored-by: Galaxy1458 <55453380+Galaxy1458@users.noreply.github.com>
Signed-off-by: Jinjie Liu <jjliu@baai.ac.cn>
sgjzfzzf and others added 13 commits March 13, 2026 09:51
Signed-off-by: Jinjie Liu <jjliu@baai.ac.cn>
#362)

* [BUILD] Print color when building

* [BUILD] Add unpack .triton shell for offline-building
Signed-off-by: Jinjie Liu <jjliu@baai.ac.cn>
* add hint_manager for cuda

* Apply code-format changes

* update hintmanager : remove backend&version config and so on

* Apply code-format changes

* add TODO, remove priority 3

* Apply code-format changes

---------

Co-authored-by: flagtree-bot <flagtree_ai@163.com>
* [TLERaw] revert llvm::CallOp related codes

Signed-off-by: Jinjie Liu <jjliu@baai.ac.cn>

* [TLERaw] enable topk tle in ci

Signed-off-by: Jinjie Liu <jjliu@baai.ac.cn>

* remove flake8 comments (#338)

Signed-off-by: Jinjie Liu <jjliu@baai.ac.cn>

---------

Signed-off-by: Jinjie Liu <jjliu@baai.ac.cn>
Signed-off-by: Jinjie Liu <jjliu@baai.ac.cn>
…ml files in the workflow folder on triton_v3.5.x (#397)

Co-authored-by: zhzhcookie <zhengyang@baai.ac.cn>
* [BACKEND] support enflame backend

* Apply code-format changes

* update according to reviewer's comments

* fix build issue of tle dialect and add MAX JOB 4

* remove 64 test

* Apply code-format changes

* correct llvm file name

* remove executable attr according to pre-commit

* update unit test

* remove fake .coveragerc

* update readme for latest

* exclude tle test binary

* fix opt not installed issue

* Apply code-format changes

* fix pre-commit

* Apply code-format changes

* fix opt not installed issue

* update readme with FlagTree instead of flagtree

* fix build issue in yaml

* fix build issue of uninstall triton

* offline case

* rollback warning as error change

* remove redundant space

---------

Co-authored-by: flagtree-bot <flagtree_ai@163.com>
* Add initial tle dist impl

* Add tle docs

* Implement submesh distributed barrier support in TLE

- Added support for submesh distributed barriers in the TLE dialect, allowing for more granular synchronization across submeshes.
- Introduced new attributes for the `DistributedBarrierOp` to handle group metadata, including `group_kind`, `group_rank`, `group_shape`, `group_axes`, and `group_mask`.
- Enhanced the `remote` function to validate and handle buffered tensors with appropriate metadata.
- Updated the `device_mesh` class to include launch shape and dimension names for better submesh handling.
- Implemented verification logic for the new attributes in `DistributedBarrierOp`.
- Added tests to ensure correct behavior of the new submesh barrier functionality and validation of remote buffered tensors.
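
As a rough illustration of what submesh-level synchronization means, the sketch below partitions the ranks of a device mesh into groups that would each synchronize independently. It is one plausible reading of the commit message, not the TLE implementation: `submesh_groups` and its arguments are hypothetical, ranks are assumed to be numbered in row-major order, and the relationship to the `group_*` attributes of `DistributedBarrierOp` is not specified here.

```python
from itertools import product


def submesh_groups(mesh_shape, group_axes):
    """Partition row-major ranks of a mesh into submesh groups.

    Ranks that agree on every axis NOT listed in group_axes end up in the
    same group, so a barrier over group_axes spans exactly those axes.
    """
    axes = sorted(set(group_axes))
    if any(a < 0 or a >= len(mesh_shape) for a in axes):
        raise ValueError(f"group_axes {group_axes} out of range for {mesh_shape}")
    fixed_axes = [a for a in range(len(mesh_shape)) if a not in axes]

    groups = {}
    for coord in product(*(range(n) for n in mesh_shape)):
        # Row-major linearization of the mesh coordinate.
        rank = 0
        for c, n in zip(coord, mesh_shape):
            rank = rank * n + c
        key = tuple(coord[a] for a in fixed_axes)
        groups.setdefault(key, []).append(rank)
    return list(groups.values())


# A 2x4 mesh: a barrier over axis 1 synchronizes each row separately.
print(submesh_groups((2, 4), (1,)))
# A barrier over axis 0 synchronizes each column separately.
print(submesh_groups((2, 4), (0,)))
```

Passing every axis yields a single group, which corresponds to a barrier over the whole mesh.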

* Add AxisInfoExt and RemotePointerUtils for Triton dialect enhancements

- Introduced AxisInfoExt.h and AxisInfoExt.cpp to extend axis information analysis with new visitors for LocalPointersOp and RemotePointersOp.
- Implemented RemotePointerUtils.h and RemotePointerUtils.cpp to handle remote pointer information extraction and inference of vector sizes/layouts.
- Added utility functions for managing remote pointer metadata and determining vectorization hints based on tensor properties.
- Enhanced axis analysis capabilities to support new Triton operations, improving optimization opportunities in the Triton dialect.

* Fix TLE local pointer encoding and retune fused launch params

* tle: add grid distributed barrier + optimize moe atomic fused

* remove tle docs from repo and ignore them

* ignore tle docs paths

* remove unintended triton_shared gitlink

* revert unnecessary cluster wait fence insertion

* refine markers and relocate tle alignment test

* Apply code-format changes

* restore core alignment test and split tle coverage

* update tle moe tuning and backend/lowering adjustments

* Apply code-format changes

* Fix

* [TLERaw] Revert `LLVM::CallOp`-Related Modification (#387)

* [TLERaw] revert llvm::CallOp related codes

Signed-off-by: Jinjie Liu <jjliu@baai.ac.cn>

* [TLERaw] enable topk tle in ci

Signed-off-by: Jinjie Liu <jjliu@baai.ac.cn>

* remove flake8 comments (#338)

Signed-off-by: Jinjie Liu <jjliu@baai.ac.cn>

---------

Signed-off-by: Jinjie Liu <jjliu@baai.ac.cn>

* Apply code-format changes

* Update .gitignore to remove specific entries

Remove unnecessary files from .gitignore

* Clean up .gitignore by removing unused entries

Removed entries for Test and Agent from .gitignore

* Enhance TLE Support in TritonNVIDIAGPUToLLVM

- Added conditional compilation for TLE support in LoadStoreOpToLLVM.cpp, TargetInfo.cpp, TargetInfo.h, and TritonGPUToLLVM.cpp.
- Introduced new functions and modified existing ones to handle remote pointer information and shared memory access for TLE.
- Updated the TLE documentation to include examples for scalar shared-memory lookups and clarified the usage of local pointers.
- Ensured compatibility with both TLE and non-TLE builds by using preprocessor directives.

* Apply code-format changes

* Fix

* Refactor TLE-related code in TritonNVIDIAGPUToLLVM

- Removed unnecessary comments marking the beginning and end of TLE sections.
- Consolidated TLE-specific code under preprocessor directives to improve readability.
- Streamlined the inclusion of TLE headers and related logic in LoadStoreOpToLLVM, TargetInfo, and TritonGPUToLLVM files.
- Ensured consistent handling of shared and cluster shared pointers across various operations.
- Enhanced the clarity of cache policy handling in load and store operations.

* Apply code-format changes

---------

Signed-off-by: Jinjie Liu <jjliu@baai.ac.cn>
Co-authored-by: flagtree-bot <flagtree_ai@163.com>
Co-authored-by: Jinjie Liu <jjliu@baai.ac.cn>
…#400)

* insert syncthreads before tle_dslregion automatically

Signed-off-by: Jinjie Liu <jjliu@baai.ac.cn>

* check whether copy exists before inserting nvvm.barrier0

Signed-off-by: Jinjie Liu <jjliu@baai.ac.cn>

---------

Signed-off-by: Jinjie Liu <jjliu@baai.ac.cn>
Co-authored-by: Phoenixtree <sunwenjia04@163.com>
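
The two commits above describe inserting a barrier before a DSL region, but only when a copy precedes it. A minimal sketch of that rule over a flat list of operation names follows. It is an assumption-laden toy, not the actual pass: `insert_barriers` is hypothetical, the strings `tle.copy`, `tle_dslregion`, and `nvvm.barrier0` are used only as labels taken from the commit messages, and the real transformation works on MLIR operations rather than on strings.

```python
def insert_barriers(ops, region="tle_dslregion", copy="tle.copy",
                    barrier="nvvm.barrier0"):
    """Insert a barrier before each region that follows an unsynchronized copy."""
    out = []
    pending_copy = False  # True while a copy has not yet been followed by a barrier
    for op in ops:
        if op == copy:
            pending_copy = True
        elif op == barrier:
            pending_copy = False  # an existing barrier already covers the copy
        elif op == region and pending_copy:
            out.append(barrier)
            pending_copy = False
        out.append(op)
    return out


print(insert_barriers(["tle.copy", "tle_dslregion"]))
print(insert_barriers(["arith.addi", "tle_dslregion"]))
```

Checking for a preceding copy avoids paying for a synchronization when nothing has been written that the region could observe.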
#define TRITON_ATTR_DEFS

include "mlir/IR/EnumAttr.td"
include "mlir/IR/AttrTypeBase.td"
Collaborator Author

Necessary?

}

} // namespace

Collaborator Author

Is this tle?

auto ptrElems = unpackLLElements(loc, llPtr, rewriter);
assert(ptrElems.size() == numElems);
const bool isSharedPtr = isSharedPointerValue(ptrElems);
Collaborator Author

Necessary?

@CLAassistant

CLAassistant commented Mar 13, 2026

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
10 out of 11 committers have signed the CLA.

✅ SabYic
✅ zhzhcookie
✅ Galaxy1458
✅ sgjzfzzf
✅ sunnycase
✅ starrryz
✅ git-flyer
✅ baoqiliu
✅ i3wanna2
✅ flagtree-bot
❌ FlagTree


FlagTree does not appear to be a GitHub user. You need a GitHub account to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
