Releases: RIKEN-RCCS/GEMMul8
v3.0.5
v3.0.4
There are no performance, algorithmic, or implementation changes in this release.
- Print host compiler information during make
- Add test-driver options for selecting BLAS parameters
v3.0.3
Fix: TRTRMM UPLO-dependent blocking
- Fix an incorrect UPLO-dependent block-size selection in TRTRMM that caused inefficient block decomposition and degraded performance.
- Fix TRTRMM scaling to avoid referring to uninitialized regions when only a triangular part of the intermediate matrix is computed.
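The second fix concerns scaling a matrix of which only one triangle has been computed. A minimal sketch of that idea follows, assuming a column-major layout and a host-side loop. The names `Uplo` and `scale_triangle` are hypothetical and are not taken from GEMMul8; the library's actual kernels run on the GPU and may be organized differently.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of GEMMul8: scale only the computed triangle
// of an n-by-n column-major matrix and never read or write the other triangle.
enum class Uplo { Lower, Upper };

void scale_triangle(std::vector<double>& a, std::size_t n, std::size_t lda,
                    Uplo uplo, double alpha) {
    assert(lda >= n && a.size() >= lda * n);
    for (std::size_t j = 0; j < n; ++j) {
        // Row range of the computed triangle within column j.
        const std::size_t i_begin = (uplo == Uplo::Lower) ? j : 0;
        const std::size_t i_end   = (uplo == Uplo::Lower) ? n : j + 1;
        for (std::size_t i = i_begin; i < i_end; ++i) {
            a[i + j * lda] *= alpha;  // element (i, j)
        }
    }
}
```

Restricting the row range per column means that entries outside the computed triangle, which may hold uninitialized data, are never touched.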
v3.0.2
Fix: Resolve HIPCC compilation errors
Fix HIP/Clang compilation failures in
- scaling kernels,
- mod kernels, and
- test programs.
v3.0.1
Only small non-functional updates are included, such as comment-out adjustments, README updates, and minor test-program cleanup.
There are no performance, algorithmic, or implementation changes in this release.
v3.0.0
Major: Improve GEMM performance and add Level 3 BLAS/mixed-precision support
- Improve the performance of the existing GEMM implementation: gemmul8::gemm, gemmul8::gemmLt
- Add support for the following Level 3 BLAS-like matrix operations:
  - SYMM (gemmul8::symm, gemmul8::symmLt)
  - SYRK (gemmul8::syrk, gemmul8::syrkLt)
  - SYR2K (gemmul8::syr2k, gemmul8::syr2kLt)
  - SYRKX (gemmul8::syrkx, gemmul8::syrkxLt)
  - HERK (gemmul8::herk, gemmul8::herkLt)
  - HER2K (gemmul8::her2k, gemmul8::her2kLt)
  - HERKX (gemmul8::herkx, gemmul8::herkxLt)
  - TRMM (gemmul8::trmm, gemmul8::trmmLt)
  - TRSM (gemmul8::trsm, gemmul8::trsmLt)
  - TRTRMM (gemmul8::trtrmm, gemmul8::trtrmmLt): triangular-by-triangular matrix multiplication
- Add support for mixed-precision execution
- Add workspace-query support by calling GEMMul8 routines with work == nullptr
- Extend gemmul8::workSize to support the routines listed above except TRSM
- Add gemmul8::workSizeTrsm for TRSM workspace-size calculation
- Add TRSM block-size control APIs for the internal blocked algorithm:
  - gemmul8::set_block_size_trsm(int nB)
  - gemmul8::get_block_size_trsm()
- Add overload (Hook Mode) support for the routines listed above
- Add overload (Hook Mode) support for _64, 3m, and 3m_64 variants where applicable
- Change the GEMMul8 routine argument type from unsigned num_moduli to int num_moduli
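The workspace-query item describes a common two-phase calling pattern: call once with a null workspace to learn the required size, allocate, then call again to compute. The sketch below illustrates only that pattern with a toy host-side routine. `toy_scale` and `scale_with_query` are hypothetical names, and the signature is an assumption for illustration, not the GEMMul8 API; consult the GEMMul8 README for the real arguments and return values.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical routine, NOT a GEMMul8 signature: computes y = alpha * x via a
// scratch buffer. When work == nullptr it does no computation and only
// reports how many bytes of workspace a real call would need.
std::size_t toy_scale(std::size_t n, double alpha, const double* x, double* y,
                      double* work) {
    const std::size_t required_bytes = n * sizeof(double);
    if (work == nullptr) {
        return required_bytes;  // query mode: x and y are not dereferenced
    }
    assert(x != nullptr && y != nullptr);
    for (std::size_t i = 0; i < n; ++i) work[i] = alpha * x[i];
    for (std::size_t i = 0; i < n; ++i) y[i] = work[i];
    return required_bytes;
}

// Two-phase use: query the size, allocate the workspace, then compute.
std::vector<double> scale_with_query(const std::vector<double>& x, double alpha) {
    const std::size_t bytes =
        toy_scale(x.size(), alpha, nullptr, nullptr, nullptr);
    std::vector<double> work(bytes / sizeof(double));
    std::vector<double> y(x.size());
    toy_scale(x.size(), alpha, x.data(), y.data(), work.data());
    return y;
}
```

The benefit of this design is that the caller owns the allocation, so a single buffer can be sized once and reused across many calls.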
v2.0.19
Fix: correct CUDA regression introduced by FP8-FNUZ fix
Fixed a CUDA regression introduced by the previous FP8-FNUZ fix.
v2.0.18
Fix: resolve FP8-FNUZ bug on AMD CDNA3
On CDNA3, the FP8 format behavior differs from the commonly assumed definition, which could cause NaN generation during upward rounding. This change avoids that issue.
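To illustrate how two FP8 conventions can disagree, the sketch below decodes the same 8-bit E4M3 pattern under the OCP "FN" convention (bias 7, NaN at exponent and mantissa all ones) and under the "FNUZ" convention (bias 8, a single NaN at 0x80, no negative zero). This is only a host-side illustration of the format difference; whether it matches the exact failure fixed in this release is an assumption, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>

// Decode an E4M3 bit pattern under the OCP "FN" convention (exponent bias 7).
float decode_e4m3fn(std::uint8_t bits) {
    const int sign = bits >> 7;
    const int exp  = (bits >> 3) & 0xF;
    const int man  = bits & 0x7;
    if (exp == 0xF && man == 0x7) return std::numeric_limits<float>::quiet_NaN();
    const float mag = (exp == 0)
        ? std::ldexp(static_cast<float>(man), -9)   // subnormal: man * 2^-9
        : std::ldexp(1.0f + man / 8.0f, exp - 7);   // normal
    return sign ? -mag : mag;
}

// Decode the same layout under the "FNUZ" convention (exponent bias 8).
float decode_e4m3fnuz(std::uint8_t bits) {
    if (bits == 0x80) return std::numeric_limits<float>::quiet_NaN();  // only NaN
    const int sign = bits >> 7;
    const int exp  = (bits >> 3) & 0xF;
    const int man  = bits & 0x7;
    const float mag = (exp == 0)
        ? std::ldexp(static_cast<float>(man), -10)  // subnormal: man * 2^-10
        : std::ldexp(1.0f + man / 8.0f, exp - 8);   // normal
    return sign ? -mag : mag;
}
```

Under these definitions the largest finite value is 448 for FN and 240 for FNUZ, and the pattern 0x80 is negative zero in FN but NaN in FNUZ, so rounding logic written against one convention can land on a NaN encoding in the other.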
v2.0.17
Fix: constant in find_max.hpp
v2.0.16
modified test_flops.hpp