Releases · RIKEN-RCCS/GEMMul8

25 Jun 13:33

v3.0.5

6cfe0b3

v3.0.5 Latest


Update: test defaults and remove Hopper FP8 support

- Change the default number of test trials.
- Remove support for FP8-based emulation on Hopper architectures. INT8-based emulation remains supported on Hopper.


11 Jun 04:45

UCHINO-Yuki

v3.0.4

884a287

v3.0.4

There are no performance, algorithmic, or implementation changes in this release.

- Print host compiler information during make.
- Add test-driver options for selecting BLAS parameters.


09 Jun 10:56

UCHINO-Yuki

v3.0.3

9c1dba4

v3.0.3

Fix: TRTRMM UPLO-dependent blocking

- Fix an incorrect UPLO-dependent block-size selection in TRTRMM that caused inefficient block decomposition and degraded performance.
- Fix TRTRMM scaling so that it no longer reads uninitialized regions when only a triangular part of the intermediate matrix is computed.
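
The second fix is about touching only the triangle that was actually computed. The following is a minimal host-side sketch of that idea, assuming a column-major n-by-n buffer. The helper `scale_triangle` and the `Uplo` enum are hypothetical names introduced for illustration; they are not GEMMul8 APIs, and the library's own scaling kernels are not shown in these release notes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not a GEMMul8 API: scales only the requested triangle
// of a column-major n-by-n matrix and never reads or writes the other one.
enum class Uplo { Lower, Upper };

void scale_triangle(std::vector<double>& c, std::size_t n, std::size_t ldc,
                    Uplo uplo, double alpha) {
    assert(ldc >= n && c.size() >= ldc * n);
    for (std::size_t j = 0; j < n; ++j) {
        // Row range of column j that belongs to the requested triangle
        // (the diagonal is included in both cases).
        const std::size_t begin = (uplo == Uplo::Lower) ? j : 0;
        const std::size_t end   = (uplo == Uplo::Lower) ? n : j + 1;
        for (std::size_t i = begin; i < end; ++i) {
            c[i + j * ldc] *= alpha;
        }
    }
}
```

Restricting the loop bounds per column, rather than scaling the full matrix and masking afterwards, means the entries outside the triangle can stay uninitialized without affecting the result.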


08 Jun 12:14

UCHINO-Yuki

v3.0.2

396001d

v3.0.2

Fix: Resolve HIPCC compilation errors

Fix HIP/Clang compilation failures in
- scaling kernels,
- mod kernels, and
- test programs.


04 Jun 16:31

UCHINO-Yuki

v3.0.1

347041b

v3.0.1

Only small non-functional updates are included: adjustments to commented-out code, README updates, and minor test-program cleanup.
There are no performance, algorithmic, or implementation changes in this release.


04 Jun 06:47

UCHINO-Yuki

v3.0.0

ad1e9a7

v3.0.0

Major: Improve GEMM performance and add Level 3 BLAS/mixed-precision support

- Improve the performance of the existing GEMM implementation:
  - gemmul8::gemm, gemmul8::gemmLt
- Add support for the following Level 3 BLAS-like matrix operations:
  - SYMM (gemmul8::symm, gemmul8::symmLt)
  - SYRK (gemmul8::syrk, gemmul8::syrkLt)
  - SYR2K (gemmul8::syr2k, gemmul8::syr2kLt)
  - SYRKX (gemmul8::syrkx, gemmul8::syrkxLt)
  - HERK (gemmul8::herk, gemmul8::herkLt)
  - HER2K (gemmul8::her2k, gemmul8::her2kLt)
  - HERKX (gemmul8::herkx, gemmul8::herkxLt)
  - TRMM (gemmul8::trmm, gemmul8::trmmLt)
  - TRSM (gemmul8::trsm, gemmul8::trsmLt)
  - TRTRMM (gemmul8::trtrmm, gemmul8::trtrmmLt): triangular-by-triangular matrix multiplication
- Add support for mixed-precision execution
- Add workspace-query support by calling GEMMul8 routines with work == nullptr
- Extend gemmul8::workSize to support the routines listed above except TRSM
- Add gemmul8::workSizeTrsm for TRSM workspace-size calculation
- Add TRSM block-size control APIs for the internal blocked algorithm:
  - gemmul8::set_block_size_trsm(int nB)
  - gemmul8::get_block_size_trsm()
- Add overload (Hook Mode) support for the routines listed above
- Add overload (Hook Mode) support for _64, 3m, and 3m_64 variants where applicable
- Change the GEMMul8 routine argument type from unsigned num_moduli to int num_moduli
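
The workspace-query item follows a common query-then-allocate idiom, which can be sketched as below. This is an illustration only: `mock_gemm`, its parameters, its size formula, and its return convention are all assumptions made for this sketch and are not the GEMMul8 signature. The actual argument lists of gemmul8::gemm and gemmul8::workSize are not given in these release notes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for a library routine. The name, parameters, and return convention
// are assumptions for illustration and are NOT the GEMMul8 signature.
// With work == nullptr it only reports the required workspace size in bytes.
std::size_t mock_gemm(std::size_t m, std::size_t n, std::size_t k,
                      int num_moduli, void* work) {
    assert(num_moduli > 0);
    // Made-up size formula: one byte per input element per modulus.
    const std::size_t required =
        (m * k + k * n) * static_cast<std::size_t>(num_moduli);
    if (work == nullptr) {
        return required;  // query mode: no computation is performed
    }
    // A real routine would compute here, using `work` as scratch space.
    return required;
}

// Query-then-allocate: ask for the size first, then pass a buffer that large.
std::vector<unsigned char> run_with_workspace(std::size_t m, std::size_t n,
                                              std::size_t k, int num_moduli) {
    const std::size_t bytes = mock_gemm(m, n, k, num_moduli, nullptr);
    std::vector<unsigned char> work(bytes);
    mock_gemm(m, n, k, num_moduli, work.data());
    return work;
}
```

The point of the idiom is that the caller never hard-codes a workspace formula: the same routine that consumes the buffer also reports how large it must be.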


06 Apr 06:27

UCHINO-Yuki

v2.0.19

3115e70

v2.0.19

Fix: correct CUDA regression introduced by FP8-FNUZ fix

Fix a CUDA regression introduced by the FP8-FNUZ fix in v2.0.18.


06 Apr 04:34

UCHINO-Yuki

v2.0.18

541433d

v2.0.18

Fix: resolve FP8-FNUZ bug on AMD CDNA3

On CDNA3, the FP8 (FNUZ) format behaves differently from the commonly assumed FP8 definition, which could generate NaN during upward rounding.
This release avoids that issue.
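
The release note does not describe the exact mechanism of the bug. As background, the sketch below decodes the same 8-bit pattern under two E4M3 conventions: the OCP-style E4M3 ("FN") encoding and the FNUZ variant. The decoder names are hypothetical and are not part of GEMMul8. Which FP8 type the library uses internally, and how its rounding code reached a NaN, are not stated in the source, so this only illustrates that the two formats disagree on some bit patterns.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>

// OCP-style E4M3 ("FN"): bias 7, NaN at S.1111.111, signed zero, no infinities.
float decode_e4m3fn(std::uint8_t bits) {
    const int sign = bits >> 7;
    const int exp  = (bits >> 3) & 0xF;
    const int man  = bits & 0x7;
    if (exp == 0xF && man == 0x7) {
        return std::numeric_limits<float>::quiet_NaN();
    }
    const float mag = (exp == 0)
        ? std::ldexp(static_cast<float>(man) / 8.0f, -6)              // subnormal
        : std::ldexp(1.0f + static_cast<float>(man) / 8.0f, exp - 7); // normal
    return sign ? -mag : mag;
}

// E4M3 FNUZ: bias 8, the only NaN is 0x80, no negative zero, no infinities.
float decode_e4m3fnuz(std::uint8_t bits) {
    if (bits == 0x80) {
        return std::numeric_limits<float>::quiet_NaN();
    }
    const int sign = bits >> 7;
    const int exp  = (bits >> 3) & 0xF;
    const int man  = bits & 0x7;
    const float mag = (exp == 0)
        ? std::ldexp(static_cast<float>(man) / 8.0f, -7)              // subnormal
        : std::ldexp(1.0f + static_cast<float>(man) / 8.0f, exp - 8); // normal
    return sign ? -mag : mag;
}
```

Under these definitions the pattern 0x80 is negative zero in the FN encoding but NaN in FNUZ, and the largest finite values differ (448 versus 240), so bit-level code written for one convention can produce NaN or out-of-range results under the other.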


05 Apr 19:45

UCHINO-Yuki

v2.0.17

0fb35ed

v2.0.17

Fix: correct a constant in find_max.hpp


01 Apr 03:54

UCHINO-Yuki

v2.0.16

49f2d43

v2.0.16

Modify test_flops.hpp

