Releases: RIKEN-RCCS/GEMMul8

v3.0.5

25 Jun 13:33

Update: test defaults and remove Hopper FP8 support

- Change the default number of test trials.
- Remove support for FP8-based emulation on Hopper architectures. INT8-based emulation remains supported on Hopper.

v3.0.4

11 Jun 04:45

There are no performance, algorithmic, or implementation changes in this release.

  • Print host compiler information during make
  • Add test-driver options for selecting BLAS parameters

v3.0.3

09 Jun 10:56

Fix: TRTRMM UPLO-dependent blocking

  • Fix an incorrect UPLO-dependent block-size selection in TRTRMM that caused inefficient block decomposition and degraded performance.
  • Fix TRTRMM scaling so that it no longer reads uninitialized regions when only a triangular part of the intermediate matrix is computed.
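
The second fix concerns touching only the part of a matrix that was actually computed. As a minimal sketch of that idea (not GEMMul8's actual kernel), the hypothetical helper scale_triangle below scales only the lower or upper triangle of a column-major matrix and leaves the other half untouched. The function name, the argument order, and the 'L'/'U' convention are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of GEMMul8): scale only the triangle of a
// column-major n x n matrix selected by uplo ('L' = lower, 'U' = upper).
// Entries outside that triangle are never read or written, so they may
// safely hold uninitialized data.
void scale_triangle(char uplo, int n, double* C, int ldc, double s) {
    assert(uplo == 'L' || uplo == 'U');
    assert(ldc >= n);
    for (int j = 0; j < n; ++j) {
        // Row range of column j that lies inside the selected triangle.
        const int i_begin = (uplo == 'L') ? j : 0;
        const int i_end   = (uplo == 'L') ? n : j + 1;
        for (int i = i_begin; i < i_end; ++i) {
            C[i + static_cast<std::size_t>(j) * ldc] *= s;
        }
    }
}
```

Restricting the loop bounds per column, rather than scaling the full matrix and ignoring half of the result, is what keeps the untouched half out of the computation.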

v3.0.2

08 Jun 12:14

Fix: Resolve HIPCC compilation errors

Fix HIP/Clang compilation failures in
- scaling kernels,
- mod kernels, and
- test programs.

v3.0.1

04 Jun 16:31

This release contains only small non-functional updates: adjustments to commented-out code, README updates, and minor test-program cleanup.
There are no performance, algorithmic, or implementation changes in this release.

v3.0.0

04 Jun 06:47

Major: Improve GEMM performance and add Level 3 BLAS/mixed-precision support

  • Improve the performance of the existing GEMM implementation:
    • gemmul8::gemm, gemmul8::gemmLt
  • Add support for the following Level 3 BLAS-like matrix operations:
    • SYMM (gemmul8::symm, gemmul8::symmLt)
    • SYRK (gemmul8::syrk, gemmul8::syrkLt)
    • SYR2K (gemmul8::syr2k, gemmul8::syr2kLt)
    • SYRKX (gemmul8::syrkx, gemmul8::syrkxLt)
    • HERK (gemmul8::herk, gemmul8::herkLt)
    • HER2K (gemmul8::her2k, gemmul8::her2kLt)
    • HERKX (gemmul8::herkx, gemmul8::herkxLt)
    • TRMM (gemmul8::trmm, gemmul8::trmmLt)
    • TRSM (gemmul8::trsm, gemmul8::trsmLt)
    • TRTRMM (gemmul8::trtrmm, gemmul8::trtrmmLt): triangular-by-triangular matrix multiplication
  • Add support for mixed-precision execution
  • Add workspace-query support by calling GEMMul8 routines with work == nullptr
  • Extend gemmul8::workSize to support the routines listed above except TRSM
  • Add gemmul8::workSizeTrsm for TRSM workspace-size calculation
  • Add TRSM block-size control APIs for the internal blocked algorithm:
    • gemmul8::set_block_size_trsm(int nB)
    • gemmul8::get_block_size_trsm()
  • Add overload (Hook Mode) support for the routines listed above
  • Add overload (Hook Mode) support for _64, 3m, and 3m_64 variants where applicable
  • Change the GEMMul8 routine argument type from unsigned num_moduli to int num_moduli
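
The workspace-query item above follows a common two-call pattern: call once with a null workspace to learn the required size, allocate, then call again with the buffer. The sketch below shows that pattern with a hypothetical stand-in, toy_routine. It is not a GEMMul8 signature, and this release note does not say how the real routines report the size, so the bytes out-parameter and the size formula are assumptions. The stand-in also takes a signed int num_moduli, which lets it reject negative values explicitly.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a routine with workspace-query support.
// work == nullptr : only report the required size through *bytes.
// work != nullptr : additionally use the buffer (here it is just filled).
// Returns 0 on success and -1 on invalid arguments.
int toy_routine(int n, int num_moduli, void* work, std::size_t* bytes) {
    assert(bytes != nullptr);
    // With a signed num_moduli, a negative input is detectable here
    // instead of silently wrapping to a large unsigned value.
    if (n < 0 || num_moduli < 1) return -1;
    // Made-up size formula, for illustration only.
    const std::size_t need = static_cast<std::size_t>(n) *
                             static_cast<std::size_t>(n) *
                             static_cast<std::size_t>(num_moduli);
    *bytes = need;
    if (work == nullptr) return 0;  // query only, no computation
    unsigned char* p = static_cast<unsigned char*>(work);
    for (std::size_t i = 0; i < need; ++i) p[i] = 1;
    return 0;
}

// Two-call pattern: query the size, allocate, then run with the buffer.
std::vector<unsigned char> run_with_query(int n, int num_moduli) {
    std::size_t bytes = 0;
    if (toy_routine(n, num_moduli, nullptr, &bytes) != 0) return {};
    std::vector<unsigned char> work(bytes);
    toy_routine(n, num_moduli, work.data(), &bytes);
    return work;
}
```

For the actual routines, gemmul8::workSize and gemmul8::workSizeTrsm are the documented ways to compute the size up front; consult the project README for their exact arguments.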

v2.0.19

06 Apr 06:27

Fix: correct CUDA regression introduced by FP8-FNUZ fix

Fixed a CUDA regression introduced by the previous FP8-FNUZ fix.

v2.0.18

06 Apr 04:34

Fix: resolve FP8-FNUZ bug on AMD CDNA3

On CDNA3, FP8 uses the FNUZ encoding, whose behavior differs from the commonly assumed FP8 definition. This difference could cause NaN generation during upward rounding.
This change avoids that issue.
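
For context, the sketch below decodes the same 8-bit pattern under two E4M3 conventions. It assumes that the "commonly assumed definition" is the OCP E4M3 (FN) encoding and that the CDNA3 format is E4M3FNUZ; the release note does not say which FP8 variant was affected or how the fix works. Under these conventions the largest finite values are 448 and 240 respectively, and the NaN patterns differ, so a value that is safe to round up in one format can leave the finite range in the other.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>

// Decode an E4M3 bit pattern with the given exponent bias.
// Subnormals (exponent field 0) have no implicit leading 1.
static double decode_e4m3(std::uint8_t b, int bias) {
    const int sign = b >> 7;
    const int exp  = (b >> 3) & 0xF;
    const int man  = b & 0x7;
    const double mag = (exp == 0)
        ? std::ldexp(man / 8.0, 1 - bias)
        : std::ldexp(1.0 + man / 8.0, exp - bias);
    return sign ? -mag : mag;
}

// OCP E4M3 (FN): bias 7, NaN is S.1111.111, no infinities.
double decode_e4m3fn(std::uint8_t b) {
    if ((b & 0x7F) == 0x7F) return std::numeric_limits<double>::quiet_NaN();
    return decode_e4m3(b, 7);
}

// E4M3FNUZ: bias 8, the only NaN is 0x80, no negative zero, no infinities.
double decode_e4m3fnuz(std::uint8_t b) {
    if (b == 0x80) return std::numeric_limits<double>::quiet_NaN();
    return decode_e4m3(b, 8);
}

// Small self-check of the facts stated in the text above.
void self_check() {
    assert(decode_e4m3fn(0x7E) == 448.0);      // largest finite FN value
    assert(std::isnan(decode_e4m3fn(0x7F)));   // FN NaN pattern
    assert(decode_e4m3fnuz(0x7F) == 240.0);    // largest finite FNUZ value
    assert(std::isnan(decode_e4m3fnuz(0x80))); // FNUZ NaN pattern
}
```

Note that every non-NaN pattern decodes to half the magnitude under FNUZ because of the larger bias, so code written against one convention cannot reuse its range constants for the other.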

v2.0.17

05 Apr 19:45

Fix: constant in find_max.hpp

v2.0.16

01 Apr 03:54

Update: modify test_flops.hpp