Ledir IM=0 DP matrix caching by CarlosPenaDePedro · Pull Request #409 · ecmwf-ifs/ectrans

CarlosPenaDePedro · 2026-05-28T15:23:27Z

Description

This PR adds persistent double-precision caches for the KM=0 DGEMM path in LEDIR.

Specifically, it introduces persistent allocatable arrays:

RPNMA_DGEMM
RPNMS_DGEMM

to avoid repeated allocations and recomputations in the DGEMM execution path for the KM=0 case. This is intended to reduce overhead and mitigate the load imbalance observed in this path.
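As a rough illustration of the caching idea described here, the sketch below contrasts a per-call conversion with a persistent double-precision copy filled once at setup. It is not the ecTrans source: the `FLT_STATE` type, the routine names and the array shapes are hypothetical, and the intrinsic `MATMUL` stands in for the BLAS `GEMM('T','N',...)` call so that the example has no external dependencies.

```fortran
! Illustrative sketch only, not the ecTrans source. The type, routine names and
! array shapes are hypothetical; MATMUL stands in for the BLAS GEMM call.
MODULE CACHE_SKETCH
  USE, INTRINSIC :: ISO_FORTRAN_ENV, ONLY: REAL32, REAL64
  IMPLICIT NONE

  TYPE :: FLT_STATE
    REAL(REAL32), ALLOCATABLE :: RPNMA(:,:)        ! stored single-precision matrix
    REAL(REAL64), ALLOCATABLE :: RPNMA_DGEMM(:,:)  ! persistent double-precision copy
  END TYPE FLT_STATE

CONTAINS

  ! Setup time: allocate and fill the double-precision cache once.
  SUBROUTINE SETUP_CACHE(S)
    TYPE(FLT_STATE), INTENT(INOUT) :: S
    IF (ALLOCATED(S%RPNMA_DGEMM)) DEALLOCATE(S%RPNMA_DGEMM)
    ALLOCATE(S%RPNMA_DGEMM(SIZE(S%RPNMA, 1), SIZE(S%RPNMA, 2)))
    S%RPNMA_DGEMM = REAL(S%RPNMA, REAL64)
  END SUBROUTINE SETUP_CACHE

  ! Without a cache: a temporary is allocated and converted on every call.
  SUBROUTINE APPLY_UNCACHED(S, ZB, ZC)
    TYPE(FLT_STATE), INTENT(IN) :: S
    REAL(REAL64), INTENT(IN) :: ZB(:,:)
    REAL(REAL64), INTENT(OUT) :: ZC(:,:)
    REAL(REAL64), ALLOCATABLE :: ZTMP(:,:)
    ALLOCATE(ZTMP(SIZE(S%RPNMA, 1), SIZE(S%RPNMA, 2)))
    ZTMP = REAL(S%RPNMA, REAL64)
    ZC = MATMUL(TRANSPOSE(ZTMP), ZB)
  END SUBROUTINE APPLY_UNCACHED

  ! With a cache: the per-call work is only the matrix product.
  SUBROUTINE APPLY_CACHED(S, ZB, ZC)
    TYPE(FLT_STATE), INTENT(IN) :: S
    REAL(REAL64), INTENT(IN) :: ZB(:,:)
    REAL(REAL64), INTENT(OUT) :: ZC(:,:)
    ZC = MATMUL(TRANSPOSE(S%RPNMA_DGEMM), ZB)
  END SUBROUTINE APPLY_CACHED

END MODULE CACHE_SKETCH

PROGRAM CACHE_DEMO
  USE CACHE_SKETCH
  IMPLICIT NONE
  TYPE(FLT_STATE) :: S
  REAL(REAL64) :: ZB(4,3), ZC1(2,3), ZC2(2,3)

  ALLOCATE(S%RPNMA(4,2))
  CALL RANDOM_NUMBER(S%RPNMA)
  CALL RANDOM_NUMBER(ZB)

  CALL SETUP_CACHE(S)
  CALL APPLY_UNCACHED(S, ZB, ZC1)
  CALL APPLY_CACHED(S, ZB, ZC2)

  ! Both paths multiply the same double-precision values, so they should agree.
  IF (MAXVAL(ABS(ZC1 - ZC2)) > 1.0E-12_REAL64) ERROR STOP 'cached and uncached results differ'
  PRINT *, 'cached and uncached results agree'
END PROGRAM CACHE_DEMO
```

In this sketch the allocation and the single-to-double conversion move out of the per-call routine, which is the overhead the cache is meant to remove.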

More information is available in issue #392.

The change passes the GitHub Actions test suite.

Contributor Declaration

By opening this pull request, I affirm the following:

All authors agree to the Contributor License Agreement.
The code follows the project's coding standards.
I have performed self-review and added comments where needed.
I have added or updated tests to verify that my changes are effective and functional.
I have run all existing tests and confirmed they pass.

samhatfield · 2026-05-28T23:48:17Z

Thanks for this amazing work @CarlosPenaDePedro - we will start the review when we have time, but definitely well before the October deadline you mention 🙏🏻

Copilot

Pull request overview

This PR optimizes the single-precision KM=0 LEDIR DGEMM path by caching double-precision Legendre polynomial matrices at setup time, reducing repeated allocation and SP→DP conversion overhead.

Changes:

Adds persistent RPNMA_DGEMM and RPNMS_DGEMM caches to the CPU FLT resolution wrapper.
Populates the caches in SULEG for IM == 0 when running single precision.
Updates LEDIR to use the cached double-precision matrices in the KM=0 DGEMM path.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| `src/trans/cpu/internal/tpm_flt.F90` | Adds persistent double-precision cache arrays to FLT state. |
| `src/trans/cpu/internal/suleg_mod.F90` | Allocates and fills the KM/IM=0 DGEMM caches during Legendre setup. |
| `src/trans/cpu/internal/ledir_mod.F90` | Replaces per-call matrix allocation/conversion with cached double-precision matrices. |

samhatfield · 2026-05-28T23:55:48Z

-             END DO
-             CALL GEMM('T','N',ILA,KIFC,KDGLU,1.0_JPRD,ZRPNMA,KDGLU,&
+
+             CALL GEMM('T','N',ILA,KIFC,KDGLU,1.0_JPRD,S%RPNMA_DGEMM,KDGLU,&


Oof, and that's why we use Copilot. I will take a look at this.

-             CALL GEMM('T','N',ILS,KIFC,KDGLU,1.0_JPRD,ZRPNMS,KDGLU,&
-                  &ZB_D,KDGLU,0._JPRD,ZCS_D,ILS)
+
+             CALL GEMM('T','N',ILS,KIFC,KDGLU,1.0_JPRD,S%RPNMS_DGEMM,KDGLU,&


samhatfield · 2026-05-29T01:18:32Z

Don't worry about the failing tests. Once #407 is merged, you can then rebase this again against develop and that should resolve them.

samhatfield · 2026-06-03T05:12:23Z

Could you rebase against develop again @CarlosPenaDePedro? That should fix the failing tests.

samhatfield · 2026-06-03T06:39:09Z

Todo:

Rebase against develop
Approve the ECMWF contributions to Ledir_IM0_caching_rebased in CarlosPenaDePedro/ectrans#1, which reduce repeated code and apply some style fixes to LEDIR
Check all comments above
Verify that all checks pass
For @samhatfield, check potential clash with READ_LEGPOL code path

samhatfield · 2026-06-04T01:50:45Z

All tests passed. Well done @CarlosPenaDePedro, you are off the critical path and can now relax 😝

I still need to take a closer look at the codepath for reading / writing the Legendre polynomials to see if there's a clash.

Also, I'd like to repeat my earlier benchmarking as a sanity check. Load balance is one thing, but it would be good to check that there is no unexpected negative impact on overall wall times etc. I'm travelling at the moment, but hopefully I'll find some time next week.

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.

+          CALL GEMM('T', 'N', ILA, KIFC, KDGLU, 1.0_JPRD, S%RPNMA_DGEMM, KDGLU, ZB_D, KDGLU, &
+              &       0._JPRD, ZCA_D, ILA)


+          CALL GEMM('T', 'N', ILS, KIFC, KDGLU, 1.0_JPRD, S%RPNMS_DGEMM, KDGLU, ZB_D, KDGLU, &
+            &       0._JPRD, ZCS_D, ILS)


samhatfield · 2026-06-09T08:58:08Z

The issue Copilot identifies above is the following:

ecTrans can read pre-computed Legendre polynomials from file or a memory buffer. In this case, the computation of the polynomials is skipped. That includes the allocation and initialisation of S%RPNMA_DGEMM and S%RPNMS_DGEMM. In that case, LEDIR will segfault because those arrays are still accessed.

The proper fix is to make sure the cached arrays are initialised even when polynomials are read from file or buffer. Rather than have the double-precision IM=0 arrays written to and read from disk, I think we should simply copy the single-precision polynomials which are read into those cached arrays. This will incur a single -> double upcast which is different to what we have now, where the double-precision source arrays for the polynomials are stored directly in the cache. Then, we only have to make a slight modification in READ_LEGPOL_MOD.
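A minimal sketch of that idea follows, assuming a single helper that fills the caches from the single-precision arrays and is called from every setup path. The `FLT_STATE` type, the `FILL_DGEMM_CACHES` routine and the array shapes are hypothetical and do not reflect the actual layout of `SULEG` or `READ_LEGPOL_MOD`; the "read from file" step is simulated by assigning the single-precision arrays directly.

```fortran
! Illustrative sketch only, not the ecTrans source. FLT_STATE and
! FILL_DGEMM_CACHES are hypothetical names used for this example.
MODULE DGEMM_CACHE_SKETCH
  USE, INTRINSIC :: ISO_FORTRAN_ENV, ONLY: REAL32, REAL64
  IMPLICIT NONE

  TYPE :: FLT_STATE
    REAL(REAL32), ALLOCATABLE :: RPNMA(:,:), RPNMS(:,:)              ! single precision
    REAL(REAL64), ALLOCATABLE :: RPNMA_DGEMM(:,:), RPNMS_DGEMM(:,:)  ! cached upcasts
  END TYPE FLT_STATE

CONTAINS

  ! Call once RPNMA/RPNMS are filled, whether they were computed or read in.
  SUBROUTINE FILL_DGEMM_CACHES(S)
    TYPE(FLT_STATE), INTENT(INOUT) :: S

    IF (.NOT. (ALLOCATED(S%RPNMA) .AND. ALLOCATED(S%RPNMS))) THEN
      ERROR STOP 'single-precision polynomials are not available'
    END IF

    IF (ALLOCATED(S%RPNMA_DGEMM)) DEALLOCATE(S%RPNMA_DGEMM)
    IF (ALLOCATED(S%RPNMS_DGEMM)) DEALLOCATE(S%RPNMS_DGEMM)
    ALLOCATE(S%RPNMA_DGEMM(SIZE(S%RPNMA, 1), SIZE(S%RPNMA, 2)))
    ALLOCATE(S%RPNMS_DGEMM(SIZE(S%RPNMS, 1), SIZE(S%RPNMS, 2)))

    ! Single -> double upcast of the stored polynomials.
    S%RPNMA_DGEMM = REAL(S%RPNMA, REAL64)
    S%RPNMS_DGEMM = REAL(S%RPNMS, REAL64)
  END SUBROUTINE FILL_DGEMM_CACHES

END MODULE DGEMM_CACHE_SKETCH

PROGRAM DGEMM_CACHE_DEMO
  USE DGEMM_CACHE_SKETCH
  IMPLICIT NONE
  TYPE(FLT_STATE) :: S

  ! Stand-in for polynomials read from file or buffer: only the
  ! single-precision arrays are filled, as in the problematic code path.
  ALLOCATE(S%RPNMA(5,3), S%RPNMS(5,2))
  S%RPNMA = 0.1_REAL32
  S%RPNMS = 0.2_REAL32

  CALL FILL_DGEMM_CACHES(S)

  ! The caches now exist, so a later consumer would not touch unallocated arrays.
  IF (.NOT. (ALLOCATED(S%RPNMA_DGEMM) .AND. ALLOCATED(S%RPNMS_DGEMM))) ERROR STOP 'caches missing'

  ! The upcast is exact, so rounding back to single precision recovers the input.
  IF (ANY(REAL(S%RPNMA_DGEMM, REAL32) /= S%RPNMA)) ERROR STOP 'RPNMA round trip failed'
  IF (ANY(REAL(S%RPNMS_DGEMM, REAL32) /= S%RPNMS)) ERROR STOP 'RPNMS round trip failed'

  PRINT *, 'caches filled from single-precision arrays'
END PROGRAM DGEMM_CACHE_DEMO
```

Routing every setup path through one helper of this kind would mean nothing extra has to be written to or read from disk, since the caches are derived entirely from the single-precision arrays.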

It's actually closer to what we do in the develop branch. There in LEDIR we copy the stored single-precision polynomials into a double-precision buffer (so again, a single -> double upcast). This works, and implies that the problem with the IM=0 mode is not that the polynomials are too low in precision but rather the GEMM itself is too low in precision.
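The general effect behind this argument can be illustrated with a toy dot product: the inputs are rounded to single precision in both cases, and only the precision of the multiply-accumulate differs. This is a generic floating-point sketch with random data and hypothetical variable names, not a measurement of ecTrans or of the IM=0 transform, and the intrinsic `DOT_PRODUCT` stands in for GEMM.

```fortran
! Illustrative sketch only: compares accumulation precision for inputs that
! have already been rounded to single precision. Not ecTrans code.
PROGRAM ACCUMULATION_DEMO
  USE, INTRINSIC :: ISO_FORTRAN_ENV, ONLY: REAL32, REAL64
  IMPLICIT NONE
  INTEGER, PARAMETER :: N = 100000
  REAL(REAL64), ALLOCATABLE :: XD(:), YD(:)
  REAL(REAL32), ALLOCATABLE :: XS(:), YS(:)
  REAL(REAL64) :: ZREF, ZUP
  REAL(REAL32) :: ZSP

  ALLOCATE(XD(N), YD(N), XS(N), YS(N))
  CALL RANDOM_NUMBER(XD)
  CALL RANDOM_NUMBER(YD)

  ! Round the double-precision source data to single precision.
  XS = REAL(XD, REAL32)
  YS = REAL(YD, REAL32)

  ZREF = DOT_PRODUCT(XD, YD)                              ! DP inputs, DP accumulation
  ZSP  = DOT_PRODUCT(XS, YS)                              ! SP inputs, SP accumulation
  ZUP  = DOT_PRODUCT(REAL(XS, REAL64), REAL(YS, REAL64))  ! SP inputs upcast, DP accumulation

  PRINT *, 'relative error, SP inputs and SP accumulation:', &
    & ABS(REAL(ZSP, REAL64) - ZREF) / ABS(ZREF)
  PRINT *, 'relative error, SP inputs and DP accumulation:', &
    & ABS(ZUP - ZREF) / ABS(ZREF)

  ! Each rounded input carries a relative error of at most about 6E-8, so each
  ! product is off by at most about 1.2E-7. All terms are non-negative, so the
  ! double-precision sum stays within roughly that bound.
  IF (ABS(ZUP - ZREF) > 2.0E-7_REAL64 * ABS(ZREF)) ERROR STOP 'upcast result outside expected bound'
END PROGRAM ACCUMULATION_DEMO
```

The error of the upcast variant is bounded by the input rounding alone, whereas the single-precision accumulation adds rounding at every step, so its error depends on the problem size and on how the compiler orders the sum.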

CarlosPenaDePedro · 2026-06-09T09:21:36Z

In that case, to keep the behaviour consistent across setup paths, we could also populate S%RPNMA_DGEMM and S%RPNMS_DGEMM in SULEG by upcasting the already stored single-precision RPNMA/RPNMS arrays, instead of filling them directly from the double-precision source values. This would make the SULEG and READ_LEGPOL paths behave the same way, and should also be closer to the previous implementation, where LEDIR promoted the single-precision matrix to double precision at runtime.

samhatfield · 2026-06-09T09:34:32Z

Yes, that's exactly what I was thinking.

This is consistent with the current behaviour in develop.

github-actions Bot added the contributor label May 28, 2026

CarlosPenaDePedro changed the title ~~Ledir legendre IM 0 DP caching first imp~~ Ledir IM=0 DP matrix caching May 28, 2026

samhatfield requested review from Copilot and samhatfield May 28, 2026 23:48

samhatfield added the approved-for-ci label May 28, 2026

Copilot started reviewing on behalf of samhatfield May 28, 2026 23:48

Copilot AI reviewed May 28, 2026

samhatfield reviewed May 29, 2026

Comment thread src/trans/cpu/internal/suleg_mod.F90 Outdated

github-actions Bot removed the approved-for-ci label May 29, 2026

samhatfield reviewed Jun 3, 2026

Comment thread src/trans/cpu/internal/suleg_mod.F90 Outdated

samhatfield reviewed Jun 3, 2026

Comment thread src/trans/cpu/internal/suleg_mod.F90 Outdated

samhatfield reviewed Jun 3, 2026

Comment thread src/trans/cpu/internal/suleg_mod.F90 Outdated

samhatfield reviewed Jun 3, 2026

Comment thread src/trans/cpu/internal/suleg_mod.F90 Outdated

samhatfield reviewed Jun 3, 2026

Comment thread src/trans/cpu/internal/suleg_mod.F90 Outdated

samhatfield reviewed Jun 3, 2026

Comment thread src/trans/cpu/internal/suleg_mod.F90 Outdated

samhatfield reviewed Jun 3, 2026

Comment thread src/trans/cpu/internal/suleg_mod.F90 Outdated

samhatfield reviewed Jun 3, 2026

Comment thread src/trans/cpu/internal/suleg_mod.F90 Outdated

CarlosPenaDePedro added 3 commits June 3, 2026 09:30

Ledir legendre IM 0 DP caching first imp

9c14de1

deallocate persistent DGEMM cache matrices

16fcb23

Remove unnecessary DGEMM cache reallocation checks

2c238a5

CarlosPenaDePedro force-pushed the Ledir_IM0_caching_rebased branch from 09880fd to 2c238a5 Compare June 3, 2026 07:34

CarlosPenaDePedro and others added 5 commits June 3, 2026 09:40

Apply suggestions from code review

9d0c027

Remove unnecessary caching in the FLT path and remove unnecessary casts

Co-authored-by: Sam Hatfield <samuel.hatfield@ecmwf.int>

Introduce PACK_FOR_GEMM to minimise repeated code

d877dc9

Tweak code formatting

54cc73f

Don't calculate ISKIP twice

b07112b

Declare explicit shape for POUT

e327731

samhatfield added 3 commits June 4, 2026 00:18

Fix missing type error

9207d72

Tidy up further

8b8984e

Fix indents

a6a85a9

github-actions Bot assigned marsdeno Jun 4, 2026

github-actions Bot requested review from marsdeno and samhatfield June 4, 2026 00:19

samhatfield added the approved-for-ci label Jun 4, 2026

samhatfield reviewed Jun 4, 2026

Comment thread src/trans/cpu/internal/suleg_mod.F90 Outdated

samhatfield reviewed Jun 4, 2026

Comment thread src/trans/cpu/internal/suleg_mod.F90 Outdated

samhatfield reviewed Jun 4, 2026

Comment thread src/trans/cpu/internal/suleg_mod.F90 Outdated

Remove empty lines

183b502

Co-authored-by: Sam Hatfield <samuel.hatfield@ecmwf.int>

github-actions Bot removed the approved-for-ci label Jun 4, 2026

github-actions Bot requested a review from samhatfield June 4, 2026 00:22

samhatfield added the enhancement New feature or request label Jun 4, 2026

samhatfield requested a review from Copilot June 9, 2026 08:17

Copilot started reviewing on behalf of samhatfield June 9, 2026 08:17

Copilot AI reviewed Jun 9, 2026

Prevent unnecessary allocation of IM=0 DP polys

a2d5db4

samhatfield added 2 commits June 9, 2026 09:46

Create DGEMM cached polynomials from SP version

3a16ea9

This is consistent with the current behaviour in develop.

Simplify initialisation of DGEMM poly caches

bcecc2a

samhatfield mentioned this pull request Jun 9, 2026

Add test for writing and reading Legendre polynomials from disk #413

Open

4 participants
