Skip to content

Ledir IM=0 DP matrix caching #409

Open
CarlosPenaDePedro wants to merge 15 commits into
ecmwf-ifs:developfrom
CarlosPenaDePedro:Ledir_IM0_caching_rebased
Open

Ledir IM=0 DP matrix caching #409
CarlosPenaDePedro wants to merge 15 commits into
ecmwf-ifs:developfrom
CarlosPenaDePedro:Ledir_IM0_caching_rebased

Conversation

@CarlosPenaDePedro

Copy link
Copy Markdown

Description

This PR adds persistent double-precision caches for the KM=0 DGEMM path in LEDIR.

Specifically, it introduces persistent allocatable arrays:

RPNMA_DGEMM
RPNMS_DGEMM

to avoid repeated allocations/recomputations in the DGEMM execution path for the KM=0 case. This is intended to reduce overhead and help improve load imbalance observed in this path.

More information in issue #392

The change passes the GitHub Actions test suite.

Contributor Declaration

By opening this pull request, I affirm the following:

  • All authors agree to the Contributor License Agreement.
  • The code follows the project's coding standards.
  • I have performed self-review and added comments where needed.
  • I have added or updated tests to verify that my changes are effective and functional.
  • I have run all existing tests and confirmed they pass.

@CarlosPenaDePedro CarlosPenaDePedro changed the title Ledir legendre IM 0 DP caching first imp Ledir IM=0 DP matrix caching May 28, 2026
@samhatfield

Copy link
Copy Markdown
Collaborator

Thanks for this amazing work @CarlosPenaDePedro - we will start the review when we have time, but definitely well before the October deadline you mention 🙏🏻

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR optimizes the single-precision KM=0 LEDIR DGEMM path by caching double-precision Legendre polynomial matrices at setup time, reducing repeated allocation and SP→DP conversion overhead.

Changes:

  • Adds persistent RPNMA_DGEMM and RPNMS_DGEMM caches to the CPU FLT resolution wrapper.
  • Populates the caches in SULEG for IM == 0 when running single precision.
  • Updates LEDIR to use the cached double-precision matrices in the KM=0 DGEMM path.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
src/trans/cpu/internal/tpm_flt.F90 Adds persistent double-precision cache arrays to FLT state.
src/trans/cpu/internal/suleg_mod.F90 Allocates and fills the KM/IM=0 DGEMM caches during Legendre setup.
src/trans/cpu/internal/ledir_mod.F90 Replaces per-call matrix allocation/conversion with cached double-precision matrices.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/trans/cpu/internal/ledir_mod.F90 Outdated
END DO
CALL GEMM('T','N',ILA,KIFC,KDGLU,1.0_JPRD,ZRPNMA,KDGLU,&

CALL GEMM('T','N',ILA,KIFC,KDGLU,1.0_JPRD,S%RPNMA_DGEMM,KDGLU,&

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oof, and that's why we use Copilot. I will take a look at this.

Comment thread src/trans/cpu/internal/ledir_mod.F90 Outdated
CALL GEMM('T','N',ILS,KIFC,KDGLU,1.0_JPRD,ZRPNMS,KDGLU,&
&ZB_D,KDGLU,0._JPRD,ZCS_D,ILS)

CALL GEMM('T','N',ILS,KIFC,KDGLU,1.0_JPRD,S%RPNMS_DGEMM,KDGLU,&
Comment thread src/trans/cpu/internal/tpm_flt.F90
@samhatfield

Copy link
Copy Markdown
Collaborator

Don't worry about the failing tests. Once #407 is merged, you can then rebase this again against develop and that should resolve them.

Comment thread src/trans/cpu/internal/suleg_mod.F90 Outdated
@samhatfield

Copy link
Copy Markdown
Collaborator

Could you rebase against develop again @CarlosPenaDePedro? That should fix the failing tests.

Comment thread src/trans/cpu/internal/suleg_mod.F90 Outdated
Comment thread src/trans/cpu/internal/suleg_mod.F90 Outdated
Comment thread src/trans/cpu/internal/suleg_mod.F90 Outdated
Comment thread src/trans/cpu/internal/suleg_mod.F90 Outdated
Comment thread src/trans/cpu/internal/suleg_mod.F90 Outdated
Comment thread src/trans/cpu/internal/suleg_mod.F90 Outdated
Comment thread src/trans/cpu/internal/suleg_mod.F90 Outdated
Comment thread src/trans/cpu/internal/suleg_mod.F90 Outdated
@samhatfield

samhatfield commented Jun 3, 2026

Copy link
Copy Markdown
Collaborator

Todo:

@CarlosPenaDePedro CarlosPenaDePedro force-pushed the Ledir_IM0_caching_rebased branch from 09880fd to 2c238a5 Compare June 3, 2026 07:34
CarlosPenaDePedro and others added 5 commits June 3, 2026 09:40
Remove unnecessary caching in the FLT path and remove unnecessary casts

Co-authored-by: Sam Hatfield <samuel.hatfield@ecmwf.int>
Comment thread src/trans/cpu/internal/suleg_mod.F90 Outdated
Comment thread src/trans/cpu/internal/suleg_mod.F90 Outdated
Comment thread src/trans/cpu/internal/suleg_mod.F90 Outdated
Co-authored-by: Sam Hatfield <samuel.hatfield@ecmwf.int>
@samhatfield

Copy link
Copy Markdown
Collaborator

All tests passed. Well done @CarlosPenaDePedro, you are off the critical path and can now relax 😝

I still need to take a closer look at the codepath for reading / writing the Legendre polynomials to see if there's a clash.

Also I'd like to repeat my benchmarking from earlier as a sanity check. Load balance is one thing, but it would be good to check that there is no negative unexpected impact on overall wall times etc. I'm travelling at the moment but hopefully next week I'll find some time.

@samhatfield samhatfield added the enhancement New feature or request label Jun 4, 2026
@samhatfield samhatfield requested a review from Copilot June 9, 2026 08:17

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.

Comment on lines +134 to +135
CALL GEMM('T', 'N', ILA, KIFC, KDGLU, 1.0_JPRD, S%RPNMA_DGEMM, KDGLU, ZB_D, KDGLU, &
& 0._JPRD, ZCA_D, ILA)
Comment on lines +194 to +195
CALL GEMM('T', 'N', ILS, KIFC, KDGLU, 1.0_JPRD, S%RPNMS_DGEMM, KDGLU, ZB_D, KDGLU, &
& 0._JPRD, ZCS_D, ILS)
Comment thread src/trans/cpu/internal/suleg_mod.F90 Outdated
Comment thread src/trans/cpu/internal/suleg_mod.F90 Outdated
Comment thread src/trans/cpu/internal/suleg_mod.F90 Outdated
Comment thread src/trans/cpu/internal/suleg_mod.F90 Outdated
@samhatfield

Copy link
Copy Markdown
Collaborator

The issue Copilot identifies above is the following:

ecTrans can read pre-computed Legendre polynomials from file or a memory buffer. In this case, the computation of the polynomials is skipped. That includes the allocation and initialisation of S%RPNMA_DGEMM and S%RPNMS_DGEMM. In that case, LEDIR will segfault because those arrays are still accessed.

The proper fix is to make sure the cached arrays are initialised even when polynomials are read from file or buffer. Rather than have the double-precision IM=0 arrays written to and read from disk, I think we should simply copy the single-precision polynomials which are read into those cached arrays. This will incur a single -> double upcast which is different to what we have now, where the double-precision source arrays for the polynomials are stored directly in the cache. Then, we only have to make a slight modification in READ_LEGPOL_MOD.

It's actually closer to what we do in the develop branch. There in LEDIR we copy the stored single-precision polynomials into a double-precision buffer (so again, a single -> double upcast). This works, and implies that the problem with the IM=0 mode is not that the polynomials are too low in precision but rather the GEMM itself is too low in precision.

@CarlosPenaDePedro

Copy link
Copy Markdown
Author

In that case we could to keep the behaviour consistent across setup paths, we could also populate the SULEG S%RPNMA_DGEMM and S%RPNMS_DGEMM by upcasting the already stored single-precision RPNMA/RPNMS arrays, instead of filling it directly from the double-precision source values. This would make the SULEG and READ_LEGPOL paths behave the same way, and should also be closer to the previous implementation, where LEDIR promoted the single-precision matrix to double at runtime.

The issue Copilot identifies above is the following:

ecTrans can read pre-computed Legendre polynomials from file or a memory buffer. In this case, the computation of the polynomials is skipped. That includes the allocation and initialisation of S%RPNMA_DGEMM and S%RPNMS_DGEMM. In that case, LEDIR will segfault because those arrays are still accessed.

The proper fix is to make sure the cached arrays are initialised even when polynomials are read from file or buffer. Rather than have the double-precision IM=0 arrays written to and read from disk, I think we should simply copy the single-precision polynomials which are read into those cached arrays. This will incur a single -> double upcast which is different to what we have now, where the double-precision source arrays for the polynomials are stored directly in the cache. Then, we only have to make a slight modification in READ_LEGPOL_MOD.

It's actually closer to what we do in the develop branch. There in LEDIR we copy the stored single-precision polynomials into a double-precision buffer (so again, a single -> double upcast). This works, and implies that the problem with the IM=0 mode is not that the polynomials are too low in precision but rather the GEMM itself is too low in precision.

@samhatfield

Copy link
Copy Markdown
Collaborator

In that case we could to keep the behaviour consistent across setup paths, we could also populate the SULEG S%RPNMA_DGEMM and S%RPNMS_DGEMM by upcasting the already stored single-precision RPNMA/RPNMS arrays, instead of filling it directly from the double-precision source values. This would make the SULEG and READ_LEGPOL paths behave the same way, and should also be closer to the previous implementation, where LEDIR promoted the single-precision matrix to double at runtime.

Yes, that's exactly what I was thinking.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

contributor enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants