Ledir IM=0 DP matrix caching #409
Conversation
|
Thanks for this amazing work @CarlosPenaDePedro - we will start the review when we have time, but definitely well before the October deadline you mention 🙏🏻 |
There was a problem hiding this comment.
Pull request overview
This PR optimizes the single-precision KM=0 LEDIR DGEMM path by caching double-precision Legendre polynomial matrices at setup time, reducing repeated allocation and SP→DP conversion overhead.
Changes:
- Adds persistent
RPNMA_DGEMMandRPNMS_DGEMMcaches to the CPU FLT resolution wrapper. - Populates the caches in
SULEGforIM == 0when running single precision. - Updates
LEDIRto use the cached double-precision matrices in theKM=0DGEMM path.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
src/trans/cpu/internal/tpm_flt.F90 |
Adds persistent double-precision cache arrays to FLT state. |
src/trans/cpu/internal/suleg_mod.F90 |
Allocates and fills the KM/IM=0 DGEMM caches during Legendre setup. |
src/trans/cpu/internal/ledir_mod.F90 |
Replaces per-call matrix allocation/conversion with cached double-precision matrices. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| END DO | ||
| CALL GEMM('T','N',ILA,KIFC,KDGLU,1.0_JPRD,ZRPNMA,KDGLU,& | ||
|
|
||
| CALL GEMM('T','N',ILA,KIFC,KDGLU,1.0_JPRD,S%RPNMA_DGEMM,KDGLU,& |
There was a problem hiding this comment.
Oof, and that's why we use Copilot. I will take a look at this.
| CALL GEMM('T','N',ILS,KIFC,KDGLU,1.0_JPRD,ZRPNMS,KDGLU,& | ||
| &ZB_D,KDGLU,0._JPRD,ZCS_D,ILS) | ||
|
|
||
| CALL GEMM('T','N',ILS,KIFC,KDGLU,1.0_JPRD,S%RPNMS_DGEMM,KDGLU,& |
|
Don't worry about the failing tests. Once #407 is merged, you can then rebase this again against develop and that should resolve them. |
|
Could you rebase against develop again @CarlosPenaDePedro? That should fix the failing tests. |
|
Todo:
|
09880fd to
2c238a5
Compare
Remove unnecessary caching in the FLT path and remove unnecessary casts Co-authored-by: Sam Hatfield <samuel.hatfield@ecmwf.int>
Co-authored-by: Sam Hatfield <samuel.hatfield@ecmwf.int>
|
All tests passed. Well done @CarlosPenaDePedro, you are off the critical path and can now relax 😝 I still need to take a closer look at the codepath for reading / writing the Legendre polynomials to see if there's a clash. Also I'd like to repeat my benchmarking from earlier as a sanity check. Load balance is one thing, but it would be good to check that there is no negative unexpected impact on overall wall times etc. I'm travelling at the moment but hopefully next week I'll find some time. |
| CALL GEMM('T', 'N', ILA, KIFC, KDGLU, 1.0_JPRD, S%RPNMA_DGEMM, KDGLU, ZB_D, KDGLU, & | ||
| & 0._JPRD, ZCA_D, ILA) |
| CALL GEMM('T', 'N', ILS, KIFC, KDGLU, 1.0_JPRD, S%RPNMS_DGEMM, KDGLU, ZB_D, KDGLU, & | ||
| & 0._JPRD, ZCS_D, ILS) |
|
The issue Copilot identifies above is the following: ecTrans can read pre-computed Legendre polynomials from file or a memory buffer. In this case, the computation of the polynomials is skipped. That includes the allocation and initialisation of The proper fix is to make sure the cached arrays are initialised even when polynomials are read from file or buffer. Rather than have the double-precision IM=0 arrays written to and read from disk, I think we should simply copy the single-precision polynomials which are read into those cached arrays. This will incur a single -> double upcast which is different to what we have now, where the double-precision source arrays for the polynomials are stored directly in the cache. Then, we only have to make a slight modification in It's actually closer to what we do in the develop branch. There in |
|
In that case we could to keep the behaviour consistent across setup paths, we could also populate the SULEG
|
Yes, that's exactly what I was thinking. |
This is consistent with the current behaviour in develop.
Description
This PR adds persistent double-precision caches for the KM=0 DGEMM path in LEDIR.
Specifically, it introduces persistent allocatable arrays:
RPNMA_DGEMM
RPNMS_DGEMM
to avoid repeated allocations/recomputations in the DGEMM execution path for the KM=0 case. This is intended to reduce overhead and help improve load imbalance observed in this path.
More information in issue #392
The change passes the GitHub Actions test suite.
Contributor Declaration
By opening this pull request, I affirm the following: