Conversation

@Technici4n (Collaborator) commented on Apr 30, 2025

Still super super WIP.

The problem lies in how the P matrices are stored in the nonlocal term. Each column of a P matrix corresponds to a projector (and each psp usually has multiple projectors). Multiple atoms with the same psp have the same projectors, up to a structure factor $e^{-i(G+k)\cdot r_{\text{atom}}}$.

The basic idea in this PR is to store the projectors without the phase factor and to apply the structure factor on the fly. The cleanest way I could find to do this is a custom matrix-like type for P, so that from the outside P still looks like a dense matrix.

Not sure yet what the best way to apply the structure factor on the fly would be: either we use a temporary vector to hold ψ[:,iband] .* structure_factor, or we write an explicit loop, but then the question is whether that works on the GPU. A minimal sketch of the temporary-vector option is shown below.
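The sketch assumes phase-free projectors stored column-wise and a preallocated scratch vector; all names (projectors, structure_factor, ψk, scratch) are hypothetical, not DFTK's actual API.

```julia
using LinearAlgebra

# Form C = P' * ψk without materializing the phased projectors: each column of
# `projectors` is a phase-free projector; the structure factor is applied into
# a scratch vector on the fly.
function apply_projectors_adjoint!(C, projectors, structure_factor, ψk, scratch)
    for (iproj, proj) in enumerate(eachcol(projectors))
        scratch .= structure_factor .* proj            # apply the phase on the fly
        for iband in axes(ψk, 2)
            # dot conjugates its first argument, matching (structure_factor .* proj)' * ψk[:, iband]
            C[iproj, iband] = dot(scratch, @view ψk[:, iband])
        end
    end
    C
end
```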

Partially fixes #1032.

TODO:

  • Actually fix the issue and make the code work.
  • Check that the CPU performance and memory allocations are still fine.
  • Check that the memory usage went down.
  • Check that the GPU performance is still fine.

@Technici4n (Collaborator, Author):

Here is an example showing the problem:

Take oxygen from PseudoDojo: its pseudopotential has 13 projectors. Now assume we perform a computation with roughly $10^4$ plane waves per $k$-point and $10^3$ $k$-points.

Each P matrix then has $13 \cdot 10^4$ entries. With $10^3$ such matrices (one per $k$-point), and since each entry is a complex number requiring 16 bytes, the total memory usage is roughly 2 gigabytes.

In the same setting but with 8 oxygen atoms, the memory usage grows to 16 gigabytes. With this PR, the goal is that it should remain around 2 gigabytes.
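A quick back-of-the-envelope check of these numbers, assuming ComplexF64 entries:

```julia
nproj, n_G, n_kpt, n_atoms = 13, 10^4, 10^3, 8
bytes_per_entry = sizeof(ComplexF64)                         # 16 bytes

mem_one_atom  = nproj * n_G * n_kpt * bytes_per_entry / 1e9  # ≈ 2.1 GB
mem_all_atoms = n_atoms * mem_one_atom                       # ≈ 16.6 GB
```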

@antoine-levitt (Member):

So this is a performance/memory tradeoff. I did it this way because it lets you use full BLAS-3 operations. I remember something like this gave a huge boost in abinit, but possibly it was BLAS-3-ifying the operation on all the bands at once that gave the performance boost (as opposed to band by band, as it was done before). So I'm not opposed to switching, but keep in mind the performance implications and benchmark it.
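To illustrate the tradeoff with hypothetical sizes: with the full P stored per k-point, applying the projectors to all bands at once is a single GEMM (BLAS 3), while the band-by-band variant becomes a series of GEMV calls (BLAS 2).

```julia
using LinearAlgebra

P  = randn(ComplexF64, 10_000, 13)   # plane waves × projectors
ψk = randn(ComplexF64, 10_000, 32)   # plane waves × bands

C_blas3 = P' * ψk                                                     # one GEMM over all bands
C_blas2 = reduce(hcat, [P' * @view(ψk[:, n]) for n in axes(ψk, 2)])   # one GEMV per band
@assert C_blas3 ≈ C_blas2
```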

@antoine-levitt (Member):

@Technici4n (Collaborator, Author):

Nice reference, maybe that lets me find how abinit does it. We also need to be careful not to destroy GPU performance. 😓

@antoine-levitt (Member):

If you want to look, it's in the opernla routine (I worked on that code a long time ago; it still haunts my dreams).

@mfherbst (Member) left a comment:

Nice. Yeah, the BLAS-2 versus BLAS-3 tradeoff is something to understand a little better with benchmarks before we commit to one way of doing this.

for proj in eachcol(p.projectors)
    # TODO: what allocates here? and does it use BLAS?
    ψ_scratch .= p.structure_factors .* proj
    C[iproj, :] .= dropdims(ψ_scratch' * ψk; dims=1)
Member:

The dropdims is weird. Is this not just a dot product ?
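A possible rewrite along those lines, as a sketch reusing the names from the excerpt above (dot is from LinearAlgebra):

```julia
# dot conjugates its first argument, so this matches ψ_scratch' * ψk band by band
# and avoids the dropdims.
for iband in axes(ψk, 2)
    C[iproj, iband] = dot(ψ_scratch, @view ψk[:, iband])
end
```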

    ops::Vector{NonlocalOperator}
end

struct AtomProjectors{T <: Real,
Member:

Let's see first to what extent things show up in benchmarking, but we should perhaps rethink these structs (where to put scratch arrays, where to store them, how they can be reused across various data structures and tasks, etc.).

for iband in size(B, 2)
    C[:, iband] .+= ψ_scratch .* (α * B[iproj, iband])
for iband in axes(B, 2)
    @views C[:, iband] .+= ψ_scratch .* (α * B[iproj, iband])
Member:

5-argument mul! ?

Collaborator (Author):

It's not obvious, but this is actually an axpy-type call (BLAS level 1 :( ).
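For reference, a sketch of the two forms being discussed, with hypothetical shapes; the outer-product mul! is just one way the update could be phrased, not the PR's code.

```julia
using LinearAlgebra

ψ_scratch = randn(ComplexF64, 100)
B = randn(ComplexF64, 5, 8)
α, iproj = 2.0 + 0.0im, 3

# axpy-type form: one BLAS-1 call per band.
C1 = zeros(ComplexF64, 100, 8)
for iband in axes(B, 2)
    axpy!(α * B[iproj, iband], ψ_scratch, @view C1[:, iband])
end

# The same update as a single rank-1, 5-argument mul!: C2 = α * ψ_scratch * transpose(B[iproj, :]) + C2.
C2 = zeros(ComplexF64, 100, 8)
mul!(C2, reshape(ψ_scratch, :, 1), transpose(@view B[iproj, :]), α, true)

@assert C1 ≈ C2
```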

test/PspUpf.jl (outdated)
end
end

@testitem "Test nonlocal term operations" tags=[:psp] setup=[mPspUpf] begin
Collaborator (Author):

Not sure about this test. The goal was just for me to be able to try my changes with <5s waiting time.

D = build_projection_coefficients(basis, psp_groups)
P = build_projection_vectors(basis, kpt, psp_groups, positions)
P_minus_q = build_projection_vectors(basis, kpt_minus_q, psp_groups, positions)
# TODO: probably needs an extra parenthesis to first compute P'ψ
Collaborator (Author):

So... I noticed that Julia has custom * overloads for matrix products with more than two operands, which pick the association that minimizes the total cost. Presumably it will almost always compute P_minus_q' * ψk first. But if it doesn't, we are in trouble, so this should probably be changed to the explicit (P_minus_q' * ψk).
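A sketch of the concern with hypothetical sizes (not the PR's actual expression):

```julia
using LinearAlgebra

P_minus_q = randn(ComplexF64, 10_000, 13)   # plane waves × projectors
D  = randn(ComplexF64, 13, 13)
ψk = randn(ComplexF64, 10_000, 32)

res_auto     = D * P_minus_q' * ψk      # association picked by Julia's 3-argument *
res_explicit = D * (P_minus_q' * ψk)    # explicit: form the small 13×32 block first
@assert res_auto ≈ res_explicit
```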

ST <: AbstractVector{Complex{T}},
PT <: AtomProjectors,
} <: AbstractMatrix{Complex{T}}
# TODO: this is a real problem wrt. thread-safety, no?
Collaborator (Author):

How bad is this? DftHamiltonianBlock should handle it fine, but GenericHamiltonianBlock seems to parallelize over bands, which will cause problems!

@Technici4n marked this pull request as ready for review on July 10, 2025, 09:34.
@Technici4n (Collaborator, Author):

Cleaned up this PR quite a bit. Benchmarks are still missing of course. 😄

@mfherbst (Member) left a comment:

I've only done a quick review for now.

We should look a bit more at performance (also on GPU) and see whether there are a few small things here and there we can do to make the code a little cleaner.

Comment on lines +207 to +216
# Add a level of indirection here to avoid ambiguity with the mul! method provided by Julia.
LinearAlgebra.mul!(C::AbstractVector, A::Adjoint{<:Any, <:NonlocalProjectors},
                   ψk::AbstractVector) = _mul!(C, A, ψk)
LinearAlgebra.mul!(C::AbstractMatrix, A::Adjoint{<:Any, <:NonlocalProjectors},
                   ψk::AbstractMatrix) = _mul!(C, A, ψk)

LinearAlgebra.mul!(C::AbstractVector, A::NonlocalProjectors, B::AbstractVector,
                   α::Number, β::Number) = _mul!(C, A, B, α, β)
LinearAlgebra.mul!(C::AbstractMatrix, A::NonlocalProjectors, B::AbstractMatrix,
                   α::Number, β::Number) = _mul!(C, A, B, α, β)
Member:

Why is this needed ? Surprises me a bit.

Comment on lines +225 to +226
for at in A.parent.atoms
    for proj in eachcol(at.projectors)
Member:

join these loops to avoid too deep nesting.
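For example, Julia's comma syntax flattens the nesting (just the suggestion above, sketched):

```julia
for at in A.parent.atoms, proj in eachcol(at.projectors)
    # ... body unchanged ...
end
```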

else
    @view(B[iproj:iproj+nproj-1, :])
end
mul!(C, Pwork, Bwork, α, 1)
Member:

why not put β here ?

Comment on lines +256 to +260
Bwork = if BT <: AbstractVector
    @view(B[iproj:iproj+nproj-1])
else
    @view(B[iproj:iproj+nproj-1, :])
end
Member:

this looks weird.

@Technici4n (Collaborator, Author):

Yes, I am not sure about the current implementation; it is quite complex.

We can additionally save quite a bit of RAM if we allow ourselves to recompute the spherical harmonics on the fly. In that case we only need to store one form factor per angular momentum component l instead of 2l+1. For PD silicon this would bring the number of projector form factors that need to be stored down from 18 to 6.

@mfherbst (Member):

> We can additionally save quite a bit of RAM if we allow ourselves to recompute the spherical harmonics on the fly. In that case we only need to store one form factor per angular momentum component l instead of 2l+1. For PD silicon this would bring the number of projector form factors that need to be stored down from 18 to 6.

I think what we are getting at here is the usual issue of balancing memory and compute. I think this can be reasonable, but maybe we also want some way to have both options ? Or is this too much of a maintenance burden at the moment ?



Development

Successfully merging this pull request may close these issues:

  • Stress computation can use a lot of RAM
