Reduce RAM usage of nonlocal term #1088
base: master
Conversation
Here is an example showing the problem: oxygen from PseudoDojo has 13 projectors. Now assume we perform a computation of a typical size; with 8 oxygen atoms the memory usage goes to 16 gigabytes. With this PR, the goal is that it should remain around 2 gigabytes.
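As a rough back-of-the-envelope model (a sketch, not numbers from the PR), dense storage of the projection vectors costs about $N_k \cdot N_{\text{atoms}} \cdot N_{\text{proj}} \cdot N_G \cdot 16$ bytes for Complex{Float64} entries, so the footprint grows linearly with the number of atoms even though atoms of the same species share identical projector shapes up to a phase.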
So this is a performance/memory tradeoff. I did it this way because then you use full BLAS 3 operations. I remember something like this gave a huge boost in ABINIT, but possibly it was BLAS-3-ifying the operation over all bands at once that gave the performance boost (as opposed to band by band, as it was done before). So I'm not opposed to switching, but keep the performance implications in mind and benchmark it.
Nice reference, maybe that lets me find how ABINIT does it. We also need to be careful not to destroy GPU performance. 😓
If you want to look, it's in the opernla routine (I worked on that code a long time ago, it still haunts my dreams) |
mfherbst left a comment
Nice. Yeah, the BLAS 2 versus BLAS 3 tradeoff is something we should understand a little better via benchmarks before we commit to one way of doing this.
src/terms/nonlocal.jl
Outdated
    for proj in eachcol(p.projectors)
        # TODO: what allocates here? and does it use BLAS?
        ψ_scratch .= p.structure_factors .* proj
        C[iproj, :] .= dropdims(ψ_scratch' * ψk; dims=1)
The dropdims is weird. Is this not just a dot product?
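For comparison, a sketch of the plain dot-product formulation (dummy data; names mirror the snippet above, but this is not the PR's code):

    using LinearAlgebra

    n_G, n_bands = 100, 4
    ψ_scratch = randn(ComplexF64, n_G)        # phase-twisted projector column
    ψk        = randn(ComplexF64, n_G, n_bands)
    C         = zeros(ComplexF64, 3, n_bands)
    iproj     = 1

    # dot conjugates its first argument, so this matches ψ_scratch' * ψk band by band,
    # without the 1 × n_bands intermediate and the dropdims.
    for (iband, ψband) in enumerate(eachcol(ψk))
        C[iproj, iband] = dot(ψ_scratch, ψband)
    end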
src/terms/nonlocal.jl
Outdated
        ops::Vector{NonlocalOperator}
    end

    struct AtomProjectors{T <: Real,
Let's see first to what extent things show up in benchmarking, but we should perhaps rethink these structs (where to put scratch arrays, where to store them, how they can be reused across various data structures and tasks, etc.).
src/terms/nonlocal.jl
Outdated
    -    for iband in size(B, 2)
    -        C[:, iband] .+= ψ_scratch .* (α * B[iproj, iband])
    +    for iband in axes(B, 2)
    +        @views C[:, iband] .+= ψ_scratch .* (α * B[iproj, iband])
5-argument mul! ?
It's not obvious, but this is actually an axpy-type call. (BLAS level 1 :( )
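A sketch of spelling that accumulation as explicit axpy calls (dummy sizes, not the PR's code):

    using LinearAlgebra

    n_G, n_bands = 100, 4
    ψ_scratch = randn(ComplexF64, n_G)
    B = randn(ComplexF64, 5, n_bands)
    C = zeros(ComplexF64, n_G, n_bands)
    α, iproj = 2.0 + 0.0im, 1

    # Each band update is y ← a*x + y with a = α * B[iproj, iband], i.e. a level-1 axpy.
    for iband in axes(B, 2)
        axpy!(α * B[iproj, iband], ψ_scratch, @view(C[:, iband]))
    end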
test/PspUpf.jl
Outdated
        end
    end

    @testitem "Test nonlocal term operations" tags=[:psp] setup=[mPspUpf] begin
Not sure about this test. The goal was just for me to be able to try my changes with <5s waiting time.
src/terms/nonlocal.jl
Outdated
    D = build_projection_coefficients(basis, psp_groups)
    P = build_projection_vectors(basis, kpt, psp_groups, positions)
    P_minus_q = build_projection_vectors(basis, kpt_minus_q, psp_groups, positions)
    # TODO: probably needs an extra parenthesis to first compute P'ψ
So... I noticed that Julia has custom * overloads for matrix multiplication with more than two operands, which select the chain of operations that minimizes the total cost. Presumably it will almost always compute P_minus_q' * ψk first. But if it doesn't, we are in trouble, so this should probably be changed to (P_minus_q' * ψk) explicitly.
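A sketch of the explicit association (shapes assumed; the real expression in nonlocal.jl may involve different factors):

    using LinearAlgebra

    n_G, n_proj, n_bands = 1000, 10, 8
    P         = randn(ComplexF64, n_G, n_proj)
    P_minus_q = randn(ComplexF64, n_G, n_proj)
    D         = randn(ComplexF64, n_proj, n_proj)
    ψk        = randn(ComplexF64, n_G, n_bands)

    # Evaluating right-to-left keeps every intermediate at most n_proj × n_bands,
    # instead of relying on the cost heuristic in Julia's chained *.
    result = P * (D * (P_minus_q' * ψk))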
src/terms/nonlocal.jl
Outdated
        ST <: AbstractVector{Complex{T}},
        PT <: AtomProjectors,
    } <: AbstractMatrix{Complex{T}}
        # TODO: this is a real problem wrt. thread-safety, no?
How bad is this? DftHamiltonianBlock should handle it fine, but GenericHamiltonianBlock seems to be parallelizing over bands which will cause problems!
Cleaned up this PR quite a bit. Benchmarks are still missing of course. 😄
mfherbst left a comment
I've only done a quick review for now.
We should look a bit more at performance (also on GPU) and see whether there are a few small things here and there we can do to make the code a little cleaner.
    # Add a level of indirection here to avoid ambiguity with the mul! method provided by Julia.
    LinearAlgebra.mul!(C::AbstractVector, A::Adjoint{<:Any, <:NonlocalProjectors},
                       ψk::AbstractVector) = _mul!(C, A, ψk)
    LinearAlgebra.mul!(C::AbstractMatrix, A::Adjoint{<:Any, <:NonlocalProjectors},
                       ψk::AbstractMatrix) = _mul!(C, A, ψk)

    LinearAlgebra.mul!(C::AbstractVector, A::NonlocalProjectors, B::AbstractVector,
                       α::Number, β::Number) = _mul!(C, A, B, α, β)
    LinearAlgebra.mul!(C::AbstractMatrix, A::NonlocalProjectors, B::AbstractMatrix,
                       α::Number, β::Number) = _mul!(C, A, B, α, β)
Why is this needed? Surprises me a bit.
    for at in A.parent.atoms
        for proj in eachcol(at.projectors)
Join these loops to avoid overly deep nesting.
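For reference, the fused form in Julia is a single for with a comma; a tiny self-contained illustration (toy data, not the PR's types):

    # The comma-joined header is equivalent to the two nested loops above.
    atoms = [(projectors = randn(4, 2),), (projectors = randn(4, 3),)]
    for at in atoms, proj in eachcol(at.projectors)
        @show sum(proj)
    end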
        else
            @view(B[iproj:iproj+nproj-1, :])
        end
        mul!(C, Pwork, Bwork, α, 1)
Why not put β here?
        Bwork = if BT <: AbstractVector
            @view(B[iproj:iproj+nproj-1])
        else
            @view(B[iproj:iproj+nproj-1, :])
        end
this looks weird.
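One way the branch could be avoided (just an idea, not what the PR does): selectdim returns the analogous row-range view for both the vector and the matrix case.

    # selectdim(B, 1, r) is view(B, r) for a Vector and view(B, r, :) for a Matrix.
    iproj, nproj = 2, 3
    Bvec = randn(ComplexF64, 10)
    Bmat = randn(ComplexF64, 10, 4)
    Bwork_vec = selectdim(Bvec, 1, iproj:iproj+nproj-1)
    Bwork_mat = selectdim(Bmat, 1, iproj:iproj+nproj-1)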
Yes, I am not sure about the current implementation, it's quite complex. We can additionally save quite a bit of RAM if we allow ourselves to recompute the spherical harmonics on-the-fly. In that case we only need to store one form factor per angular momentum component l instead of 2l+1. For PD silicon this would bring the number of projector form factors that need to be stored down from 18 to 6.
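Spelling the count out (assuming the PseudoDojo silicon pseudopotential has two radial projectors per angular momentum channel $l = 0, 1, 2$): one form factor per $m$ component means $2(2\cdot 0+1) + 2(2\cdot 1+1) + 2(2\cdot 2+1) = 2 + 6 + 10 = 18$ stored columns, while storing only the radial parts and rebuilding the spherical harmonics on-the-fly needs $2 + 2 + 2 = 6$.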
I think what we are getting at here is the usual issue of balancing memory and compute. I think this can be reasonable, but maybe we also want some way to have both options? Or is that too much of a maintenance burden at the moment?
Still super super WIP.
The problem lies with how the P matrices are stored in the nonlocal term. Each column of the P matrix corresponds to a projector (and each psp usually has multiple projectors). Multiple atoms with the same psp have the same projectors except for a structure factor $e^{-i(G+k)\cdot r_{\text{atom}}}$.

The basic idea in this PR is to store the projectors without the phase factor and apply the structure factor on-the-fly. The cleanest way to do it that I could find is to use a custom matrix-like type for P, such that from the outside P still looks like a dense matrix.

Not sure yet what the best way to apply the structure factor on-the-fly would be: either we use a temporary vector to hold ψ[:, iband] .* structure_factor, or we write a loop, but then the question is whether it will work on the GPU.

Partially fixes #1032.
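To make the idea concrete, here is a minimal sketch of the approach (the struct, field, and function names below are made up for illustration; the actual implementation in this PR differs):

    using LinearAlgebra

    # Projectors are stored once per species; the per-atom phase e^{-i(G+k)·r_atom}
    # lives in a small vector of structure factors. All atoms of one species can
    # reference the same projector matrix, so it is only stored once.
    struct AtomProjectorsSketch{T <: Real}
        projectors::Matrix{Complex{T}}         # n_G × n_proj, shared by the species
        structure_factors::Vector{Complex{T}}  # length n_G, specific to this atom
    end

    # C = P' * ψk, materializing only one phase-twisted projector column at a time
    # in a scratch vector instead of a full dense P matrix.
    function projections!(C::AbstractMatrix, atoms::Vector{<:AtomProjectorsSketch},
                          ψk::AbstractMatrix)
        scratch = similar(ψk, size(ψk, 1))
        iproj = 1
        for at in atoms, proj in eachcol(at.projectors)
            scratch .= at.structure_factors .* proj
            for (iband, ψband) in enumerate(eachcol(ψk))
                C[iproj, iband] = dot(scratch, ψband)
            end
            iproj += 1
        end
        C
    end

    # Toy usage: two "atoms" of the same species sharing one projector matrix.
    n_G, n_proj, n_bands = 1000, 13, 8
    shared_projs = randn(ComplexF64, n_G, n_proj)
    atoms = [AtomProjectorsSketch(shared_projs, cis.(-2π .* rand(n_G))) for _ in 1:2]
    ψk = randn(ComplexF64, n_G, n_bands)
    C  = zeros(ComplexF64, 2n_proj, n_bands)
    projections!(C, atoms, ψk)

With, say, 8 oxygen atoms sharing one pseudopotential this stores the 13 projector columns once plus 8 small structure-factor vectors, instead of 8 × 13 dense phase-twisted columns.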
TODO: