Get distributed case to run without proc discontinuities. #501

Open — HendriceH wants to merge 90 commits into
Conversation
Additional changes:

- Add a test to the environment if no MPI is available
- set mpi_home on intel to avoid IMPI
- wip: add processor boundary
- wip: stacktracing
- wip: dist logging support
- wip: add processor boundary
- add init tests
- fix test linkage
- build Ginkgo with MPI support
- wip: distributed gko matrix
- fix test linkage
- wip: store neighbour rank in boundaryMesh
- wip: distributed matrix class
- MPI-01: computeCommunicationPattern and communicateBoundaryData (scalar + Vec3) now derive recvCounts via MPI_Alltoall instead of reusing sendCounts (see the sketch below)
- MPI-02: correctBoundaryConditions syncs the boundaryData().value() proc-tail into the internalVector() proc-face slots after the halo exchange (syncProcFaceInternalVector)
- MPI-03: the recvBuffer in both communicateBoundaryData overloads is sized to the actual receive count (totalRecv), not the full boundaryData.size()
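A minimal sketch of the MPI-01 pattern, assuming sendCounts holds one entry per peer rank (the helper name is hypothetical; only MPI_Alltoall itself comes from the commit text): each rank publishes how much it will send, and the all-to-all transposes that into how much it will receive.

```cpp
#include <mpi.h>
#include <vector>

// Hypothetical helper illustrating MPI-01: recvCounts must be derived from the
// peers' sendCounts via MPI_Alltoall; reusing the local sendCounts is only
// correct when the communication pattern happens to be symmetric.
std::vector<int> deriveRecvCounts(const std::vector<int>& sendCounts, MPI_Comm comm)
{
    std::vector<int> recvCounts(sendCounts.size(), 0);
    // Each rank contributes one int per peer; entry i of recvCounts ends up
    // holding how many entries rank i will send to this rank.
    MPI_Alltoall(
        sendCounts.data(), 1, MPI_INT,
        recvCounts.data(), 1, MPI_INT,
        comm);
    return recvCounts;
}
```

These counts then feed the displacements for the subsequent MPI_Alltoallv halo exchange.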
…ormula

Task 1 — N-rank implementation:
- Replace the rank==0/1/2 if/else chain (3-rank-only) with closed-form formulas (see the sketch after this block)
- firstFace = r*(localFaces+1) for all ranks
- Rank 0: firstBoundaryFace = R*(localFaces+1)-1, secondBoundaryFace = localFaces
- Middle ranks: firstBoundaryFace = r*(localFaces+1)-1, secondBoundaryFace = r*(localFaces+1)+localFaces
- Last rank: firstBoundaryFace = R*(localFaces+1), secondBoundaryFace = (R-1)*(localFaces+1)-1
- Remove the 'FIXME this only works for 3 ranks' comment
- Preserve the 3-rank oracle (partitioning.cpp:218-230); hand-verify the R=2 and R=4 oracles

Task 2 — Doxygen doc block:
- Replace the single-line @brief with a comprehensive derivation doc
- Embed the closed-form formulas, the sign-flip rule, and the R=2/3/4 oracle tables
- Reference 02-02-SUMMARY.md and the 02-RESEARCH.md NRANK-01 section for traceability
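A direct transcription of the Task 1 formulas into code, purely for illustration (the struct and function names are hypothetical, not the NeoN API; r is the current rank, R the total rank count, and each rank owns localFaces internal faces):

```cpp
#include <cstdint>

// Hypothetical transcription of the closed-form indexing from the commit text.
struct ProcFaceIndices
{
    std::int64_t firstFace;
    std::int64_t firstBoundaryFace;
    std::int64_t secondBoundaryFace;
};

ProcFaceIndices computeProcFaceIndices(std::int64_t r, std::int64_t R, std::int64_t localFaces)
{
    ProcFaceIndices idx{};
    idx.firstFace = r * (localFaces + 1); // same expression for every rank

    if (r == 0)
    {
        idx.firstBoundaryFace = R * (localFaces + 1) - 1;
        idx.secondBoundaryFace = localFaces;
    }
    else if (r == R - 1) // last rank
    {
        idx.firstBoundaryFace = R * (localFaces + 1);
        idx.secondBoundaryFace = (R - 1) * (localFaces + 1) - 1;
    }
    else // middle ranks
    {
        idx.firstBoundaryFace = r * (localFaces + 1) - 1;
        idx.secondBoundaryFace = r * (localFaces + 1) + localFaces;
    }
    return idx;
}
```

Checking r = 0, 1, 2 against the preserved 3-rank oracle in partitioning.cpp is the cheapest sanity test before trusting the hand-verified R=2 and R=4 cases.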
…xDist

The non-local COO matrix was sliced to numNonLocalElements (unique ghost cells after imap deduplication) instead of nRecv (total proc-face count). When a ghost cell is adjacent to multiple proc faces, nRecv > numNonLocalElements and the trailing proc-face contributions were silently dropped, making the distributed matrix structurally incomplete.

Fix: use nRecv for the vals/rows/cols arrays and pass all nRecv column IDs through map_to_local. Duplicate global ghost-cell IDs correctly fold to the same local non-local column index, so the COO matrix gains the missing contributions without changing the non-local column-space size.
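A schematic of the sizing fix, under the assumption that map_to_local behaves like a first-seen global-ID-to-local-column map (modelled here with std::unordered_map; the function and argument names are illustrative):

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Schematic of the fix: size the non-local COO arrays by nRecv (one entry per
// proc face) and let duplicate global ghost-cell IDs fold onto the same local
// non-local column, so no proc-face contribution is dropped.
struct NonLocalCoo
{
    std::vector<double> vals;
    std::vector<std::int64_t> rows;
    std::vector<std::int64_t> cols;
};

NonLocalCoo buildNonLocalCoo(
    const std::vector<std::int64_t>& ownerCells,     // local row per proc face, size nRecv
    const std::vector<std::int64_t>& ghostGlobalIds, // global ghost-cell ID per proc face, size nRecv
    const std::vector<double>& offDiagCoeffs)        // coupling coefficient per proc face, size nRecv
{
    const std::size_t nRecv = ghostGlobalIds.size();
    NonLocalCoo coo;
    coo.vals.resize(nRecv); // previously sized to numNonLocalElements, which is
    coo.rows.resize(nRecv); // too small when a ghost cell touches several proc faces
    coo.cols.resize(nRecv);

    std::unordered_map<std::int64_t, std::int64_t> mapToLocal; // global ghost ID -> local column
    for (std::size_t f = 0; f < nRecv; ++f)
    {
        auto it = mapToLocal
                      .try_emplace(ghostGlobalIds[f], static_cast<std::int64_t>(mapToLocal.size()))
                      .first;
        coo.vals[f] = offDiagCoeffs[f];
        coo.rows[f] = ownerCells[f];
        coo.cols[f] = it->second; // duplicates fold to the same non-local column index
    }
    return coo;
}
```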
…overload

- faceToMatrixAddress.hpp/cpp: replace three FIXME comments with verified explanations (diagOffset is correct in distributed mode; communication is already done via commPattern.recvIdx; procColIdx holds global ghost IDs)
- linearSystem.hpp: document that the single-arg overload's atomic RHS subtraction is disabled (pending consistent serial/parallel semantics); add a two-arg overload that accepts procFaceGhostValues and subtracts D_f * x_G from the RHS for each proc-face non-local coupling (see the sketch below)
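An illustrative sketch of what the two-arg overload does to the RHS (not the NeoN signature; the argument names are stand-ins): for every proc face the off-diagonal coupling D_f multiplies the ghost value x_G received from the neighbour rank, and that product is moved to the right-hand side so the local solve only sees local unknowns.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper mirroring the two-arg overload's effect: subtract
// D_f * x_G from the owner cell's RHS entry for each proc-face coupling.
void subtractProcFaceCoupling(
    std::vector<double>& rhs,                       // local RHS, one entry per local cell
    const std::vector<std::size_t>& ownerCells,     // local owner cell per proc face
    const std::vector<double>& offDiagCoeffs,       // D_f per proc face
    const std::vector<double>& procFaceGhostValues) // x_G per proc face, from the halo exchange
{
    for (std::size_t f = 0; f < procFaceGhostValues.size(); ++f)
    {
        rhs[ownerCells[f]] -= offDiagCoeffs[f] * procFaceGhostValues[f];
    }
}
```

In the real overload the ghost values arrive via the halo exchange described in the MPI-01/MPI-03 fixes above.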
…or NeoN distributed tests

Registers the operator and partitioning distributed test binaries at MPI sizes 2 (ENABLED — the synthetic mesh fixtures require no external decomposition) and 4 (DISABLED — a placeholder for Phase 5 once 4-rank fixtures exist). Resolves the NRANK-01 verification_derivation warning for R=2 coverage.
…ance

- Replace the bare unconditional {…} proc-face block with a guarded `if (nProcBoundaryFaces > 0)` block that calls exchangeProcOwnerDistance and writes `1/(d_own + d_nei)` instead of `1/|cf - c_own|` (see the sketch after this list)
- Remove [[maybe_unused]] from both the exec and deltaCoeffs parameters; both are now actually used (exec goes into exchangeProcOwnerDistance and parallelFor; deltaCoeffs is accessed via internalVector().view())
- Mirrors the updateNonOrthDeltaCoeffs proc-face block (lines 357-387) exactly, dropping only the non-orthogonal approxCellToCell floor in favour of a simple ROOTVSMALL guard
- Uses the bm.cf()/sf()/magSf() compressed boundary-tail accessors throughout; avoids mesh_.faceCentres()/faceAreas()/magFaceAreas() (the OpenFOAM full-size arrays)
- Fixes the asymmetric Laplacian at processor-face decomposition cuts on graded meshes; a precondition for L∞ < 1e-6 vs serial in the Phase 5 cylinder3D run
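A simplified, host-side sketch of the guarded proc-face block (the real code runs through exec/parallelFor and the bm.cf()/sf()/magSf() accessors; the container types and the dOwn/dNei arrays below are illustrative stand-ins for what exchangeProcOwnerDistance provides):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Illustrative sketch: update delta coefficients at processor-boundary faces
// using the full owner-to-neighbour distance across the decomposition cut.
void updateProcFaceDeltaCoeffs(
    std::vector<double>& deltaCoeffs,            // per-face delta coefficients
    const std::vector<std::size_t>& procFaceIds, // face indices of proc-boundary faces
    const std::vector<double>& dOwn,             // |cf - c_own| on this rank
    const std::vector<double>& dNei)             // |cf - c_nei| received from the neighbour rank
{
    constexpr double ROOTVSMALL = 1.0e-150; // guard against division by zero
    const std::size_t nProcBoundaryFaces = procFaceIds.size();

    if (nProcBoundaryFaces > 0) // guarded block: skipped entirely on serial runs
    {
        for (std::size_t i = 0; i < nProcBoundaryFaces; ++i)
        {
            // 1/(d_own + d_nei) spans the full owner-to-neighbour distance,
            // whereas 1/|cf - c_own| only covered the owner side and made the
            // Laplacian asymmetric at cuts on graded meshes.
            deltaCoeffs[procFaceIds[i]] = 1.0 / std::max(dOwn[i] + dNei[i], ROOTVSMALL);
        }
    }
}
```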
…ectBoundaryConditions

The MPI-02 block copied RECEIVED ghost values from the boundaryData().value() proc-tail into the internalVector() proc-face slots after the Alltoallv exchange. This inverted the sign convention that the operators depend on:

- computeUpwindInterpolationWeights reads faceFlux.internalVector()[procFaceIdx] expecting the LOCAL phi (positive = leaving the local cell). After the sync it received the NEIGHBOR's phi, which has the opposite sign, so the weights were inverted.
- computeDivProcBoundImpl reads faceFluxV[facei] expecting the LOCAL flux for F_local, diagContrib, and offDiagContrib. After the sync the wrong F_local produced wrong matrix entries and wrong non-local RHS coupling.
- surfaceIntegrate reads internalVector() for proc faces expecting the LOCAL flux.

Fix: remove the parallelFor + fence block. The internalVector() proc-face slots now retain LOCAL values throughout. Ghost values (received from the neighbour after the exchange) remain in the boundaryData().value() proc-tail, which is the correct location for the operators that need them (computeLinearInterpolation, computeUpwindInterpolation).
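Two tiny illustrations of the storage convention the fix restores (the containers are stand-ins for internalVector() and the boundaryData().value() proc-tail; the function names are hypothetical):

```cpp
#include <cstddef>
#include <vector>

// Proc-face slots of the internal vector hold the LOCAL flux; the neighbour's
// (ghost) values live only in the boundary-data proc-tail.
double upwindWeightAtProcFace(
    const std::vector<double>& internalVector, // LOCAL phi per face, incl. proc-face slots
    std::size_t procFaceIdx)
{
    // Positive local flux means the flow leaves the local cell, so the local
    // (owner-side) value is upwind. Writing the neighbour's phi here instead
    // flips the sign and therefore the weight.
    return internalVector[procFaceIdx] >= 0.0 ? 1.0 : 0.0;
}

double linearInterpolateAtProcFace(
    const std::vector<double>& ownerCellValues,   // local cell values
    const std::vector<double>& boundaryDataValue, // ghost values in the proc-tail
    std::size_t ownerCell,
    std::size_t procTailIdx,
    double weight)
{
    // Operators that need the neighbour's value read it from the proc-tail,
    // not from the internal vector.
    return weight * ownerCellValues[ownerCell] + (1.0 - weight) * boundaryDataValue[procTailIdx];
}
```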
Two GPU-parallel correctness fixes:

1. Kokkos device binding (initialization.hpp): NeoN::initialize() now uses MPI_Comm_split_type(MPI_COMM_TYPE_SHARED) to obtain the local rank within the node and calls InitializationSettings::set_device_id(localRank % num_devices) before Kokkos::initialize. Without this, all MPI ranks default to GPU 0 on multi-GPU nodes, leaving GPUs 1..N-1 idle and causing CUDA context contention. Skipped for serial runs, CPU-only builds, and when KOKKOS_DEVICE_ID is already set. (A sketch of the binding step follows below.)

2. GPU-aware MPI diagnostic switch (boundaryData.hpp + CMakeLists): communicateBoundaryData already passes device pointers to MPI_Alltoallv, which is correct for CUDA/ROCm-aware MPI. Added a compile-time guard -DNeoN_MPI_HOST_STAGE=ON that stages halo exchanges through host memory (device->host->MPI->host->device via Vector::copyToHost/copyToExecutor). This enables correctness verification on systems where MPI is not GPU-aware. Default is OFF; production GPU runs use the GPU-aware device-pointer path.
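A minimal sketch of the device-binding step from item 1, assuming MPI is already initialized and numDevices is obtained elsewhere (e.g. from cudaGetDeviceCount or the resource manager); this is not the NeoN::initialize() code itself:

```cpp
#include <mpi.h>
#include <Kokkos_Core.hpp>

// Bind each rank to a GPU by node-local rank, round-robin over the visible
// devices; without this every rank lands on device 0.
void bindRankToDevice(int numDevices)
{
    // Ranks sharing a node see the same GPUs, so split the world communicator
    // by shared-memory domain and use the node-local rank for the binding.
    MPI_Comm nodeComm = MPI_COMM_NULL;
    int worldRank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &worldRank);
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, worldRank, MPI_INFO_NULL, &nodeComm);

    int localRank = 0;
    MPI_Comm_rank(nodeComm, &localRank);
    MPI_Comm_free(&nodeComm);

    Kokkos::InitializationSettings settings;
    settings.set_device_id(localRank % numDevices);
    Kokkos::initialize(settings);
}
```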
…MPI_HOST_STAGE

GPU-01 (deferred at initialization): std::abs, std::sqrt, and std::max are host-only functions that produce undefined behavior or silent garbage in NEON_LAMBDA GPU kernels. basicGeometryScheme.cpp: all NEON_LAMBDA blocks now use Kokkos::abs and Kokkos::max. vec3.hpp::mag uses Kokkos::sqrt. scalar.hpp::mag uses Kokkos::abs. Adds the Kokkos_MathematicalFunctions.hpp include where needed.

Remove the NeoN_MPI_HOST_STAGE option (revert of the previous commit): the user confirmed CUDA-aware MPI is available on the HPC system, so the host-staging diagnostic path is not needed. The Kokkos device-ID binding (initialization.hpp) is retained.
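A standalone illustration of the GPU-01 rule in plain Kokkos (NEON_LAMBDA is NeoN's wrapper around the same idea; the function and view names here are made up): device kernels must call the Kokkos math functions, not the std:: ones.

```cpp
#include <Kokkos_Core.hpp>

// Illustrative kernel: compute a clamped inverse distance on the device.
void clampedInverseDistance(Kokkos::View<double*> dist, Kokkos::View<double*> coeffs)
{
    constexpr double ROOTVSMALL = 1.0e-150;
    Kokkos::parallel_for(
        "clampedInverseDistance", dist.extent(0), KOKKOS_LAMBDA(const int i) {
            // Kokkos::abs / Kokkos::max have device overloads; std::abs /
            // std::max are host-only and would be UB or silent garbage here.
            coeffs(i) = 1.0 / Kokkos::max(Kokkos::abs(dist(i)), ROOTVSMALL);
        });
}
```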
… distributed MPI

Adds a CMake option NeoN_GINKGO_HOST_STAGE (default OFF) that compiles with NEON_GINKGO_HOST_STAGE=1 and sets forceHostBuffer=true in both the scalar and Vec3 Ginkgo distributed solver paths. This stages all of Ginkgo's internal MPI communications through host memory, bypassing GPU-direct MPI. Use -DNeoN_GINKGO_HOST_STAGE=ON to test whether Ginkgo's CUDA-aware MPI path (used when the switch is OFF) is the cause of GPU divergence. Production GPU runs should leave this OFF.
greole reviewed May 13, 2026

greole reviewed May 13, 2026

greole reviewed May 13, 2026
25da0df to 5b427dc
0ccf7b8 to 6932198
A long clause session with the gsd skill to find the issue with processor discontinuities.