
Get distributed case to run without proc discontinuities. #501

Open

HendriceH wants to merge 90 commits into stack/distributed from fix/testsRebase

Conversation

@HendriceH
Collaborator

Long Claude session with the gsd skill to find the issue with processor discontinuities.

greole added 30 commits April 29, 2026 17:44
Additional changes:
 * Add test to environment if no MPI is available
Additional changes:
 * Add test to environment if no MPI is available
set mpi_home on intel to avoid IMPI

wip add processor boundary

wip stacktracing

wip dist logging support

wip add processor boundary

add init tests

fix test linkage

build Ginkgo with MPI support

wip distributed gko matrix

fix test linkage

wip store neighbour rank in boundaryMesh

wip distributed matrix class
HendriceH added 15 commits May 7, 2026 11:07
- MPI-01: computeCommunicationPattern and communicateBoundaryData (scalar+Vec3) now
  derive recvCounts via MPI_Alltoall instead of reusing sendCounts; see the sketch
  after this list
- MPI-02: correctBoundaryConditions syncs boundaryData().value() proc-tail into
  internalVector() proc-face slots after halo exchange (syncProcFaceInternalVector)
- MPI-03: recvBuffer in both communicateBoundaryData overloads sized to actual
  recv count (totalRecv), not full boundaryData.size()
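
A minimal sketch of the MPI-01 recvCounts derivation, assuming only standard MPI calls; the function name and surrounding setup are illustrative, not NeoN's actual interface:

```cpp
#include <mpi.h>
#include <vector>

std::vector<int> deriveRecvCounts(const std::vector<int>& sendCounts, MPI_Comm comm)
{
    int nRanks = 0;
    MPI_Comm_size(comm, &nRanks);
    std::vector<int> recvCounts(nRanks, 0);
    // Each rank tells every other rank how much it will send; the transpose
    // of that exchange is exactly the per-rank receive count. Reusing
    // sendCounts as recvCounts is only valid for symmetric patterns.
    MPI_Alltoall(sendCounts.data(), 1, MPI_INT, recvCounts.data(), 1, MPI_INT, comm);
    return recvCounts;
}
```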
…ormula

Task 1 — N-rank implementation:
- Replace rank==0/1/2 if/else chain (3-rank-only) with closed-form formulas
  (sketched after this message)
- firstFace = r*(localFaces+1) for all ranks
- Rank 0: firstBoundaryFace = R*(localFaces+1)-1, secondBoundaryFace = localFaces
- Middle: firstBoundaryFace = r*(localFaces+1)-1, secondBoundaryFace = r*(localFaces+1)+localFaces
- Last: firstBoundaryFace = R*(localFaces+1), secondBoundaryFace = (R-1)*(localFaces+1)-1
- Remove 'FIXME this only works for 3 ranks' comment
- Preserve 3-rank oracle (partitioning.cpp:218-230); hand-verify R=2 and R=4 oracles

Task 2 — Doxygen doc block:
- Replace single-line @brief with comprehensive derivation doc
- Embed closed-form formulas, sign-flip rule, and R=2/3/4 oracle tables
- Reference 02-02-SUMMARY.md and 02-RESEARCH.md NRANK-01 section for traceability
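
For reference, a sketch transcribing the closed-form indexing above; the struct and function names are hypothetical, not the actual partitioning.cpp interface:

```cpp
struct ProcFaceIndices
{
    int firstFace;          // first face owned by this rank
    int firstBoundaryFace;  // first processor-boundary face
    int secondBoundaryFace; // second processor-boundary face
};

// r = this rank, R = total number of ranks
ProcFaceIndices procFaceIndices(int r, int R, int localFaces)
{
    ProcFaceIndices idx{};
    idx.firstFace = r * (localFaces + 1); // identical for all ranks
    if (r == 0)
    {
        idx.firstBoundaryFace = R * (localFaces + 1) - 1;
        idx.secondBoundaryFace = localFaces;
    }
    else if (r == R - 1) // last rank
    {
        idx.firstBoundaryFace = R * (localFaces + 1);
        idx.secondBoundaryFace = (R - 1) * (localFaces + 1) - 1;
    }
    else // middle ranks
    {
        idx.firstBoundaryFace = r * (localFaces + 1) - 1;
        idx.secondBoundaryFace = r * (localFaces + 1) + localFaces;
    }
    return idx;
}
```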
…xDist

The non-local COO matrix was sliced to numNonLocalElements (unique ghost
cells after imap deduplication) instead of nRecv (total proc-face count).
When a ghost cell is adjacent to multiple proc faces, nRecv > numNonLocalElements
and the trailing proc-face contributions were silently dropped, making the
distributed matrix structurally incomplete.

Fix: use nRecv for vals/rows/cols arrays and pass all nRecv column IDs
through map_to_local. Duplicate global ghost-cell IDs correctly fold to
the same local non-local column index, so the COO matrix gains the missing
contributions without changing the non-local column-space size.
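
A sketch of the sizing fix described above; nRecv, the per-face arrays, and mapToLocal are illustrative stand-ins for the actual NeoN/Ginkgo types:

```cpp
#include <cstddef>
#include <functional>
#include <vector>

void assembleNonLocalCoo(
    std::size_t nRecv,                          // total proc-face count
    const std::vector<int>& procFaceOwner,      // local row (owner cell) per proc face
    const std::vector<double>& procFaceCoeff,   // off-diagonal coefficient per proc face
    const std::vector<long>& globalGhostIds,    // global ghost-cell ID per proc face
    const std::function<int(long)>& mapToLocal, // global ID -> local non-local column
    std::vector<double>& vals, std::vector<int>& rows, std::vector<int>& cols)
{
    // Size to nRecv, NOT to the number of unique ghost cells: a ghost cell
    // adjacent to several proc faces contributes one COO entry per face.
    vals.assign(nRecv, 0.0);
    rows.assign(nRecv, 0);
    cols.assign(nRecv, 0);
    for (std::size_t f = 0; f < nRecv; ++f)
    {
        vals[f] = procFaceCoeff[f];
        rows[f] = procFaceOwner[f];
        // Duplicate global IDs fold to the same local column, so the
        // non-local column-space size is unchanged.
        cols[f] = mapToLocal(globalGhostIds[f]);
    }
}
```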
…overload

- faceToMatrixAddress.hpp/cpp: replace three FIXME comments with verified
  explanations (diagOffset is correct in distributed mode; communication
  already done via commPattern.recvIdx; procColIdx holds global ghost IDs)
- linearSystem.hpp: document that the single-arg overload atomic RHS
  subtraction is disabled (pending consistent serial/parallel semantics);
  add a two-arg overload that accepts procFaceGhostValues and subtracts
  D_f * x_G from the RHS for each proc-face non-local coupling (sketched below)
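
A hedged sketch of what the two-arg overload's RHS update amounts to; the container names are hypothetical:

```cpp
#include <cstddef>
#include <vector>

void subtractProcFaceGhostContrib(
    std::vector<double>& rhs,                      // local RHS, one entry per cell
    const std::vector<int>& procFaceOwner,         // owner cell of each proc face
    const std::vector<double>& procFaceCoeff,      // off-diagonal coefficient D_f
    const std::vector<double>& procFaceGhostValues // ghost value x_G per proc face
)
{
    // Move the non-local coupling D_f * x_G to the right-hand side.
    for (std::size_t f = 0; f < procFaceOwner.size(); ++f)
    {
        rhs[procFaceOwner[f]] -= procFaceCoeff[f] * procFaceGhostValues[f];
    }
}
```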
…or NeoN distributed tests

Registers operator and partitioning distributed test binaries at MPI sizes
2 (ENABLED — synthetic mesh fixtures require no external decomp) and 4
(DISABLED — placeholder for Phase 5 once 4-rank fixtures exist). Resolves
the NRANK-01 verification_derivation warning for R=2 coverage.
…ance

- Replace the bare unconditional {…} proc-face block with a guarded
  `if (nProcBoundaryFaces > 0)` block that calls exchangeProcOwnerDistance
  and writes `1/(d_own + d_nei)` instead of `1/|cf - c_own|` (see the sketch
  after this list)
- Remove [[maybe_unused]] from both exec and deltaCoeffs parameters; both
  are now actually used (exec goes into exchangeProcOwnerDistance and
  parallelFor; deltaCoeffs is accessed via internalVector().view())
- Mirrors the updateNonOrthDeltaCoeffs proc-face block (lines 357-387)
  exactly, dropping only the non-orthogonal approxCellToCell floor in
  favour of a simple ROOTVSMALL guard
- bm.cf()/sf()/magSf() compressed boundary-tail accessors used throughout;
  mesh_.faceCentres()/faceAreas()/magFaceAreas() (OF full-size) avoided
- Fixes asymmetric Laplacian at processor-face decomposition cuts on graded
  meshes; precondition for L∞ < 1e-6 vs serial in Phase 5 cylinder3D run
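
A serial stand-in for the guarded proc-face block (the real code runs a parallelFor on exec, and exchangeProcOwnerDistance fills the neighbour-side distances); all names besides the formula are illustrative, and the ROOTVSMALL value is hypothetical:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

constexpr double ROOTVSMALL = 1e-150; // small-number guard against division by zero

void updateProcDeltaCoeffs(
    std::size_t nProcBoundaryFaces,
    const std::vector<double>& dOwn,  // |cf - c_own| per proc face (local side)
    const std::vector<double>& dNei,  // exchanged neighbour-side distance per proc face
    std::vector<double>& deltaCoeffs, // full deltaCoeffs array
    std::size_t procFaceOffset)       // start of the proc-face tail
{
    if (nProcBoundaryFaces > 0)
    {
        for (std::size_t f = 0; f < nProcBoundaryFaces; ++f)
        {
            // 1/(d_own + d_nei): the full owner-to-neighbour distance across
            // the decomposition cut, not just the owner half.
            deltaCoeffs[procFaceOffset + f] =
                1.0 / std::max(dOwn[f] + dNei[f], ROOTVSMALL);
        }
    }
}
```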
…ectBoundaryConditions

The MPI-02 block copied RECEIVED ghost values from boundaryData().value()
proc-tail into internalVector() proc-face slots after the Alltoallv exchange.
This inverted the sign convention that operators depend on:

- computeUpwindInterpolationWeights reads faceFlux.internalVector()[procFaceIdx]
  expecting the LOCAL phi (positive = leaving the local cell). After sync,
  it received the NEIGHBOR's phi, which has the opposite sign → inverted weights.
- computeDivProcBoundImpl reads faceFluxV[facei] expecting the LOCAL flux for
  F_local, diagContrib, and offDiagContrib. After sync, wrong F_local → wrong
  matrix entries and non-local RHS coupling.
- surfaceIntegrate reads internalVector() for proc faces expecting LOCAL flux.

Fix: remove the parallelFor + fence block. internalVector() proc-face slots
now retain LOCAL values throughout. Ghost values (received from neighbor after
exchange) remain in boundaryData().value() proc-tail, which is the correct
location for operators that need them (computeLinearInterpolation,
computeUpwindInterpolation).
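
A minimal sketch of the post-fix convention with hypothetical containers: LOCAL flux stays in the internal vector's proc-face slots, while received ghost values live only in the boundary-data proc tail:

```cpp
#include <cstddef>
#include <vector>

struct FaceFlux
{
    std::vector<double> internalVector; // proc-face slots hold LOCAL phi
    std::vector<double> boundaryValue;  // proc tail holds RECEIVED ghost phi
};

// An upwind-style weight computed from the LOCAL flux sign. Reading the
// neighbour's (sign-flipped) value here is exactly the inverted-weight bug
// that the removed sync block caused.
inline double upwindWeight(const FaceFlux& flux, std::size_t procFaceIdx)
{
    return flux.internalVector[procFaceIdx] >= 0.0 ? 1.0 : 0.0;
}
```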
@github-actions

Thank you for your PR, here are some useful tips:

HendriceH added 4 commits May 12, 2026 23:48
Two GPU-parallel correctness fixes:

1. Kokkos device binding (initialization.hpp; sketch after this list)
   NeoN::initialize() now uses MPI_Comm_split_type(MPI_COMM_TYPE_SHARED)
   to obtain the local rank within the node and calls
   InitializationSettings::set_device_id(localRank % num_devices) before
   Kokkos::initialize. Without this, all MPI ranks default to GPU 0 on
   multi-GPU nodes, leaving GPUs 1..N-1 idle and causing CUDA context
   contention. Skipped for serial runs, CPU-only builds, and when
   KOKKOS_DEVICE_ID is already set.

2. GPU-aware MPI diagnostic switch (boundaryData.hpp + CMakeLists)
   communicateBoundaryData already passes device pointers to MPI_Alltoallv,
   which is correct for CUDA/ROCm-aware MPI. Added compile-time guard
   -DNeoN_MPI_HOST_STAGE=ON that stages halo exchanges through host memory
   (device->host->MPI->host->device via Vector::copyToHost/copyToExecutor).
   Enables correctness verification on systems where MPI is not GPU-aware.
   Default is OFF; production GPU runs use the GPU-aware device-pointer path.
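
A sketch of the device binding in item 1 using stock MPI and Kokkos calls; error handling and the serial-run, CPU-only, and KOKKOS_DEVICE_ID guards mentioned above are omitted:

```cpp
#include <mpi.h>
#include <Kokkos_Core.hpp>

// numDevices would come from the runtime (e.g. the GPU device count).
void bindDeviceAndInitKokkos(int numDevices)
{
    // Ranks sharing a node get consecutive local ranks via a shared-memory split.
    MPI_Comm nodeComm = MPI_COMM_NULL;
    MPI_Comm_split_type(
        MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &nodeComm);
    int localRank = 0;
    MPI_Comm_rank(nodeComm, &localRank);
    MPI_Comm_free(&nodeComm);

    // Round-robin node-local ranks over the visible GPUs; without this,
    // every rank defaults to device 0.
    Kokkos::InitializationSettings settings;
    settings.set_device_id(localRank % numDevices);
    Kokkos::initialize(settings);
}
```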
…MPI_HOST_STAGE

GPU-01 (deferred at initialization): std::abs, std::sqrt, std::max are host-only
functions that produce undefined behavior or silent garbage in NEON_LAMBDA GPU
kernels. basicGeometryScheme.cpp: all NEON_LAMBDA blocks now use Kokkos::abs and
Kokkos::max. vec3.hpp::mag uses Kokkos::sqrt. scalar.hpp::mag uses Kokkos::abs.
Adds Kokkos_MathematicalFunctions.hpp include where needed.
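
A sketch of the GPU-01 rule, with a plain KOKKOS_LAMBDA standing in for NEON_LAMBDA:

```cpp
#include <Kokkos_Core.hpp>
#include <Kokkos_MathematicalFunctions.hpp>

// Device kernels must use Kokkos math functions; the std:: equivalents are
// host-only and can yield undefined behavior or silent garbage on the GPU.
void magnitudes(Kokkos::View<const double*> x, Kokkos::View<double*> out)
{
    Kokkos::parallel_for(
        "mag", x.extent(0), KOKKOS_LAMBDA(const int i) {
            out(i) = Kokkos::sqrt(Kokkos::abs(x(i))); // not std::sqrt / std::abs
        });
}
```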

Remove NeoN_MPI_HOST_STAGE option (revert of previous commit): user confirmed
CUDA-aware MPI is available on HPC; the host-staging diagnostic path is not needed.
Kokkos device ID binding (initialization.hpp) is retained.
… distributed MPI

Adds a CMake option NeoN_GINKGO_HOST_STAGE (default OFF) that compiles
NEON_GINKGO_HOST_STAGE=1 and sets forceHostBuffer=true in both the scalar and
Vec3 Ginkgo distributed solver paths. This stages all Ginkgo internal MPI
communications through host memory, bypassing GPU-direct MPI.

Use -DNeoN_GINKGO_HOST_STAGE=ON to test whether Ginkgo's CUDA-aware MPI path
is causing GPU divergence without this switch. Production GPU runs should
leave this OFF.
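
What the compile-time guard boils down to (the forceHostBuffer name comes from the message above; the surrounding code is illustrative):

```cpp
// Selected at configure time via -DNeoN_GINKGO_HOST_STAGE=ON/OFF.
#if NEON_GINKGO_HOST_STAGE
constexpr bool forceHostBuffer = true;  // stage Ginkgo MPI comms through host memory
#else
constexpr bool forceHostBuffer = false; // GPU-direct (CUDA/ROCm-aware) MPI path
#endif
```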
Review comment threads on include/NeoN/core/primitives/scalar.hpp, CMakeLists.txt, and include/CMakeLists.txt.
greole force-pushed the stack/distributed branch from 1e53239 to 71b002c on May 13, 2026 16:08
greole force-pushed the fix/testsRebase branch from 8f157b5 to 0ccf7b8 on May 13, 2026 19:08
greole force-pushed the stack/distributed branch 2 times, most recently from 25da0df to 5b427dc on May 13, 2026 19:24
greole force-pushed the stack/distributed branch from 5b427dc to c8ef4d4 on May 13, 2026 19:46
