Get distributed case to run without proc discontinuities. #501

Open — HendriceH wants to merge 90 commits into
Conversation
Additional changes:

- Add a test to the environment if no MPI is available
- set mpi_home on intel to avoid IMPI
- wip: add processor boundary
- wip: stacktracing
- wip: dist logging support
- wip: add processor boundary
- add init tests
- fix test linkage
- build Ginkgo with MPI support
- wip: distributed gko matrix
- fix test linkage
- wip: store neighbour rank in boundaryMesh
- wip: distributed matrix class
- MPI-01: computeCommunicationPattern and communicateBoundaryData (scalar + Vec3) now derive recvCounts via MPI_Alltoall instead of reusing sendCounts (see the sketch below)
- MPI-02: correctBoundaryConditions syncs the boundaryData().value() proc-tail into the internalVector() proc-face slots after the halo exchange (syncProcFaceInternalVector)
- MPI-03: the recvBuffer in both communicateBoundaryData overloads is sized to the actual receive count (totalRecv), not the full boundaryData.size()
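A minimal sketch of the MPI-01 pattern, assuming sendCounts holds one entry per peer rank (the helper name is hypothetical; only MPI_Alltoall itself comes from the commit text): each rank publishes how much it will send, and the all-to-all transposes that into how much it will receive.

```cpp
#include <mpi.h>
#include <vector>

// Hypothetical helper illustrating MPI-01: recvCounts must be derived from the
// peers' sendCounts via MPI_Alltoall; reusing the local sendCounts is only
// correct when the communication pattern happens to be symmetric.
std::vector<int> deriveRecvCounts(const std::vector<int>& sendCounts, MPI_Comm comm)
{
    std::vector<int> recvCounts(sendCounts.size(), 0);
    // Each rank contributes one int per peer; entry i of recvCounts ends up
    // holding how many entries rank i will send to this rank.
    MPI_Alltoall(
        sendCounts.data(), 1, MPI_INT,
        recvCounts.data(), 1, MPI_INT,
        comm);
    return recvCounts;
}
```

These counts then feed the displacements for the subsequent MPI_Alltoallv halo exchange.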
…ormula

Task 1 — N-rank implementation:
- Replace the rank==0/1/2 if/else chain (3-rank-only) with closed-form formulas (see the sketch after this block)
- firstFace = r*(localFaces+1) for all ranks
- Rank 0: firstBoundaryFace = R*(localFaces+1)-1, secondBoundaryFace = localFaces
- Middle ranks: firstBoundaryFace = r*(localFaces+1)-1, secondBoundaryFace = r*(localFaces+1)+localFaces
- Last rank: firstBoundaryFace = R*(localFaces+1), secondBoundaryFace = (R-1)*(localFaces+1)-1
- Remove the 'FIXME this only works for 3 ranks' comment
- Preserve the 3-rank oracle (partitioning.cpp:218-230); hand-verify the R=2 and R=4 oracles

Task 2 — Doxygen doc block:
- Replace the single-line @brief with a comprehensive derivation doc
- Embed the closed-form formulas, the sign-flip rule, and the R=2/3/4 oracle tables
- Reference 02-02-SUMMARY.md and the 02-RESEARCH.md NRANK-01 section for traceability
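A direct transcription of the Task 1 formulas into code, purely for illustration (the struct and function names are hypothetical, not the NeoN API; r is the current rank, R the total rank count, and each rank owns localFaces internal faces):

```cpp
#include <cstdint>

// Hypothetical transcription of the closed-form indexing from the commit text.
struct ProcFaceIndices
{
    std::int64_t firstFace;
    std::int64_t firstBoundaryFace;
    std::int64_t secondBoundaryFace;
};

ProcFaceIndices computeProcFaceIndices(std::int64_t r, std::int64_t R, std::int64_t localFaces)
{
    ProcFaceIndices idx{};
    idx.firstFace = r * (localFaces + 1); // same expression for every rank

    if (r == 0)
    {
        idx.firstBoundaryFace = R * (localFaces + 1) - 1;
        idx.secondBoundaryFace = localFaces;
    }
    else if (r == R - 1) // last rank
    {
        idx.firstBoundaryFace = R * (localFaces + 1);
        idx.secondBoundaryFace = (R - 1) * (localFaces + 1) - 1;
    }
    else // middle ranks
    {
        idx.firstBoundaryFace = r * (localFaces + 1) - 1;
        idx.secondBoundaryFace = r * (localFaces + 1) + localFaces;
    }
    return idx;
}
```

Checking r = 0, 1, 2 against the preserved 3-rank oracle in partitioning.cpp is the cheapest sanity test before trusting the hand-verified R=2 and R=4 cases.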
…xDist

The non-local COO matrix was sliced to numNonLocalElements (unique ghost cells after imap deduplication) instead of nRecv (total proc-face count). When a ghost cell is adjacent to multiple proc faces, nRecv > numNonLocalElements and the trailing proc-face contributions were silently dropped, making the distributed matrix structurally incomplete.

Fix: use nRecv for the vals/rows/cols arrays and pass all nRecv column IDs through map_to_local. Duplicate global ghost-cell IDs correctly fold to the same local non-local column index, so the COO matrix gains the missing contributions without changing the non-local column-space size.
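A schematic of the sizing fix, under the assumption that map_to_local behaves like a first-seen global-ID-to-local-column map (modelled here with std::unordered_map; the function and argument names are illustrative):

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Schematic of the fix: size the non-local COO arrays by nRecv (one entry per
// proc face) and let duplicate global ghost-cell IDs fold onto the same local
// non-local column, so no proc-face contribution is dropped.
struct NonLocalCoo
{
    std::vector<double> vals;
    std::vector<std::int64_t> rows;
    std::vector<std::int64_t> cols;
};

NonLocalCoo buildNonLocalCoo(
    const std::vector<std::int64_t>& ownerCells,     // local row per proc face, size nRecv
    const std::vector<std::int64_t>& ghostGlobalIds, // global ghost-cell ID per proc face, size nRecv
    const std::vector<double>& offDiagCoeffs)        // coupling coefficient per proc face, size nRecv
{
    const std::size_t nRecv = ghostGlobalIds.size();
    NonLocalCoo coo;
    coo.vals.resize(nRecv); // previously sized to numNonLocalElements, which is
    coo.rows.resize(nRecv); // too small when a ghost cell touches several proc faces
    coo.cols.resize(nRecv);

    std::unordered_map<std::int64_t, std::int64_t> mapToLocal; // global ghost ID -> local column
    for (std::size_t f = 0; f < nRecv; ++f)
    {
        auto it = mapToLocal
                      .try_emplace(ghostGlobalIds[f], static_cast<std::int64_t>(mapToLocal.size()))
                      .first;
        coo.vals[f] = offDiagCoeffs[f];
        coo.rows[f] = ownerCells[f];
        coo.cols[f] = it->second; // duplicates fold to the same non-local column index
    }
    return coo;
}
```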
…overload

- faceToMatrixAddress.hpp/cpp: replace three FIXME comments with verified explanations (diagOffset is correct in distributed mode; communication is already done via commPattern.recvIdx; procColIdx holds global ghost IDs)
- linearSystem.hpp: document that the single-arg overload's atomic RHS subtraction is disabled (pending consistent serial/parallel semantics); add a two-arg overload that accepts procFaceGhostValues and subtracts D_f * x_G from the RHS for each proc-face non-local coupling (see the sketch below)
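An illustrative sketch of what the two-arg overload does to the RHS (not the NeoN signature; the argument names are stand-ins): for every proc face the off-diagonal coupling D_f multiplies the ghost value x_G received from the neighbour rank, and that product is moved to the right-hand side so the local solve only sees local unknowns.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper mirroring the two-arg overload's effect: subtract
// D_f * x_G from the owner cell's RHS entry for each proc-face coupling.
void subtractProcFaceCoupling(
    std::vector<double>& rhs,                       // local RHS, one entry per local cell
    const std::vector<std::size_t>& ownerCells,     // local owner cell per proc face
    const std::vector<double>& offDiagCoeffs,       // D_f per proc face
    const std::vector<double>& procFaceGhostValues) // x_G per proc face, from the halo exchange
{
    for (std::size_t f = 0; f < procFaceGhostValues.size(); ++f)
    {
        rhs[ownerCells[f]] -= offDiagCoeffs[f] * procFaceGhostValues[f];
    }
}
```

In the real overload the ghost values arrive via the halo exchange described in the MPI-01/MPI-03 fixes above.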
…or NeoN distributed tests

Registers the operator and partitioning distributed test binaries at MPI sizes 2 (ENABLED — the synthetic mesh fixtures require no external decomposition) and 4 (DISABLED — a placeholder for Phase 5 once 4-rank fixtures exist). Resolves the NRANK-01 verification_derivation warning for R=2 coverage.
…ance

- Replace the bare unconditional {…} proc-face block with a guarded `if (nProcBoundaryFaces > 0)` block that calls exchangeProcOwnerDistance and writes `1/(d_own + d_nei)` instead of `1/|cf - c_own|` (see the sketch after this list)
- Remove [[maybe_unused]] from both the exec and deltaCoeffs parameters; both are now actually used (exec goes into exchangeProcOwnerDistance and parallelFor; deltaCoeffs is accessed via internalVector().view())
- Mirrors the updateNonOrthDeltaCoeffs proc-face block (lines 357-387) exactly, dropping only the non-orthogonal approxCellToCell floor in favour of a simple ROOTVSMALL guard
- Uses the bm.cf()/sf()/magSf() compressed boundary-tail accessors throughout; avoids mesh_.faceCentres()/faceAreas()/magFaceAreas() (the OpenFOAM full-size arrays)
- Fixes the asymmetric Laplacian at processor-face decomposition cuts on graded meshes; a precondition for L∞ < 1e-6 vs serial in the Phase 5 cylinder3D run
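A simplified, host-side sketch of the guarded proc-face block (the real code runs through exec/parallelFor and the bm.cf()/sf()/magSf() accessors; the container types and the dOwn/dNei arrays below are illustrative stand-ins for what exchangeProcOwnerDistance provides):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Illustrative sketch: update delta coefficients at processor-boundary faces
// using the full owner-to-neighbour distance across the decomposition cut.
void updateProcFaceDeltaCoeffs(
    std::vector<double>& deltaCoeffs,            // per-face delta coefficients
    const std::vector<std::size_t>& procFaceIds, // face indices of proc-boundary faces
    const std::vector<double>& dOwn,             // |cf - c_own| on this rank
    const std::vector<double>& dNei)             // |cf - c_nei| received from the neighbour rank
{
    constexpr double ROOTVSMALL = 1.0e-150; // guard against division by zero
    const std::size_t nProcBoundaryFaces = procFaceIds.size();

    if (nProcBoundaryFaces > 0) // guarded block: skipped entirely on serial runs
    {
        for (std::size_t i = 0; i < nProcBoundaryFaces; ++i)
        {
            // 1/(d_own + d_nei) spans the full owner-to-neighbour distance,
            // whereas 1/|cf - c_own| only covered the owner side and made the
            // Laplacian asymmetric at cuts on graded meshes.
            deltaCoeffs[procFaceIds[i]] = 1.0 / std::max(dOwn[i] + dNei[i], ROOTVSMALL);
        }
    }
}
```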
…ectBoundaryConditions

The MPI-02 block copied RECEIVED ghost values from the boundaryData().value() proc-tail into the internalVector() proc-face slots after the Alltoallv exchange. This inverted the sign convention that the operators depend on:

- computeUpwindInterpolationWeights reads faceFlux.internalVector()[procFaceIdx] expecting the LOCAL phi (positive = leaving the local cell). After the sync it received the NEIGHBOR's phi, which has the opposite sign, so the weights were inverted.
- computeDivProcBoundImpl reads faceFluxV[facei] expecting the LOCAL flux for F_local, diagContrib, and offDiagContrib. After the sync the wrong F_local produced wrong matrix entries and wrong non-local RHS coupling.
- surfaceIntegrate reads internalVector() for proc faces expecting the LOCAL flux.

Fix: remove the parallelFor + fence block. The internalVector() proc-face slots now retain LOCAL values throughout. Ghost values (received from the neighbour after the exchange) remain in the boundaryData().value() proc-tail, which is the correct location for the operators that need them (computeLinearInterpolation, computeUpwindInterpolation).
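Two tiny illustrations of the storage convention the fix restores (the containers are stand-ins for internalVector() and the boundaryData().value() proc-tail; the function names are hypothetical):

```cpp
#include <cstddef>
#include <vector>

// Proc-face slots of the internal vector hold the LOCAL flux; the neighbour's
// (ghost) values live only in the boundary-data proc-tail.
double upwindWeightAtProcFace(
    const std::vector<double>& internalVector, // LOCAL phi per face, incl. proc-face slots
    std::size_t procFaceIdx)
{
    // Positive local flux means the flow leaves the local cell, so the local
    // (owner-side) value is upwind. Writing the neighbour's phi here instead
    // flips the sign and therefore the weight.
    return internalVector[procFaceIdx] >= 0.0 ? 1.0 : 0.0;
}

double linearInterpolateAtProcFace(
    const std::vector<double>& ownerCellValues,   // local cell values
    const std::vector<double>& boundaryDataValue, // ghost values in the proc-tail
    std::size_t ownerCell,
    std::size_t procTailIdx,
    double weight)
{
    // Operators that need the neighbour's value read it from the proc-tail,
    // not from the internal vector.
    return weight * ownerCellValues[ownerCell] + (1.0 - weight) * boundaryDataValue[procTailIdx];
}
```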
Two GPU-parallel correctness fixes:

1. Kokkos device binding (initialization.hpp): NeoN::initialize() now uses MPI_Comm_split_type(MPI_COMM_TYPE_SHARED) to obtain the local rank within the node and calls InitializationSettings::set_device_id(localRank % num_devices) before Kokkos::initialize. Without this, all MPI ranks default to GPU 0 on multi-GPU nodes, leaving GPUs 1..N-1 idle and causing CUDA context contention. Skipped for serial runs, CPU-only builds, and when KOKKOS_DEVICE_ID is already set. (A sketch of the binding step follows below.)

2. GPU-aware MPI diagnostic switch (boundaryData.hpp + CMakeLists): communicateBoundaryData already passes device pointers to MPI_Alltoallv, which is correct for CUDA/ROCm-aware MPI. Added a compile-time guard -DNeoN_MPI_HOST_STAGE=ON that stages halo exchanges through host memory (device->host->MPI->host->device via Vector::copyToHost/copyToExecutor). This enables correctness verification on systems where MPI is not GPU-aware. Default is OFF; production GPU runs use the GPU-aware device-pointer path.
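A minimal sketch of the device-binding step from item 1, assuming MPI is already initialized and numDevices is obtained elsewhere (e.g. from cudaGetDeviceCount or the resource manager); this is not the NeoN::initialize() code itself:

```cpp
#include <mpi.h>
#include <Kokkos_Core.hpp>

// Bind each rank to a GPU by node-local rank, round-robin over the visible
// devices; without this every rank lands on device 0.
void bindRankToDevice(int numDevices)
{
    // Ranks sharing a node see the same GPUs, so split the world communicator
    // by shared-memory domain and use the node-local rank for the binding.
    MPI_Comm nodeComm = MPI_COMM_NULL;
    int worldRank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &worldRank);
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, worldRank, MPI_INFO_NULL, &nodeComm);

    int localRank = 0;
    MPI_Comm_rank(nodeComm, &localRank);
    MPI_Comm_free(&nodeComm);

    Kokkos::InitializationSettings settings;
    settings.set_device_id(localRank % numDevices);
    Kokkos::initialize(settings);
}
```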
…MPI_HOST_STAGE

GPU-01 (deferred at initialization): std::abs, std::sqrt, and std::max are host-only functions that produce undefined behavior or silent garbage in NEON_LAMBDA GPU kernels. basicGeometryScheme.cpp: all NEON_LAMBDA blocks now use Kokkos::abs and Kokkos::max. vec3.hpp::mag uses Kokkos::sqrt. scalar.hpp::mag uses Kokkos::abs. Adds the Kokkos_MathematicalFunctions.hpp include where needed.

Remove the NeoN_MPI_HOST_STAGE option (revert of the previous commit): the user confirmed CUDA-aware MPI is available on the HPC system, so the host-staging diagnostic path is not needed. The Kokkos device-ID binding (initialization.hpp) is retained.
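A standalone illustration of the GPU-01 rule in plain Kokkos (NEON_LAMBDA is NeoN's wrapper around the same idea; the function and view names here are made up): device kernels must call the Kokkos math functions, not the std:: ones.

```cpp
#include <Kokkos_Core.hpp>

// Illustrative kernel: compute a clamped inverse distance on the device.
void clampedInverseDistance(Kokkos::View<double*> dist, Kokkos::View<double*> coeffs)
{
    constexpr double ROOTVSMALL = 1.0e-150;
    Kokkos::parallel_for(
        "clampedInverseDistance", dist.extent(0), KOKKOS_LAMBDA(const int i) {
            // Kokkos::abs / Kokkos::max have device overloads; std::abs /
            // std::max are host-only and would be UB or silent garbage here.
            coeffs(i) = 1.0 / Kokkos::max(Kokkos::abs(dist(i)), ROOTVSMALL);
        });
}
```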
… distributed MPI

Adds a CMake option NeoN_GINKGO_HOST_STAGE (default OFF) that compiles with NEON_GINKGO_HOST_STAGE=1 and sets forceHostBuffer=true in both the scalar and Vec3 Ginkgo distributed solver paths. This stages all of Ginkgo's internal MPI communications through host memory, bypassing GPU-direct MPI. Use -DNeoN_GINKGO_HOST_STAGE=ON to test whether Ginkgo's CUDA-aware MPI path (used when the switch is OFF) is the cause of GPU divergence. Production GPU runs should leave this OFF.
greole reviewed May 13, 2026

greole reviewed May 13, 2026

greole reviewed May 13, 2026
25da0df to 5b427dc
0ccf7b8 to 6932198
A long clause session with the gsd skill to find the issue with processor discontinuities.