fix all remaining host-write-to-device-memory bugs in HYPRE solvers#270
Merged
jameslehoux merged 3 commits intomasterfrom May 6, 2026
Merged
fix all remaining host-write-to-device-memory bugs in HYPRE solvers#270jameslehoux merged 3 commits intomasterfrom
jameslehoux merged 3 commits intomasterfrom
Conversation
added 3 commits
May 6, 2026 09:48
CI clang-format check tripped on the line in 64306d4: python/bindings/module.cpp:173:30: error: code should be clang-formatted [-Wclang-format-violations] The 100-column LLVM-based style preferred wrapping the assignment after the `=`, so the three static_casts sit on a single continuation line rather than being broken in the middle. Ran `clang-format -i` against the file and confirmed it's now idempotent under the project's .clang-format settings. No behavioural change. https://claude.ai/code/session_011dJ5Bwq4Tnr8wxH597XJFf
Reported on Colab T4 with openimpala-cuda 4.2.10: notebook §3 still
crashes the kernel, this time at PercolationCheck construction. The
silent crash from §3 dies inside parallelFloodFill — specifically at
the seed-planting phase 1, FloodFill.cpp:141-147:
for (const auto& seed : seedPoints) {
if (tileBox.contains(seed)) {
if (phase_arr(seed) == phaseID) {
mask_arr(seed, 0) = label; // <-- host-side write to
// device-resident memory
}
}
}
Same bug pattern as VoxelImage.from_numpy in 64306d4: a host-side
loop writes through an Array4<int> view that points at iMultiFab data
the AMReX CUDA build keeps in device memory. Reads through the view
also fault, but it's the writes that consistently kill the kernel.
Fix: keep the host-side per-tile filter (seedPoints can have many
out-of-tile entries on multi-rank decompositions, so it's worth the
short list) but stage the in-tile seeds in a Gpu::DeviceVector and
plant them via amrex::ParallelFor with an AMREX_GPU_DEVICE lambda.
On CPU builds the DeviceVector / copyAsync paths short out via the
#ifdef AMREX_USE_GPU guards and tile_seeds.data() is used directly,
so the change is a no-op for the CPU wheel.
streamSynchronize() after the per-tile copyAsync is needed because
the next iteration of the MFIter may submit another launch while
this one is still in flight; the trailing global streamSynchronize
ensures all planted seeds are visible before phase 2 (FillBoundary +
wavefront expansion) starts.
Also fixed VolumeFraction by inspection — confirmed it already uses
ReduceOps with AMREX_GPU_DEVICE lambda (no host-write pattern).
Verified clang-format passes idempotently.
This needs another release tag (4.2.11) before the user can run
PercolationCheck / oi.tortuosity from Colab on a GPU runtime.
https://claude.ai/code/session_011dJ5Bwq4Tnr8wxH597XJFf
After patching VoxelImage.from_numpy (64306d4) and FloodFill seed planting (61cf635), an audit of the rest of src/props/ surfaced four more sites with the same pattern: host code writing through an Array4 view that points at device-resident iMultiFab data on CUDA builds. All would have segfaulted on T4 / A100 / etc. once the user got past the earlier crash points. CRITICAL — fires on every solve in the oi.tortuosity hot path: TortuosityHypre.cpp:1137 — flux-calc solution writeback. Every call to solver.value() reads HYPRE's solution into a host buffer with HYPRE_StructVectorGetBoxValues, then a LoopOnCpu copies into mf_soln_temp. On GPU the destination Array4 lives on device → segfault. EffectiveDiffusivityHypre.cpp:721 — same pattern in getChiSolution. Every oi.effective_diffusivity solve hits this on GPU. EffectiveDiffusivityHypre.cpp:283 — generateActiveMask wrote mask_arr(i,j,k,...) and read dc_arr(i,j,k,0) inside a LoopOnCpu, which segfaults on both ends. Replaced with a ParallelFor for the mask write plus a ReduceOps<Sum,Sum,Sum> for the three debug counters; the std::atomic<long> counters that the LoopOnCpu was incrementing are unnecessary now that the reduction is one-shot. EffectiveDiffusivityHypre.cpp:532 — pin-cell search read mask_arr from host. Replaced with a ReduceOps<Min> over the linearised cell index of active cells (sentinel = LONG_MAX for inactive); the winning index is unpacked back to (i,j,k) on host. NON-CRITICAL — fires only on opt-in paths, fixed for completeness: TortuosityHypre.cpp:660 — plotfile writeback (write_plotfile=True only) TortuosityHypre.cpp:709 — failed-solve plotfile (write_plotfile=True AND solver did not converge) Same staging recipe everywhere: if AMREX_USE_GPU, copy the host buffer into a Gpu::DeviceVector first and ParallelFor with a manually-computed linear index lin = (k - lo.z)*nx*ny + (j - lo.y)*nx + (i - lo.x); else just point src_ptr at the host buffer and the same ParallelFor expands to a serial/OMP host loop. Trailing streamSynchronize after each MFIter loop ensures all writes are complete before the next downstream operation (FillBoundary, FlushBoundary, plotfile writer). Verified clang-format passes idempotently. Together with 64306d4 (VoxelImage) and 61cf635 (FloodFill) this should take the GPU build from "segfaults at every entry point" to "fully functional"; oi.tortuosity, oi.effective_diffusivity, oi.percolation_check, and oi.volume_fraction should all run end-to-end on Colab T4 once 4.2.12 is published. Skipped intentionally: TortuosityHypre.cpp:984 — checkMatrixProperties() debug-only function requires a host-side iMultiFab copy first; not in any standard call path TortuosityDirect.cpp — legacy Forward Euler solver, deprecated path https://claude.ai/code/session_011dJ5Bwq4Tnr8wxH597XJFf
Performance Benchmark Results
Fastest solver: gmres at 64³ (0.4299s) Benchmark: uniform block (analytical τ = (N-1)/N) |
Code Coverage ReportGenerated by CI — coverage data from gcovr |
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
After patching VoxelImage.from_numpy (64306d4) and FloodFill seed planting
(61cf635), an audit of the rest of src/props/ surfaced four more sites with
the same pattern: host code writing through an Array4 view that points at
device-resident iMultiFab data on CUDA builds. All would have segfaulted on
T4 / A100 / etc. once the user got past the earlier crash points.
CRITICAL — fires on every solve in the oi.tortuosity hot path:
TortuosityHypre.cpp:1137 — flux-calc solution writeback. Every call to
solver.value() reads HYPRE's solution into a host buffer with
HYPRE_StructVectorGetBoxValues, then a LoopOnCpu copies into mf_soln_temp.
On GPU the destination Array4 lives on device → segfault.
EffectiveDiffusivityHypre.cpp:721 — same pattern in getChiSolution. Every
oi.effective_diffusivity solve hits this on GPU.
EffectiveDiffusivityHypre.cpp:283 — generateActiveMask wrote
mask_arr(i,j,k,...) and read dc_arr(i,j,k,0) inside a LoopOnCpu, which
segfaults on both ends. Replaced with a ParallelFor for the mask write
plus a ReduceOps<Sum,Sum,Sum> for the three debug counters; the
std::atomic counters that the LoopOnCpu was incrementing are
unnecessary now that the reduction is one-shot.
EffectiveDiffusivityHypre.cpp:532 — pin-cell search read mask_arr from
host. Replaced with a ReduceOps over the linearised cell index of
active cells (sentinel = LONG_MAX for inactive); the winning index is
unpacked back to (i,j,k) on host.
NON-CRITICAL — fires only on opt-in paths, fixed for completeness:
TortuosityHypre.cpp:660 — plotfile writeback (write_plotfile=True only)
TortuosityHypre.cpp:709 — failed-solve plotfile (write_plotfile=True
AND solver did not converge)
Same staging recipe everywhere: if AMREX_USE_GPU, copy the host buffer
into a Gpu::DeviceVector first and ParallelFor with a manually-computed
linear index lin = (k - lo.z)nxny + (j - lo.y)*nx + (i - lo.x); else
just point src_ptr at the host buffer and the same ParallelFor expands
to a serial/OMP host loop. Trailing streamSynchronize after each MFIter
loop ensures all writes are complete before the next downstream operation
(FillBoundary, FlushBoundary, plotfile writer).
Verified clang-format passes idempotently.
Together with 64306d4 (VoxelImage) and 61cf635 (FloodFill) this should
take the GPU build from "segfaults at every entry point" to "fully
functional"; oi.tortuosity, oi.effective_diffusivity, oi.percolation_check,
and oi.volume_fraction should all run end-to-end on Colab T4 once 4.2.12
is published.
Skipped intentionally:
TortuosityHypre.cpp:984 — checkMatrixProperties() debug-only function
requires a host-side iMultiFab copy first; not
in any standard call path
TortuosityDirect.cpp — legacy Forward Euler solver, deprecated path