fix all remaining host-write-to-device-memory bugs in HYPRE solvers by jameslehoux · Pull Request #270 · BASE-Laboratory/OpenImpala

jameslehoux · 2026-05-06T15:34:37Z

After patching VoxelImage.from_numpy (64306d4) and FloodFill seed planting
(61cf635), an audit of the rest of src/props/ surfaced four more hot-path
sites, plus two on opt-in paths, with the same pattern: host code writing
through an Array4 view that points at device-resident iMultiFab data on CUDA
builds. All would have segfaulted on T4 / A100 / etc. once the user got past
the earlier crash points.

CRITICAL — fires on every solve in the oi.tortuosity hot path:

TortuosityHypre.cpp:1137 — flux-calc solution writeback. Every call to
solver.value() reads HYPRE's solution into a host buffer with
HYPRE_StructVectorGetBoxValues, then a LoopOnCpu copies into mf_soln_temp.
On GPU the destination Array4 lives on device → segfault.

EffectiveDiffusivityHypre.cpp:721 — same pattern in getChiSolution. Every
oi.effective_diffusivity solve hits this on GPU.

EffectiveDiffusivityHypre.cpp:283 — generateActiveMask wrote
mask_arr(i,j,k,...) and read dc_arr(i,j,k,0) inside a LoopOnCpu, which
segfaults on both ends. Replaced with a ParallelFor for the mask write
plus a ReduceOps<Sum,Sum,Sum> for the three debug counters; the
std::atomic<long> counters that the LoopOnCpu was incrementing are
unnecessary now that the reduction is one-shot.

EffectiveDiffusivityHypre.cpp:532 — pin-cell search read mask_arr from
host. Replaced with a ReduceOps<Min> over the linearised cell index of
active cells (sentinel = LONG_MAX for inactive); the winning index is
unpacked back to (i,j,k) on host.
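
The pin-cell search above can be illustrated with a minimal host-only sketch. It is not the code from this PR: the names first_active_index, unpack_index and Int3 are hypothetical, the serial loop stands in for the ReduceOps<Min> reduction, and the x-fastest layout is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <climits>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a 3D integer index; not an AMReX type.
struct Int3 { int x, y, z; };

// Min-reduce over linearised indices: active cells contribute their own
// index, inactive cells contribute the sentinel LONG_MAX.
// Returns LONG_MAX when the mask has no active cell.
long first_active_index(const std::vector<int>& mask) {
    long best = LONG_MAX;
    for (std::size_t n = 0; n < mask.size(); ++n) {
        const long candidate = (mask[n] != 0) ? static_cast<long>(n) : LONG_MAX;
        best = std::min(best, candidate);
    }
    return best;
}

// Unpack a linear index (x fastest, then y, then z) back to (i,j,k),
// offset by the lower corner `lo` of the box it was computed over.
Int3 unpack_index(long lin, Int3 lo, int nx, int ny) {
    assert(lin >= 0 && lin != LONG_MAX && nx > 0 && ny > 0);
    const long plane = static_cast<long>(nx) * ny;
    const int k = static_cast<int>(lin / plane);
    const long rem = lin % plane;
    const int j = static_cast<int>(rem / nx);
    const int i = static_cast<int>(rem % nx);
    return Int3{lo.x + i, lo.y + j, lo.z + k};
}
```

Because every inactive cell maps to the sentinel, a result of LONG_MAX means "no active cell found", and the caller has to check for that before unpacking.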

NON-CRITICAL — fires only on opt-in paths, fixed for completeness:

TortuosityHypre.cpp:660 — plotfile writeback (write_plotfile=True only)
TortuosityHypre.cpp:709 — failed-solve plotfile (write_plotfile=True
AND solver did not converge)

Same staging recipe everywhere: if AMREX_USE_GPU is defined, copy the host
buffer into a Gpu::DeviceVector first and ParallelFor with a manually-computed
linear index lin = (k - lo.z)*nx*ny + (j - lo.y)*nx + (i - lo.x); otherwise
just point src_ptr at the host buffer, and the same ParallelFor expands
to a serial/OMP host loop. A trailing streamSynchronize after each MFIter
loop ensures all writes are complete before the next downstream operation
(FillBoundary, FlushBoundary, plotfile writer).
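
As a rough illustration of the linear index used in the staging recipe, here is a plain C++ sketch with no AMReX dependency. Int3, TileBox, num_cells, linear_index and scatter are hypothetical names, and the triple loop stands in for the ParallelFor body; only the index formula itself comes from the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for amrex::Dim3 / amrex::Box (inclusive bounds).
struct Int3 { int x, y, z; };
struct TileBox { Int3 lo, hi; };

inline std::size_t num_cells(const TileBox& b) {
    return static_cast<std::size_t>(b.hi.x - b.lo.x + 1) *
           static_cast<std::size_t>(b.hi.y - b.lo.y + 1) *
           static_cast<std::size_t>(b.hi.z - b.lo.z + 1);
}

// lin = (k - lo.z)*nx*ny + (j - lo.y)*nx + (i - lo.x), x fastest.
inline std::size_t linear_index(const TileBox& b, int i, int j, int k) {
    assert(i >= b.lo.x && i <= b.hi.x && j >= b.lo.y && j <= b.hi.y &&
           k >= b.lo.z && k <= b.hi.z);
    const std::size_t nx = static_cast<std::size_t>(b.hi.x - b.lo.x + 1);
    const std::size_t ny = static_cast<std::size_t>(b.hi.y - b.lo.y + 1);
    return static_cast<std::size_t>(k - b.lo.z) * nx * ny +
           static_cast<std::size_t>(j - b.lo.y) * nx +
           static_cast<std::size_t>(i - b.lo.x);
}

// Copy a buffer packed over `valid` into `dst`, which is laid out over the
// larger `grown` box (for example, a box that also holds ghost cells).
void scatter(const std::vector<double>& src, const TileBox& valid,
             std::vector<double>& dst, const TileBox& grown) {
    assert(src.size() == num_cells(valid));
    assert(dst.size() == num_cells(grown));
    for (int k = valid.lo.z; k <= valid.hi.z; ++k)
        for (int j = valid.lo.y; j <= valid.hi.y; ++j)
            for (int i = valid.lo.x; i <= valid.hi.x; ++i)
                dst[linear_index(grown, i, j, k)] = src[linear_index(valid, i, j, k)];
}
```

The source and destination use different strides whenever the two boxes differ, which is why the index into the staged buffer is computed by hand rather than reusing the destination's own indexing.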

Verified clang-format passes idempotently.

Together with 64306d4 (VoxelImage) and 61cf635 (FloodFill) this should
take the GPU build from "segfaults at every entry point" to "fully
functional"; oi.tortuosity, oi.effective_diffusivity, oi.percolation_check,
and oi.volume_fraction should all run end-to-end on Colab T4 once 4.2.12
is published.

Skipped intentionally:
  TortuosityHypre.cpp:984 (checkMatrixProperties()): debug-only function that
  requires a host-side iMultiFab copy first; not in any standard call path.
  TortuosityDirect.cpp: legacy Forward Euler solver, deprecated path.

The CI clang-format check tripped on a line in 64306d4:

    python/bindings/module.cpp:173:30: error: code should be clang-formatted [-Wclang-format-violations]

The 100-column LLVM-based style preferred wrapping the assignment after the `=`, so the three static_casts sit on a single continuation line rather than being broken in the middle. Ran `clang-format -i` against the file and confirmed it is now idempotent under the project's .clang-format settings. No behavioural change.

https://claude.ai/code/session_011dJ5Bwq4Tnr8wxH597XJFf

Reported on Colab T4 with openimpala-cuda 4.2.10: notebook §3 still crashes the kernel, this time at PercolationCheck construction.

The silent crash in §3 occurs inside parallelFloodFill, specifically in seed-planting phase 1 at FloodFill.cpp:141-147:

    for (const auto& seed : seedPoints) {
        if (tileBox.contains(seed)) {
            if (phase_arr(seed) == phaseID) {
                mask_arr(seed, 0) = label;  // <-- host-side write to
                                            //     device-resident memory
            }
        }
    }

Same bug pattern as VoxelImage.from_numpy in 64306d4: a host-side loop writes through an Array4<int> view that points at iMultiFab data the AMReX CUDA build keeps in device memory. Reads through the view also fault, but it is the writes that consistently kill the kernel.

Fix: keep the host-side per-tile filter (seedPoints can have many out-of-tile entries on multi-rank decompositions, so reducing it to a short list is worthwhile), but stage the in-tile seeds in a Gpu::DeviceVector and plant them via amrex::ParallelFor with an AMREX_GPU_DEVICE lambda. On CPU builds the DeviceVector / copyAsync paths are compiled out by the #ifdef AMREX_USE_GPU guards and tile_seeds.data() is used directly, so the change is a no-op for the CPU wheel.

streamSynchronize() after the per-tile copyAsync is needed because the next iteration of the MFIter may submit another launch while this one is still in flight; the trailing global streamSynchronize ensures all planted seeds are visible before phase 2 (FillBoundary + wavefront expansion) starts.

Also checked VolumeFraction by inspection: it already uses ReduceOps with an AMREX_GPU_DEVICE lambda (no host-write pattern).

Verified clang-format passes idempotently.

This needs another release tag (4.2.11) before the user can run PercolationCheck / oi.tortuosity from Colab on a GPU runtime.

https://claude.ai/code/session_011dJ5Bwq4Tnr8wxH597XJFf
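
The two-step shape of the seed-planting fix (filter on host, then plant from a compact list) can be sketched in plain C++. This is an illustration under stated assumptions, not the FloodFill.cpp code: Int3, TileBox, seeds_in_tile and plant_seeds are hypothetical, flat std::vector buffers stand in for the Array4 views, and the second loop is where the real fix would use a Gpu::DeviceVector and amrex::ParallelFor.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for amrex::IntVect / amrex::Box (inclusive bounds).
struct Int3 { int x, y, z; };
struct TileBox {
    Int3 lo, hi;
    bool contains(const Int3& p) const {
        return p.x >= lo.x && p.x <= hi.x && p.y >= lo.y && p.y <= hi.y &&
               p.z >= lo.z && p.z <= hi.z;
    }
    // Linear offset of p within this box, x fastest.
    std::size_t index(const Int3& p) const {
        const std::size_t nx = static_cast<std::size_t>(hi.x - lo.x + 1);
        const std::size_t ny = static_cast<std::size_t>(hi.y - lo.y + 1);
        return static_cast<std::size_t>(p.z - lo.z) * nx * ny +
               static_cast<std::size_t>(p.y - lo.y) * nx +
               static_cast<std::size_t>(p.x - lo.x);
    }
};

// Step 1 (host): keep only the seeds that fall inside this tile.
std::vector<Int3> seeds_in_tile(const std::vector<Int3>& seeds, const TileBox& tile) {
    std::vector<Int3> kept;
    for (const Int3& s : seeds)
        if (tile.contains(s)) kept.push_back(s);
    return kept;
}

// Step 2 (device in the real fix, a plain loop here): plant each staged seed
// whose cell has the requested phase. Returns how many seeds were planted.
int plant_seeds(const std::vector<Int3>& tile_seeds, const TileBox& tile,
                const std::vector<int>& phase, std::vector<int>& mask,
                int phaseID, int label) {
    int planted = 0;
    for (const Int3& s : tile_seeds) {
        assert(tile.contains(s));
        const std::size_t n = tile.index(s);
        if (phase[n] == phaseID) {
            mask[n] = label;
            ++planted;
        }
    }
    return planted;
}
```

Filtering first keeps the staged list short, so the per-tile host-to-device copy only carries seeds that can actually be planted in that tile.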


github-actions · 2026-05-06T15:38:00Z

Performance Benchmark Results

Size	Solver	Wall Time (s)	Tortuosity	Expected	Rel. Error	Iters	Status
64³	pcg	0.6898	0.984375	0.984375	0.00e+00	1	PASS
64³	flexgmres	0.4369	0.984375	0.984375	0.00e+00	N/A	PASS
64³	bicgstab	0.4344	0.984375	0.984375	0.00e+00	N/A	PASS
64³	gmres	0.4299	0.984375	0.984375	0.00e+00	N/A	PASS
128³	pcg	8.0556	0.992188	0.992188	0.00e+00	1	PASS
128³	flexgmres	5.7030	0.992188	0.992188	0.00e+00	N/A	PASS
128³	bicgstab	5.6681	0.992188	0.992188	0.00e+00	N/A	PASS
128³	gmres	5.6673	0.992188	0.992188	0.00e+00	N/A	PASS

Fastest solver: gmres at 64³ (0.4299s)

Benchmark: uniform block (analytical τ = (N-1)/N)
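
The Expected column can be cross-checked against the analytical expression quoted above. The helper name expected_tortuosity is hypothetical and is not part of the benchmark script.

```cpp
#include <cassert>

// Analytical tortuosity of the uniform-block benchmark: tau = (N - 1) / N,
// where N is the number of cells along one edge of the N^3 domain.
double expected_tortuosity(int n) {
    assert(n > 0);
    return static_cast<double>(n - 1) / static_cast<double>(n);
}
```

For N = 64 this gives 63/64 = 0.984375, and for N = 128 it gives 127/128 = 0.9921875, which the table reports to six decimals as 0.992188.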

github-actions · 2026-05-06T15:46:37Z

Code Coverage Report

------------------------------------------------------------------------------
                           GCC Code Coverage Report
Directory: .
------------------------------------------------------------------------------
File                                       Lines     Exec  Cover   Missing
------------------------------------------------------------------------------
src/io/CathodeWrite.cpp                       95       83    87%   40-41,97-100,115-116,182-185
src/io/CathodeWrite.H                          1        1   100%
src/io/DatReader.cpp                         135      105    77%   26-27,30,35,92-93,99-100,107-109,135-137,141,144-148,152-155,162,164,208-209,242,245
src/io/DatReader.H                             1        1   100%
src/io/HDF5Reader.cpp                        344       84    24%   40-41,43-44,46-49,52,54-56,58-59,62,64-66,68-74,92-93,126-128,144-145,154-157,174-180,182-187,204,213-215,217,219-228,230-233,236-238,240-251,253-258,266,266,266,266,266,266,266,270,270,270,270,270,270,270,274,276,278,280,282,288,290,297,297,297,297,297,297,297,301,301,301,301,301,301,301,305,305,305,305,305,305,305-306,306,306,306,306,306,306,309,309,309,309,309,309,309-310,310,310,310,310,310,310-311,311,311,311,311,311,311,313,313,313,313,313,313,313-314,314,314,314,314,314,314-315,315,315,315,315,315,315,319,319,319,319,319,319,319,324,324,324,324,324,324,324-325,325,325,325,325,325,325-326,326,326,326,326,326,326-327,327,327,327,327,327,327,332,332,332,332,332,332,332,337,337,337,337,337,337,337-338,338,338,338,338,338,338,343,343,343,343,343,343,343,350,350,350,350,350,350,350,357-358,432-435,437-440
src/io/HDF5Reader.H                            3        3   100%
src/io/ImageLoader.cpp                        61       42    68%   25,38,48,60-62,64-70,72,77,89-90,92,94
src/io/RawReader.cpp                         266      135    50%   49-50,89-90,111-112,115-117,120-121,140-142,155-157,166-168,174-177,185-186,192-196,200-204,209-212,219-224,231-237,271,273-274,276,283-284,301,312,314,318,325,327,331-334,338,346-347,353-355,361-363,365-366,369,372,374,377-380,382-384,386,388-389,391,393-394,396,398-399,401,403-404,406,410-411,413,417-418,420,425,465,471-472,521-524,538,540-542,544,546-548,558,562-564,566,588
src/io/RawReader.H                             1        1   100%
src/io/TiffReader.cpp                        384      130    33%   59-65,67-69,71-73,75-77,79-80,82-84,86-88,90-92,94-96,98-99,101-103,106-108,111-112,114-117,119,122,124-127,143-144,148-150,152-158,160,186,210,217,226,228-231,240,242-245,248,255,288-293,306,309-317,319-320,323-327,331-335,338-342,344-348,351-357,359-363,367,369,375-377,379-393,396,398-402,404-409,413-418,420-425,428-429,432-434,555-575,577-578,581-588,590,593-609,612-614,670,673-674,677-683,685,689-700,702-703
src/io/TiffReader.H                            5        5   100%
src/props/BoundaryCondition.H                131       74    56%   63,68,70,216,224-229,233-236,238-244,247-249,252-253,255,258-261,264-265,271-272,274-279,285-287,290-296,299,303,365-366,371,373
src/props/ConnectedComponents.cpp             69       67    97%   94-95
src/props/ConnectedComponents.H                4        4   100%
src/props/DeffTensor.cpp                      62       59    95%   122,128-129
src/props/Diffusion.cpp                      510      378    74%   93-94,97-98,103-104,106-116,118,123-132,134-141,144-150,153-157,159-163,165,168-173,175-177,179,182-184,186-187,190-191,193,195-198,200,202-203,288-289,297-298,300,349,359-360,368-371,373-375,404-413,415,453,461,465-467,526-527,533,535,539,547,581,610,638,646,735-736,739-740,757-760,771-772,774,824
src/props/EffDiffFillMtx.H                   120      106    88%   58,216-217,221-225,229,231-235
src/props/EffectiveDiffusivityHypre.cpp      413      372    90%   189-191,193-197,352-355,464,616-619,621-623,625-628,637-640,647,676,688-691,693-695,697,709,720,722
src/props/EffectiveDiffusivityHypre.H          7        7   100%
src/props/FloodFill.cpp                       89       86    96%   94-95,235
src/props/HypreStructSolver.cpp              343      210    61%   87-88,121,133-134,145,299,309,311,314,346,356,358,361,367-370,372-376,378-379,381-385,388-389,391-392,394,397-398,401-402,404-407,409-413,415-416,418-422,425-426,428-429,431,434-435,438-439,441-443,445-451,453-457,460-461,463-464,466,469-470,473,475-477,479-485,487-491,494-495,497-498,500,503-504,507,509-511,513-516,518-522,525-526,528-529,531,534-535,538,541-542,555
src/props/HypreStructSolver.H                  6        6   100%
src/props/MacroGeometry.H                     17       17   100%
src/props/ParticleSizeDistribution.cpp        11       11   100%
src/props/ParticleSizeDistribution.H           6        6   100%
src/props/PercolationCheck.cpp                53       46    86%   32-33,49-51,68,73
src/props/PercolationCheck.H                   4        4   100%
src/props/PhysicsConfig.H                     90       89    98%   150
src/props/ResultsJSON.H                      225      222    98%   242,395,416
src/props/REVStudy.cpp                       151      128    84%   72,83-91,159,170-173,175,183-186,188-190
src/props/SolverConfig.H                      32       20    62%   30,32,37-44,75-76
src/props/SpecificSurfaceArea.cpp             56       55    98%   59
src/props/SpecificSurfaceArea.H                6        6   100%
src/props/ThroughThicknessProfile.cpp         38       38   100%
src/props/ThroughThicknessProfile.H            5        5   100%
src/props/Tortuosity.H                         2        2   100%
src/props/TortuosityDirect.cpp               219      191    87%   81-83,86,100-106,113-114,125,134,140,202-209,226,394,424,433
src/props/TortuosityDirect.H                   5        5   100%
src/props/TortuosityHypre.cpp                794      566    71%   149-150,155-156,240-243,246-248,311,335-337,340-341,343,353-355,358-360,390-393,573,597,601,622,639-640,642-644,646-655,657,669,671-681,685-691,693-697,701-703,705-707,709-718,727,729-739,743-751,753-756,758,768,774-777,779-781,790-793,795-797,813,816-817,840-845,856-859,861,898,903-906,909-911,915-918,920,922-925,927,932-934,936,985,994,999,1002-1007,1023-1026,1040-1044,1049-1054,1064-1068,1073-1078,1083-1087,1090-1093,1100-1103,1114,1123,1125,1129,1131,1153,1199-1200,1286-1288,1414-1417
src/props/TortuosityHypre.H                   15       15   100%
src/props/TortuosityHypreFill.H              127       98    77%   85,203,205-212,237-239,241-245,247-248,250,252,255-256,258-262
src/props/TortuosityKernels.H                 97       53    54%   52,56-60,62-65,69-74,76-80,84-85,90,129,143,157,243,245-248,250-253,257-260,262-265
src/props/TortuosityMLMG.cpp                  99       91    91%   160,181-183,185-186,193,206
src/props/TortuosityMLMG.H                     1        1   100%
src/props/TortuositySolverBase.cpp           301      237    78%   70-72,74-75,94-101,104,106,142-145,200,203,205,255,280,298,327,391,394-396,398,406-409,411-417,422,427-429,435-436,438-440,454,460,464-465,467,478,492,496-498,500,502,506
src/props/TortuositySolverBase.H              13       13   100%
src/props/VolumeFraction.cpp                  25       25   100%
src/props/VolumeFraction.H                     4        4   100%
------------------------------------------------------------------------------
TOTAL                                       5446     3907    71%
------------------------------------------------------------------------------

Generated by CI — coverage data from gcovr

codecov · 2026-05-06T15:47:41Z

Codecov Report

❌ Patch coverage is 70.21277% with 28 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
src/props/TortuosityHypre.cpp	33.33%	18 Missing ⚠️
src/props/EffectiveDiffusivityHypre.cpp	85.48%	0 Missing and 9 partials ⚠️
src/props/FloodFill.cpp	80.00%	0 Missing and 1 partial ⚠️

📢 Thoughts on this report? Let us know!

James Le Houx added 3 commits May 6, 2026 09:48

jameslehoux merged commit 83ae739 into master May 6, 2026

github-actions Bot added physics python labels May 6, 2026

jameslehoux merged 3 commits into master from claude/upbeat-mccarthy-f1mNN
