fix all remaining host-write-to-device-memory bugs in HYPRE solvers #270

Merged
jameslehoux merged 3 commits into master from claude/upbeat-mccarthy-f1mNN
May 6, 2026

@jameslehoux

After patching VoxelImage.from_numpy (64306d4) and FloodFill seed planting
(61cf635), an audit of the rest of src/props/ surfaced four more sites with
the same pattern: host code writing through an Array4 view that points at
device-resident iMultiFab data on CUDA builds. All would have segfaulted on
T4 / A100 / etc. once the user got past the earlier crash points.

CRITICAL — fires on every solve in the oi.tortuosity hot path:

TortuosityHypre.cpp:1137 — flux-calc solution writeback. Every call to
solver.value() reads HYPRE's solution into a host buffer with
HYPRE_StructVectorGetBoxValues, then a LoopOnCpu copies into mf_soln_temp.
On GPU the destination Array4 lives on device → segfault.

EffectiveDiffusivityHypre.cpp:721 — same pattern in getChiSolution. Every
oi.effective_diffusivity solve hits this on GPU.

EffectiveDiffusivityHypre.cpp:283 — generateActiveMask wrote
mask_arr(i,j,k,...) and read dc_arr(i,j,k,0) inside a LoopOnCpu, which
segfaults on both ends. Replaced with a ParallelFor for the mask write
plus a ReduceOps<Sum,Sum,Sum> for the three debug counters; the
std::atomic<long> counters that the LoopOnCpu was incrementing are
unnecessary now that the reduction is one-shot.

EffectiveDiffusivityHypre.cpp:532 — pin-cell search read mask_arr from
host. Replaced with a ReduceOps<Min> over the linearised cell index of
active cells (sentinel = LONG_MAX for inactive); the winning index is
unpacked back to (i,j,k) on host.
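
For reviewers unfamiliar with the pin-cell change, the following is a minimal host-only sketch of the idea in plain C++, not the AMReX code from this PR. Each cell contributes its linearised index if it is active and the LONG_MAX sentinel otherwise, a min-reduction picks the winner, and the winner is unpacked back to (i,j,k). The names Int3, min_active_index and unpack, the x-fastest layout, and the "1 = active" mask convention are assumptions made for illustration; in the real code the loop would be the device-side reduction.

```cpp
#include <algorithm>
#include <cassert>
#include <climits>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an (i,j,k) cell index; not an AMReX type.
struct Int3 { int x, y, z; };

// Min-reduction over the linearised index of active cells.
// mask is stored x-fastest; the "1 = active" convention is an assumption.
// Inactive cells contribute the LONG_MAX sentinel, so they can never win.
inline long min_active_index(const std::vector<int>& mask) {
    long best = LONG_MAX;
    for (std::size_t n = 0; n < mask.size(); ++n) {
        const long candidate = (mask[n] == 1) ? static_cast<long>(n) : LONG_MAX;
        best = std::min(best, candidate);
    }
    return best;  // LONG_MAX means no active cell was found
}

// Host-side unpack of the winning linear index back to (i,j,k).
inline Int3 unpack(long lin, Int3 lo, int nx, int ny) {
    assert(lin >= 0 && lin != LONG_MAX);
    const long plane = static_cast<long>(nx) * ny;
    const int k = static_cast<int>(lin / plane);
    const long rem = lin % plane;
    const int j = static_cast<int>(rem / nx);
    const int i = static_cast<int>(rem % nx);
    return Int3{lo.x + i, lo.y + j, lo.z + k};
}
```

For a 4 x 3 x 2 domain with lo = (10,20,30) and active cells at linear indices 17 and 22, min_active_index returns 17 and unpack maps it to (11,21,31). Callers should check for the LONG_MAX sentinel before unpacking.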

NON-CRITICAL — fires only on opt-in paths, fixed for completeness:

TortuosityHypre.cpp:660 — plotfile writeback (write_plotfile=True only)
TortuosityHypre.cpp:709 — failed-solve plotfile (write_plotfile=True
AND solver did not converge)

Same staging recipe everywhere: if AMREX_USE_GPU, copy the host buffer
into a Gpu::DeviceVector first and ParallelFor with a manually-computed
linear index lin = (k - lo.z)*nx*ny + (j - lo.y)*nx + (i - lo.x); else
just point src_ptr at the host buffer and the same ParallelFor expands
to a serial/OMP host loop. Trailing streamSynchronize after each MFIter
loop ensures all writes are complete before the next downstream operation
(FillBoundary, FlushBoundary, plotfile writer).
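
The index arithmetic in the staging recipe can be illustrated with a small plain C++ sketch. It is not the code from this PR: Int3, TileBox, Field and scatter_box are hypothetical stand-ins, and an ordinary triple loop takes the place of the ParallelFor. It assumes the solver buffer is packed tightly over the valid box in x-fastest order, while the destination may be allocated over a larger box (for example with ghost cells), which is why the source offset is computed by hand as lin = (k - lo.z)*nx*ny + (j - lo.y)*nx + (i - lo.x).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for an index triple and an inclusive box.
struct Int3 { int x, y, z; };
struct TileBox { Int3 lo, hi; };

inline std::size_t num_cells(const TileBox& b) {
    return static_cast<std::size_t>(b.hi.x - b.lo.x + 1) *
           (b.hi.y - b.lo.y + 1) * (b.hi.z - b.lo.z + 1);
}

// lin = (k - lo.z)*nx*ny + (j - lo.y)*nx + (i - lo.x), x-fastest.
inline std::size_t linear_index(const TileBox& b, int i, int j, int k) {
    assert(i >= b.lo.x && i <= b.hi.x);
    assert(j >= b.lo.y && j <= b.hi.y);
    assert(k >= b.lo.z && k <= b.hi.z);
    const int nx = b.hi.x - b.lo.x + 1;
    const int ny = b.hi.y - b.lo.y + 1;
    return static_cast<std::size_t>(k - b.lo.z) * nx * ny +
           static_cast<std::size_t>(j - b.lo.y) * nx +
           static_cast<std::size_t>(i - b.lo.x);
}

// Destination field allocated over its own box (may include ghost cells).
struct Field {
    TileBox box;
    std::vector<double> data;
    explicit Field(const TileBox& b) : box(b), data(num_cells(b), 0.0) {}
    double& operator()(int i, int j, int k) {
        return data[linear_index(box, i, j, k)];
    }
};

// Copy a tightly packed buffer covering `valid` into dst.
// In the PR this loop is a ParallelFor, and on GPU builds src_ptr would
// point at a device copy of the buffer instead of the host vector.
inline void scatter_box(const TileBox& valid, const std::vector<double>& src,
                        Field& dst) {
    assert(src.size() == num_cells(valid));
    const double* src_ptr = src.data();
    for (int k = valid.lo.z; k <= valid.hi.z; ++k)
        for (int j = valid.lo.y; j <= valid.hi.y; ++j)
            for (int i = valid.lo.x; i <= valid.hi.x; ++i)
                dst(i, j, k) = src_ptr[linear_index(valid, i, j, k)];
}
```

Each (i,j,k) writes only its own destination cell, so the loop body carries over unchanged to a per-cell parallel launch.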

Verified clang-format passes idempotently.

Together with 64306d4 (VoxelImage) and 61cf635 (FloodFill) this should
take the GPU build from "segfaults at every entry point" to "fully
functional"; oi.tortuosity, oi.effective_diffusivity, oi.percolation_check,
and oi.volume_fraction should all run end-to-end on Colab T4 once 4.2.12
is published.

Skipped intentionally:
TortuosityHypre.cpp:984 — checkMatrixProperties() debug-only function
requires a host-side iMultiFab copy first; not
in any standard call path
TortuosityDirect.cpp — legacy Forward Euler solver, deprecated path

James Le Houx added 3 commits May 6, 2026 09:48
CI clang-format check tripped on the line in 64306d4:

    python/bindings/module.cpp:173:30: error: code should be
      clang-formatted [-Wclang-format-violations]

The 100-column LLVM-based style preferred wrapping the assignment after
the `=`, so the three static_casts sit on a single continuation line
rather than being broken in the middle. Ran `clang-format -i` against
the file and confirmed it's now idempotent under the project's
.clang-format settings.

No behavioural change.

https://claude.ai/code/session_011dJ5Bwq4Tnr8wxH597XJFf

Reported on Colab T4 with openimpala-cuda 4.2.10: notebook §3 still
crashes the kernel, this time at PercolationCheck construction. The
process dies silently inside parallelFloodFill, specifically in the
phase 1 seed-planting loop, FloodFill.cpp:141-147:

    for (const auto& seed : seedPoints) {
        if (tileBox.contains(seed)) {
            if (phase_arr(seed) == phaseID) {
                mask_arr(seed, 0) = label;   // <-- host-side write to
                                             //     device-resident memory
            }
        }
    }

Same bug pattern as VoxelImage.from_numpy in 64306d4: a host-side
loop writes through an Array4<int> view that points at iMultiFab data
the AMReX CUDA build keeps in device memory. Reads through the view
also fault, but it's the writes that consistently kill the kernel.

Fix: keep the host-side per-tile filter (seedPoints can have many
out-of-tile entries on multi-rank decompositions, so it's worth the
short list) but stage the in-tile seeds in a Gpu::DeviceVector and
plant them via amrex::ParallelFor with an AMREX_GPU_DEVICE lambda.
On CPU builds the DeviceVector / copyAsync paths are compiled out by the
#ifdef AMREX_USE_GPU guards and tile_seeds.data() is used directly,
so the change is a no-op for the CPU wheel.

streamSynchronize() after the per-tile copyAsync is needed because
the next iteration of the MFIter may submit another launch while
this one is still in flight; the trailing global streamSynchronize
ensures all planted seeds are visible before phase 2 (FillBoundary +
wavefront expansion) starts.
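
The two steps of the fix (filter seeds per tile on the host, then plant the staged seeds) can be sketched in plain C++ as below. This is an illustration under stated assumptions, not the FloodFill.cpp change itself: Int3, TileBox, Tile, seeds_in_tile and plant_seeds are hypothetical names, the arrays are flat x-fastest vectors, and a serial loop stands in for the ParallelFor over the staged seeds.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins; not AMReX types.
struct Int3 { int x, y, z; };
struct TileBox {
    Int3 lo, hi;  // inclusive bounds
    bool contains(Int3 c) const {
        return c.x >= lo.x && c.x <= hi.x && c.y >= lo.y && c.y <= hi.y &&
               c.z >= lo.z && c.z <= hi.z;
    }
};

struct Tile {
    TileBox box;
    std::vector<int> phase;  // phase ID per cell, x-fastest
    std::vector<int> mask;   // flood-fill label per cell, 0 = unlabelled
    std::size_t index(Int3 c) const {
        assert(box.contains(c));
        const int nx = box.hi.x - box.lo.x + 1;
        const int ny = box.hi.y - box.lo.y + 1;
        return (static_cast<std::size_t>(c.z - box.lo.z) * ny +
                static_cast<std::size_t>(c.y - box.lo.y)) * nx +
               static_cast<std::size_t>(c.x - box.lo.x);
    }
};

// Step 1 (host): keep only the seeds that fall inside this tile, so the
// staged list stays short even when most seeds belong to other tiles.
inline std::vector<Int3> seeds_in_tile(const TileBox& tile,
                                       const std::vector<Int3>& seeds) {
    std::vector<Int3> out;
    for (const Int3& s : seeds)
        if (tile.contains(s)) out.push_back(s);
    return out;
}

// Step 2 (a device kernel in the real code): one independent work item per
// staged seed. Each item touches only its own cell.
inline void plant_seeds(const std::vector<Int3>& tile_seeds, int phaseID,
                        int label, Tile& tile) {
    for (std::size_t n = 0; n < tile_seeds.size(); ++n) {
        const std::size_t cell = tile.index(tile_seeds[n]);
        if (tile.phase[cell] == phaseID) tile.mask[cell] = label;
    }
}
```

Duplicate seeds are harmless in this scheme because every work item that hits the same cell writes the same label.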

Also checked VolumeFraction by inspection: it already uses ReduceOps
with an AMREX_GPU_DEVICE lambda (no host-write pattern), so it needed
no change.

Verified clang-format passes idempotently.

This needs another release tag (4.2.11) before the user can run
PercolationCheck / oi.tortuosity from Colab on a GPU runtime.

https://claude.ai/code/session_011dJ5Bwq4Tnr8wxH597XJFf

After patching VoxelImage.from_numpy (64306d4) and FloodFill seed planting
(61cf635), an audit of the rest of src/props/ surfaced four more sites with
the same pattern: host code writing through an Array4 view that points at
device-resident iMultiFab data on CUDA builds. All would have segfaulted on
T4 / A100 / etc. once the user got past the earlier crash points.

CRITICAL — fires on every solve in the oi.tortuosity hot path:

  TortuosityHypre.cpp:1137 — flux-calc solution writeback. Every call to
  solver.value() reads HYPRE's solution into a host buffer with
  HYPRE_StructVectorGetBoxValues, then a LoopOnCpu copies into mf_soln_temp.
  On GPU the destination Array4 lives on device → segfault.

  EffectiveDiffusivityHypre.cpp:721 — same pattern in getChiSolution. Every
  oi.effective_diffusivity solve hits this on GPU.

  EffectiveDiffusivityHypre.cpp:283 — generateActiveMask wrote
  mask_arr(i,j,k,...) and read dc_arr(i,j,k,0) inside a LoopOnCpu, which
  segfaults on both ends. Replaced with a ParallelFor for the mask write
  plus a ReduceOps<Sum,Sum,Sum> for the three debug counters; the
  std::atomic<long> counters that the LoopOnCpu was incrementing are
  unnecessary now that the reduction is one-shot.

  EffectiveDiffusivityHypre.cpp:532 — pin-cell search read mask_arr from
  host. Replaced with a ReduceOps<Min> over the linearised cell index of
  active cells (sentinel = LONG_MAX for inactive); the winning index is
  unpacked back to (i,j,k) on host.

NON-CRITICAL — fires only on opt-in paths, fixed for completeness:

  TortuosityHypre.cpp:660 — plotfile writeback (write_plotfile=True only)
  TortuosityHypre.cpp:709 — failed-solve plotfile (write_plotfile=True
                            AND solver did not converge)

Same staging recipe everywhere: if AMREX_USE_GPU, copy the host buffer
into a Gpu::DeviceVector first and ParallelFor with a manually-computed
linear index lin = (k - lo.z)*nx*ny + (j - lo.y)*nx + (i - lo.x); else
just point src_ptr at the host buffer and the same ParallelFor expands
to a serial/OMP host loop. Trailing streamSynchronize after each MFIter
loop ensures all writes are complete before the next downstream operation
(FillBoundary, FlushBoundary, plotfile writer).

Verified clang-format passes idempotently.

Together with 64306d4 (VoxelImage) and 61cf635 (FloodFill) this should
take the GPU build from "segfaults at every entry point" to "fully
functional"; oi.tortuosity, oi.effective_diffusivity, oi.percolation_check,
and oi.volume_fraction should all run end-to-end on Colab T4 once 4.2.12
is published.

Skipped intentionally:
  TortuosityHypre.cpp:984 — checkMatrixProperties() debug-only function
                            requires a host-side iMultiFab copy first; not
                            in any standard call path
  TortuosityDirect.cpp    — legacy Forward Euler solver, deprecated path

https://claude.ai/code/session_011dJ5Bwq4Tnr8wxH597XJFf
@github-actions

github-actions Bot commented May 6, 2026

Performance Benchmark Results

Size   Solver      Wall Time (s)   Tortuosity   Expected   Rel. Error   Iters   Status
64³    pcg         0.6898          0.984375     0.984375   0.00e+00     1       PASS
64³    flexgmres   0.4369          0.984375     0.984375   0.00e+00     N/A     PASS
64³    bicgstab    0.4344          0.984375     0.984375   0.00e+00     N/A     PASS
64³    gmres       0.4299          0.984375     0.984375   0.00e+00     N/A     PASS
128³   pcg         8.0556          0.992188     0.992188   0.00e+00     1       PASS
128³   flexgmres   5.7030          0.992188     0.992188   0.00e+00     N/A     PASS
128³   bicgstab    5.6681          0.992188     0.992188   0.00e+00     N/A     PASS
128³   gmres       5.6673          0.992188     0.992188   0.00e+00     N/A     PASS

Fastest solver: gmres at 64³ (0.4299s)

Benchmark: uniform block (analytical τ = (N-1)/N)
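
As a quick check of the Expected column, the analytical value τ = (N-1)/N can be evaluated directly. The sketch below is plain C++ with hypothetical helper names (analytical_tortuosity, relative_error); it is not taken from the benchmark script, and the relative-error definition is an assumption about how the Rel. Error column is computed.

```cpp
#include <cassert>
#include <cmath>

// Analytical tortuosity of the uniform-block benchmark: tau = (N-1)/N.
inline double analytical_tortuosity(int n) {
    assert(n > 0);
    return static_cast<double>(n - 1) / n;
}

// Relative error of a computed value against the expected one (assumed form).
inline double relative_error(double computed, double expected) {
    assert(expected != 0.0);
    return std::fabs(computed - expected) / std::fabs(expected);
}
```

For N = 64 this gives 63/64 = 0.984375, and for N = 128 it gives 127/128 = 0.9921875, which the table shows rounded to 0.992188.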

@github-actions

github-actions Bot commented May 6, 2026

Code Coverage Report

------------------------------------------------------------------------------
                           GCC Code Coverage Report
Directory: .
------------------------------------------------------------------------------
File                                       Lines     Exec  Cover   Missing
------------------------------------------------------------------------------
src/io/CathodeWrite.cpp                       95       83    87%   40-41,97-100,115-116,182-185
src/io/CathodeWrite.H                          1        1   100%
src/io/DatReader.cpp                         135      105    77%   26-27,30,35,92-93,99-100,107-109,135-137,141,144-148,152-155,162,164,208-209,242,245
src/io/DatReader.H                             1        1   100%
src/io/HDF5Reader.cpp                        344       84    24%   40-41,43-44,46-49,52,54-56,58-59,62,64-66,68-74,92-93,126-128,144-145,154-157,174-180,182-187,204,213-215,217,219-228,230-233,236-238,240-251,253-258,266,270,274,276,278,280,282,288,290,297,301,305-306,309-311,313-315,319,324-327,332,337-338,343,350,357-358,432-435,437-440
src/io/HDF5Reader.H                            3        3   100%
src/io/ImageLoader.cpp                        61       42    68%   25,38,48,60-62,64-70,72,77,89-90,92,94
src/io/RawReader.cpp                         266      135    50%   49-50,89-90,111-112,115-117,120-121,140-142,155-157,166-168,174-177,185-186,192-196,200-204,209-212,219-224,231-237,271,273-274,276,283-284,301,312,314,318,325,327,331-334,338,346-347,353-355,361-363,365-366,369,372,374,377-380,382-384,386,388-389,391,393-394,396,398-399,401,403-404,406,410-411,413,417-418,420,425,465,471-472,521-524,538,540-542,544,546-548,558,562-564,566,588
src/io/RawReader.H                             1        1   100%
src/io/TiffReader.cpp                        384      130    33%   59-65,67-69,71-73,75-77,79-80,82-84,86-88,90-92,94-96,98-99,101-103,106-108,111-112,114-117,119,122,124-127,143-144,148-150,152-158,160,186,210,217,226,228-231,240,242-245,248,255,288-293,306,309-317,319-320,323-327,331-335,338-342,344-348,351-357,359-363,367,369,375-377,379-393,396,398-402,404-409,413-418,420-425,428-429,432-434,555-575,577-578,581-588,590,593-609,612-614,670,673-674,677-683,685,689-700,702-703
src/io/TiffReader.H                            5        5   100%
src/props/BoundaryCondition.H                131       74    56%   63,68,70,216,224-229,233-236,238-244,247-249,252-253,255,258-261,264-265,271-272,274-279,285-287,290-296,299,303,365-366,371,373
src/props/ConnectedComponents.cpp             69       67    97%   94-95
src/props/ConnectedComponents.H                4        4   100%
src/props/DeffTensor.cpp                      62       59    95%   122,128-129
src/props/Diffusion.cpp                      510      378    74%   93-94,97-98,103-104,106-116,118,123-132,134-141,144-150,153-157,159-163,165,168-173,175-177,179,182-184,186-187,190-191,193,195-198,200,202-203,288-289,297-298,300,349,359-360,368-371,373-375,404-413,415,453,461,465-467,526-527,533,535,539,547,581,610,638,646,735-736,739-740,757-760,771-772,774,824
src/props/EffDiffFillMtx.H                   120      106    88%   58,216-217,221-225,229,231-235
src/props/EffectiveDiffusivityHypre.cpp      413      372    90%   189-191,193-197,352-355,464,616-619,621-623,625-628,637-640,647,676,688-691,693-695,697,709,720,722
src/props/EffectiveDiffusivityHypre.H          7        7   100%
src/props/FloodFill.cpp                       89       86    96%   94-95,235
src/props/HypreStructSolver.cpp              343      210    61%   87-88,121,133-134,145,299,309,311,314,346,356,358,361,367-370,372-376,378-379,381-385,388-389,391-392,394,397-398,401-402,404-407,409-413,415-416,418-422,425-426,428-429,431,434-435,438-439,441-443,445-451,453-457,460-461,463-464,466,469-470,473,475-477,479-485,487-491,494-495,497-498,500,503-504,507,509-511,513-516,518-522,525-526,528-529,531,534-535,538,541-542,555
src/props/HypreStructSolver.H                  6        6   100%
src/props/MacroGeometry.H                     17       17   100%
src/props/ParticleSizeDistribution.cpp        11       11   100%
src/props/ParticleSizeDistribution.H           6        6   100%
src/props/PercolationCheck.cpp                53       46    86%   32-33,49-51,68,73
src/props/PercolationCheck.H                   4        4   100%
src/props/PhysicsConfig.H                     90       89    98%   150
src/props/ResultsJSON.H                      225      222    98%   242,395,416
src/props/REVStudy.cpp                       151      128    84%   72,83-91,159,170-173,175,183-186,188-190
src/props/SolverConfig.H                      32       20    62%   30,32,37-44,75-76
src/props/SpecificSurfaceArea.cpp             56       55    98%   59
src/props/SpecificSurfaceArea.H                6        6   100%
src/props/ThroughThicknessProfile.cpp         38       38   100%
src/props/ThroughThicknessProfile.H            5        5   100%
src/props/Tortuosity.H                         2        2   100%
src/props/TortuosityDirect.cpp               219      191    87%   81-83,86,100-106,113-114,125,134,140,202-209,226,394,424,433
src/props/TortuosityDirect.H                   5        5   100%
src/props/TortuosityHypre.cpp                794      566    71%   149-150,155-156,240-243,246-248,311,335-337,340-341,343,353-355,358-360,390-393,573,597,601,622,639-640,642-644,646-655,657,669,671-681,685-691,693-697,701-703,705-707,709-718,727,729-739,743-751,753-756,758,768,774-777,779-781,790-793,795-797,813,816-817,840-845,856-859,861,898,903-906,909-911,915-918,920,922-925,927,932-934,936,985,994,999,1002-1007,1023-1026,1040-1044,1049-1054,1064-1068,1073-1078,1083-1087,1090-1093,1100-1103,1114,1123,1125,1129,1131,1153,1199-1200,1286-1288,1414-1417
src/props/TortuosityHypre.H                   15       15   100%
src/props/TortuosityHypreFill.H              127       98    77%   85,203,205-212,237-239,241-245,247-248,250,252,255-256,258-262
src/props/TortuosityKernels.H                 97       53    54%   52,56-60,62-65,69-74,76-80,84-85,90,129,143,157,243,245-248,250-253,257-260,262-265
src/props/TortuosityMLMG.cpp                  99       91    91%   160,181-183,185-186,193,206
src/props/TortuosityMLMG.H                     1        1   100%
src/props/TortuositySolverBase.cpp           301      237    78%   70-72,74-75,94-101,104,106,142-145,200,203,205,255,280,298,327,391,394-396,398,406-409,411-417,422,427-429,435-436,438-440,454,460,464-465,467,478,492,496-498,500,502,506
src/props/TortuositySolverBase.H              13       13   100%
src/props/VolumeFraction.cpp                  25       25   100%
src/props/VolumeFraction.H                     4        4   100%
------------------------------------------------------------------------------
TOTAL                                       5446     3907    71%
------------------------------------------------------------------------------


Generated by CI — coverage data from gcovr

@codecov

codecov Bot commented May 6, 2026

Codecov Report

❌ Patch coverage is 70.21277% with 28 lines in your changes missing coverage. Please review.

Files with missing lines                   Patch %   Lines
src/props/TortuosityHypre.cpp              33.33%    18 Missing ⚠️
src/props/EffectiveDiffusivityHypre.cpp    85.48%    0 Missing and 9 partials ⚠️
src/props/FloodFill.cpp                    80.00%    0 Missing and 1 partial ⚠️
