profiling notebook: GPU-aware install, drop legacy probes, polish prose #269
Merged
jameslehoux merged 2 commits into master on May 5, 2026
Conversation
added 2 commits on May 5, 2026 at 15:20
… mem)
Reported on Colab T4: notebook §3 crashes the kernel silently at
`_core.VoxelImage.from_numpy(arr, max_grid_size)`. Root cause: the
binding was filling the new iMultiFab with a CPU loop:
```cpp
auto fab = img->mf->array(mfi);   // Array4<int> view into iMultiFab data
for (k...) for (j...) for (i...)
    fab(i, j, k) = ptr[idx];      // host write — but in CUDA mode this
                                  // is DEVICE memory → segfault
```
Works fine on CPU builds (where iMultiFab data is host memory) but every
attempt to ingest a NumPy array on the GPU build dies before any Python
print can flush. Affects every entry point that goes through from_numpy
— so high-level oi.tortuosity / oi.percolation_check / oi.volume_fraction
on the GPU wheel were ALL broken.
Switch to the AMReX idiom: stage the host data in a Gpu::DeviceVector,
then use ParallelFor with an AMREX_GPU_DEVICE lambda so the assignment
runs on the actual hardware that owns the iMultiFab memory.
* CPU build (#ifndef AMREX_USE_GPU): src_ptr is the host pointer and
ParallelFor expands to a serial/OMP host loop. Identical behaviour
to before.
* GPU build (#ifdef AMREX_USE_GPU): copy host → device once via
Gpu::copyAsync + streamSynchronize, then ParallelFor launches a
kernel that reads from the device buffer and writes through the
Array4<int> view to its own memory.
A final streamSynchronize before FillBoundary ensures the kernel has
finished before the ghost-cell exchange reads the data.
Note: this fix requires a wheel rebuild + new release. The pure-Python
preload helper in 4.2.9 already lets _core.so load on GPU runtimes;
this change makes it actually USABLE on them.
https://claude.ai/code/session_011dJ5Bwq4Tnr8wxH597XJFf
The profiling notebook had grown a few rough edges as we iterated:
1. The install cell hard-coded `pip install openimpala`, ignoring the
GPU runtime entirely. On a Colab T4 every solve was running on the
CPU — exactly the failure mode §1a is supposed to detect. Rewrite
to mirror tutorials/02/04/07: detect nvidia-smi, install
openimpala-cuda on GPU runtimes, fall back to openimpala otherwise.
2. §1a's build_info() probe carried a 60-line subprocess banner
fallback for "pre-4.0.2 wheels". The published wheel has been at
4.2.x for some time now and build_info() is in every wheel we
publish. Drop the legacy fallback and the AttributeError handler.
3. Drop colloquialisms in markdown headers and prose:
- "Profiling & Bottleneck Hunt" -> "Profiling and Performance Tuning"
- "exists for one job" / "If you only have 5 minutes" — removed
- "*(focused, post-diagnosis)*" / "*(optional)*" parentheticals
- "wonky workaround" / "the *classic* signature" — recast neutrally
- The "issue #256 acceptance target" reference in §9b — replaced
with an explanation that scales naturally to readers who weren't
following the issue thread
4. Add a proper contents table to the intro and clean up §2/§5/§6/§7/§9
intros with consistent structure: one-paragraph context, bullet
list of what to look for, one paragraph on interpretation.
5. Fix a latent bug in §12's recommendation: the CPU-detection branch
checked `backend == "cpu"`, but build_info()'s actual values are
`cpp-cpu` / `cpp-cuda` / `cpp-hip` / `pure-python`. The check never
fired in practice. Switch to the already-computed `is_gpu_build`
flag so the "rebuild for GPU" recommendation actually appears when
relevant.
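The GPU-aware install logic described in item 1 can be sketched as follows. This is a minimal illustration, not the notebook cell itself; the package names `openimpala` / `openimpala-cuda` come from the text above, while the helper name is mine:

```python
import shutil


def pick_wheel() -> str:
    """Choose the install target based on runtime GPU availability.

    If nvidia-smi is on PATH we assume a CUDA-capable runtime (e.g. a
    Colab T4) and pick the GPU wheel; otherwise fall back to the CPU
    wheel, matching the behaviour described for tutorials/02/04/07.
    """
    if shutil.which("nvidia-smi") is not None:
        return "openimpala-cuda"
    return "openimpala"
```

In a notebook, the result would feed the install command, e.g. `%pip install -q {pick_wheel()}`.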
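The §12 fix in item 5 amounts to deriving the GPU flag from the backend string rather than comparing against a value `build_info()` never returns. A minimal sketch, assuming only the backend values listed above (the helper name is hypothetical; the notebook uses an already-computed flag):

```python
def is_gpu_build(backend: str) -> bool:
    """True for GPU backends as reported by build_info().

    build_info() reports one of: cpp-cpu, cpp-cuda, cpp-hip,
    pure-python. The old check `backend == "cpu"` matched none of
    these, so the "rebuild for GPU" recommendation never fired.
    """
    return backend in ("cpp-cuda", "cpp-hip")
```

With this predicate, a CPU wheel (`cpp-cpu` or `pure-python`) correctly triggers the "rebuild for GPU" recommendation.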
No analysis logic changes — every code cell that produced a chart, a
fit, or a number still does so identically. This is purely a polish
pass to make the notebook read as a tool rather than a scratchpad.
https://claude.ai/code/session_011dJ5Bwq4Tnr8wxH597XJFf
Performance Benchmark Results
Fastest solver: bicgstab at 64³ (0.3921s). Benchmark: uniform block (analytical τ = (N-1)/N).

Code Coverage Report
Generated by CI — coverage data from gcovr.

Codecov Report
All modified and coverable lines are covered by tests.