VirtAPI: reuse a per-process cache in the Uncached calls (performance)#367
VirtAPI: reuse a per-process cache in the Uncached calls (performance)#367ramic-k wants to merge 1 commit into
Conversation
The Type1_v1 Uncached entry points (crossSectionUncached, sampleScatterUncached)
allocated a fresh CachePtr on every call ("Fully MT safe, fully inefficient"). That
per-call allocation/rebuild dominates the NCrystal-attributable cost when NCrystal is
embedded in a host Monte Carlo code through the VirtAPI.
Reuse a persistent cache per (thread, process): a thread_local map keyed by the
process pointer, so the internal cache is allocated once per process per thread
instead of once per call. Keying by process pointer is required because accessCache
does not validate process identity and m_nHistory is not unique per process, so a
single shared cache could be applied to the wrong process. The cross-section and
scatter paths use separate maps, matching the two original CachePtr objects.
thread_local keeps it thread-safe without locks. No public API change, no physics
change.
Measured with OpenMC and NCrystal 4.4.4 on a graphite-moderated benchmark: +11-12%
transport throughput, with bit-identical k.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01CRiGki7G3D8QyixmYeMGiY
|
Thanks for the report @ramic-k (and Claude). So I fully acknowledge the issue here, and in fact I could swear there was already an open issue on it since it is something I want to address. However, the fix you propose is not the correct one. Usage of thread_local is not always fully portable, and might depend on how client code has build their binaries. Also, the client code should be the one deciding on complicated issues like where to store data of each thread and when it is necessary. I asked @paulromano to clarify this on the OpenMC side when we were developing the interface there, and at that particular time he was not sure so we decided to simply move on. Now, the correct solution will be to add a new virtual API ( So I am going to close this PR and instead open an issue to track the need for this |
|
New issue opened in #371, please continue any discussions there. |
A non-physics performance fix to NCrystal's
VirtAPI_Type1_v1implementation,proposed for upstream. It removes a per-call allocation that dominates the
NCrystal-attributable cost when NCrystal is embedded in a host Monte Carlo transport
code such as OpenMC or McStas. It changes no public API and no physics. Measured
against OpenMC with NCrystal 4.4.4 on the model below, it gives ~11–12% transport
throughput with bit-identical results. The size of the gain is problem-dependent: it
scales with how much of the transport actually goes through NCrystal, so a more
NCrystal-heavy problem sees a larger gain (Section 4).
1. Motivation
When OpenMC drives thermal scattering through NCrystal, transport is slower than the
native ACE/
S(α,β)path. For the model benchmarked here (graphite-moderated HEU,where only the graphite treatment differs) the gap is ~16–18%.
That figure is specific to this problem, not a universal NCrystal overhead. The
slowdown, and therefore the benefit of this patch, scales with how much of the
transport runs through NCrystal: the number of NCrystal materials, and the fraction
of collisions below the thermal cutoff (5 eV) where NCrystal is invoked. A simple
problem with one small NCrystal region sees little; a complex, NCrystal-heavy problem
(many NCrystal materials, a well-thermalized spectrum) sees more. The graphite vs
graphite+water comparison in Section 4 shows this directly.
Profiling (
sample) shows the extra cost is memory allocation, not scatteringphysics. Outside the unavoidable native work (ACE nuclide cross sections, geometry),
the largest contributions are
malloc/free(_xzm_free,__bzero, and similar;about 1800 leaf samples) plus
ProcImpl::CacheProcComp::reset.The source is in
ncrystal_core/src/virtualapi/NCVirtAPI_Type1_v1_impl.hh. TheUncachedentry points that the host calls construct a freshCachePtron everycall:
So each per-collision cross-section and scatter evaluation allocates, populates, and
frees an internal cache. The original comment already calls this out as inefficient.
2. The change
Reuse a persistent cache per (thread, process) rather than allocating one per call.
Each process keeps its own cache, keyed by the process pointer, in a
thread_localmap:
The full diff is in
ncrystal_percall_cache.patch(one file, +14/-6). Only theVirtAPI implementation changes; the frozen interface header
(
NCVirtAPI_Type1_v1.hh) and all host code are untouched.3. Correctness
Process::accessCachedoes notvalidate process identity, and the only cache-reset trigger
(
ProcComposition::m_nHistory) starts at 1 and increments only when componentschange, so two materials with the same number of components share an
m_nHistory.A single shared cache would be reused across different processes and return wrong
results. Keying the pool by the process pointer gives each process its own cache,
so this cannot happen.
thread_localgives each thread its own map, with no sharingand no locks.
independent
dummycacheobjects in the original. This avoids mixing cache typesbetween the two code paths.
crossSection/sampleScattercalls as before; only the cache's lifetime changes.Empirically,
kis bit-identical (Section 4).4. Benchmark
The build is reproducible from source. A conda-forge env provides
clang/clangxx_osx-arm64,cmake,ninja, andhdf5. NCrystal 4.4.4 (main,8113afe3) is built both unpatched and patched. OpenMCdevelop(4d6244d93) isbuilt unmodified: there are no host-side changes, and the models use the existing
whole-material
cfg=NCrystal feature (notfreegaspadding, and not anyper-element extension), so the test isolates the NCrystal patch against stock OpenMC.
The A/B swaps
libNCrystal.dylibbetween the two builds, keeping the OpenMC binaryand the model fixed and running a single process (the per-call allocation is
per-process, so the relative speedup does not depend on MPI rank count).
The models live in
pr_model/: three variants of ICSBEP HEU-COMP-THERM-016(graphite-moderated HEU).
hct016_native_ace.xmlis the all-ACES(α,β)baseline;hct016_graphite_ncrystal.xmlandhct016_graphite_water_ncrystal.xmlare the twoNCrystal variants. All run with ENDF/B-VIII.1 at 296 K and
OMP_NUM_THREADS=1. Thenative baseline runs at ~4696 p/s, so unpatched NCrystal is ~14% slower and patched
NCrystal ~4% slower. The patch recovers roughly three-quarters of the
NCrystal-vs-native overhead.
A few things to note from the table.
kis bit-identical between unpatched andpatched, so the change is pure caching. The speedup grows as more materials use
NCrystal (graphite to graphite+water), because each NCrystal material adds calls
whose per-call allocation the patch removes. After patching, a re-profile shows the
malloc/freechurn is gone, and the NCrystal path is then dominated by the samenative work as the ACE path.
A separate 14-rank MPI A/B of the same change gave ~+10%, consistent with the
single-process result (the speedup is per-process, so it does not depend on MPI rank
count).
5. Reproducing
The size of the gain depends on geometry and materials: more sub-cutoff NCrystal
collisions (more NCrystal materials, more moderator) give a larger gain.
🤖 Generated with Claude Code