Skip to content

VirtAPI: reuse a per-process cache in the Uncached calls (performance)#367

Closed
ramic-k wants to merge 1 commit into
mctools:mainfrom
ramic-k:virtapi-percall-cache
Closed

VirtAPI: reuse a per-process cache in the Uncached calls (performance)#367
ramic-k wants to merge 1 commit into
mctools:mainfrom
ramic-k:virtapi-percall-cache

Conversation

@ramic-k

@ramic-k ramic-k commented Jun 27, 2026

Copy link
Copy Markdown

A non-physics performance fix to NCrystal's VirtAPI_Type1_v1 implementation,
proposed for upstream. It removes a per-call allocation that dominates the
NCrystal-attributable cost when NCrystal is embedded in a host Monte Carlo transport
code such as OpenMC or McStas. It changes no public API and no physics. Measured
against OpenMC with NCrystal 4.4.4 on the model below, it gives ~11–12% transport
throughput with bit-identical results. The size of the gain is problem-dependent: it
scales with how much of the transport actually goes through NCrystal, so a more
NCrystal-heavy problem sees a larger gain (Section 4).

1. Motivation

When OpenMC drives thermal scattering through NCrystal, transport is slower than the
native ACE/S(α,β) path. For the model benchmarked here (graphite-moderated HEU,
where only the graphite treatment differs) the gap is ~16–18%.

That figure is specific to this problem, not a universal NCrystal overhead. The
slowdown, and therefore the benefit of this patch, scales with how much of the
transport runs through NCrystal: the number of NCrystal materials, and the fraction
of collisions below the thermal cutoff (5 eV) where NCrystal is invoked. A simple
problem with one small NCrystal region sees little; a complex, NCrystal-heavy problem
(many NCrystal materials, a well-thermalized spectrum) sees more. The graphite vs
graphite+water comparison in Section 4 shows this directly.

Profiling (sample) shows the extra cost is memory allocation, not scattering
physics. Outside the unavoidable native work (ACE nuclide cross sections, geometry),
the largest contributions are malloc/free (_xzm_free, __bzero, and similar;
about 1800 leaf samples) plus ProcImpl::CacheProcComp::reset.

The source is in ncrystal_core/src/virtualapi/NCVirtAPI_Type1_v1_impl.hh. The
Uncached entry points that the host calls construct a fresh CachePtr on every
call:

double crossSectionUncached( const PubScatterProcess& pub_sp, const double* n ) const override {
  auto sp = reinterpret_cast<const ScatterProcess*>(&pub_sp);
  CachePtr dummycache;//<--- Fully MT safe, fully inefficient. To be
                      //revisited in a future api version!
  return sp->procptr->crossSection( dummycache, ... ).dbl();
}

So each per-collision cross-section and scatter evaluation allocates, populates, and
frees an internal cache. The original comment already calls this out as inefficient.

2. The change

Reuse a persistent cache per (thread, process) rather than allocating one per call.
Each process keeps its own cache, keyed by the process pointer, in a thread_local
map:

double crossSectionUncached( const PubScatterProcess& pub_sp, const double* n ) const override {
  auto sp = reinterpret_cast<const ScatterProcess*>(&pub_sp);
  thread_local std::unordered_map<const ScatterProcess*, CachePtr> xs_caches;
  return sp->procptr->crossSection( xs_caches[sp], ... ).dbl();
}
// sampleScatterUncached: identical pattern with a separate scat_caches map

The full diff is in ncrystal_percall_cache.patch (one file, +14/-6). Only the
VirtAPI implementation changes; the frozen interface header
(NCVirtAPI_Type1_v1.hh) and all host code are untouched.

3. Correctness

  • The caches are never aliased between processes. Process::accessCache does not
    validate process identity, and the only cache-reset trigger
    (ProcComposition::m_nHistory) starts at 1 and increments only when components
    change, so two materials with the same number of components share an m_nHistory.
    A single shared cache would be reused across different processes and return wrong
    results. Keying the pool by the process pointer gives each process its own cache,
    so this cannot happen.
  • It is thread-safe: thread_local gives each thread its own map, with no sharing
    and no locks.
  • The cross-section and scatter caches are kept separate, matching the two
    independent dummycache objects in the original. This avoids mixing cache types
    between the two code paths.
  • Behaviour is unchanged. The cache contents come from the same
    crossSection/sampleScatter calls as before; only the cache's lifetime changes.
    Empirically, k is bit-identical (Section 4).

4. Benchmark

The build is reproducible from source. A conda-forge env provides
clang/clangxx_osx-arm64, cmake, ninja, and hdf5. NCrystal 4.4.4 (main,
8113afe3) is built both unpatched and patched. OpenMC develop (4d6244d93) is
built unmodified: there are no host-side changes, and the models use the existing
whole-material cfg= NCrystal feature (not freegas padding, and not any
per-element extension), so the test isolates the NCrystal patch against stock OpenMC.
The A/B swaps libNCrystal.dylib between the two builds, keeping the OpenMC binary
and the model fixed and running a single process (the per-call allocation is
per-process, so the relative speedup does not depend on MPI rank count).

The models live in pr_model/: three variants of ICSBEP HEU-COMP-THERM-016
(graphite-moderated HEU). hct016_native_ace.xml is the all-ACE S(α,β) baseline;
hct016_graphite_ncrystal.xml and hct016_graphite_water_ncrystal.xml are the two
NCrystal variants. All run with ENDF/B-VIII.1 at 296 K and OMP_NUM_THREADS=1. The
native baseline runs at ~4696 p/s, so unpatched NCrystal is ~14% slower and patched
NCrystal ~4% slower. The patch recovers roughly three-quarters of the
NCrystal-vs-native overhead.

graphite/water treatment unpatched (p/s) patched (p/s) speedup k (both)
graphite NCrystal, water ACE 4016 / 4103 4465 / 4574 +11.4% 1.00711 ± 0.00274
graphite + water NCrystal 4026 / 4021 4516 / 4522 +12.3% 1.00270 ± 0.00238

A few things to note from the table. k is bit-identical between unpatched and
patched, so the change is pure caching. The speedup grows as more materials use
NCrystal (graphite to graphite+water), because each NCrystal material adds calls
whose per-call allocation the patch removes. After patching, a re-profile shows the
malloc/free churn is gone, and the NCrystal path is then dominated by the same
native work as the ACE path.

A separate 14-rank MPI A/B of the same change gave ~+10%, consistent with the
single-process result (the speedup is per-process, so it does not depend on MPI rank
count).

5. Reproducing

git clone https://github.com/mctools/ncrystal && cd ncrystal     # main = 4.4.4
git apply /path/to/ncrystal_percall_cache.patch
cmake -S ncrystal_core -B build -DCMAKE_BUILD_TYPE=Release \
      -DCMAKE_INSTALL_PREFIX="$CONDA_PREFIX" && cmake --build build -j && cmake --install build
# build OpenMC `develop`; run with this NCrystal's ncrystal-config first on PATH
# (OpenMC locates the lib via `ncrystal-config --show shlibpath`)

The size of the gain depends on geometry and materials: more sub-cutoff NCrystal
collisions (more NCrystal materials, more moderator) give a larger gain.


🤖 Generated with Claude Code

The Type1_v1 Uncached entry points (crossSectionUncached, sampleScatterUncached)
allocated a fresh CachePtr on every call ("Fully MT safe, fully inefficient"). That
per-call allocation/rebuild dominates the NCrystal-attributable cost when NCrystal is
embedded in a host Monte Carlo code through the VirtAPI.

Reuse a persistent cache per (thread, process): a thread_local map keyed by the
process pointer, so the internal cache is allocated once per process per thread
instead of once per call. Keying by process pointer is required because accessCache
does not validate process identity and m_nHistory is not unique per process, so a
single shared cache could be applied to the wrong process. The cross-section and
scatter paths use separate maps, matching the two original CachePtr objects.
thread_local keeps it thread-safe without locks. No public API change, no physics
change.

Measured with OpenMC and NCrystal 4.4.4 on a graphite-moderated benchmark: +11-12%
transport throughput, with bit-identical k.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01CRiGki7G3D8QyixmYeMGiY
@tkittel

tkittel commented Jun 30, 2026

Copy link
Copy Markdown
Member

Thanks for the report @ramic-k (and Claude).

So I fully acknowledge the issue here, and in fact I could swear there was already an open issue on it since it is something I want to address.

However, the fix you propose is not the correct one. Usage of thread_local is not always fully portable, and might depend on how client code has build their binaries. Also, the client code should be the one deciding on complicated issues like where to store data of each thread and when it is necessary.

I asked @paulromano to clarify this on the OpenMC side when we were developing the interface there, and at that particular time he was not sure so we decided to simply move on.

Now, the correct solution will be to add a new virtual API (VirtAPI_Type1_v2 most likely), which in addition to the crossSectionUncached and sampleScatterUncached also have ...Cached variants of these methods. It will then be the responsiblity of the calling code to use the cloneScatter method to ensure each thread has its own clone (such clones will be light-weight and will essentially just reservesa dedicated CachePtr for use with that clone in ...Cached methods.

So I am going to close this PR and instead open an issue to track the need for this VirtAPI_Type1_v2. However, don't expect it to get high priority unless there are some indications from the OpenMC side that it is actually now clear at which point such per-thread cloning should take place.

@tkittel

tkittel commented Jun 30, 2026

Copy link
Copy Markdown
Member

New issue opened in #371, please continue any discussions there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants