Skip to content

Teoc properly separate out#28

Draft
wraith1995 wants to merge 127 commits into
masterfrom
teoc-properly-separate-out
Draft

Teoc properly separate out#28
wraith1995 wants to merge 127 commits into
masterfrom
teoc-properly-separate-out

Conversation

@wraith1995

Copy link
Copy Markdown

No description provided.

wraith1995 and others added 30 commits June 22, 2026 09:08
…tatestruct

Carve the iterative-solver and reduced-basis preconditioner transient state out
of the commonstruct god-struct into a named, nested solverstatestruct member.
Eight mutable fields move: RBcurrentdim, RBremovedind, Wcurrentdim,
linearSolverIter, nonlinearSolverIter, linearSolverTolFactor,
linearSolverRelError, PTCparam. 142 access sites migrated
common.X -> common.solverstate.X across both the ABI (.cpp) and template (.hpp)
implementations; commonstruct::printinfo bare references fixed.

This names and isolates the solver's mutable runtime state from the read-only
configuration it used to ride inside (first cut of the concern-separation
refactor; the pervasive time-stepping state time/currentstep/currentstage/
dtfactor is deferred to a TimeState group in C3).

No wire-format or driver-ABI change. Verified: frontend+consumer ctest 13/13;
poisson coupling QoI byte-identical (1.806679e-01 3.871878e-01).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
First step of the "QoI template, many instances" model. Add qoiinstancestruct
(name, kind, boundary, offset, ncomp) and a commonstruct.qoiinstances list;
setstructs synthesizes the historical default (one Domain instance over all
volume QoIs, one Boundary instance over all surface QoIs on common.ibs). The QoI
column headers and value rows are now written by shared instance-driven helpers
writeQoIHeader/writeQoIRow, replacing the six hand-rolled nvqoi/nsurf loops
across both the ABI (.cpp) and template (.hpp) solution/postsolution paths.

Compute (qoiElement/qoiFace) is unchanged here; this is the data-model + output
seam that lets a program register multiple named QoI instances. Default output
is byte-identical. Verified: ctest 13/13; poisson coupling QoI byte-identical
(Domain_QoI1=1.806679e-01, Domain_QoI2=3.871878e-01, Boundary_QoI1=0).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Complete the "QoI template, many instances" model with a user-facing handle:
ExasimSolver::AddQoI(modelindex, name, kind, boundary, offset, ncomp) and
ClearQoI(modelindex), callable after Initialize and before Solve. A program can
now register several independently-named QoIs, each owning a slice of the model
QoI-kernel output; out-of-range slices are rejected so blind registration is safe.

The QoI header line is now written lazily on the first SaveQoI row
(writeQoIHeaderOnce) instead of at file-open, so instances registered after model
init are reflected in the columns (byte-identical for the default instances).

The builtin consumer gains an EXASIM_QOI_DEMO path that registers three named
instances (KineticEnergy, Dissipation, WallFlux) purely through the public API;
output header becomes "KineticEnergy1 Dissipation1 WallFlux1" vs the default
"Domain_QoI1 Domain_QoI2 Boundary_QoI1", same values.

Deferred (tracked): per-instance multi-boundary compute (needs multi-ib QoI
kernels from the frontend) and MPI_Allreduce of QoI values (collective-safety
review). Verified: ctest 13/13; coupling QoI byte-identical; demo confirmed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
First twin unified. The non-templated ABI qoicalculation.cpp is removed; the
templated qoicalculation.hpp is now the single source, instantiated with
exasim::detail::AbiAdapter on the production path and with a concrete model M on
the compile-time-specialized path (facade consumer).

Mechanism (reusable for the remaining 24 twins):
- commonstruct gains an ExasimDriverABI* driver_abi, set by CDiscretization, so
  the templated FEM code can reach the runtime ABI without threading it.
- model_drivers_abi.cpp gains no-driver_abi overloads of QoIvolumeDriver/
  QoIboundaryDriver that recover the ABI from common.driver_abi and forward to the
  explicit-ABI versions; the EXASIM_DRIVER_CALL AbiAdapter branch binds to these.
- non-templated callers (solution.cpp, postsolution.cpp) call qoiElement/qoiFace
  with explicit <AbiAdapter>.
- install/CMakeLists.txt puts the repo root on the in-tree include path so the
  <exasim/...> umbrella headers resolve (matching out-of-tree consumers).

Verified: build green; ctest 13/13 (incl. facade template-path consumer);
coupling QoI byte-identical (1.806679e-01 / 3.871878e-01).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…vers

Scaffolding for the twin sweep. Adds inline no-driver_abi overloads for the 36
solve-loop kernel drivers (Flux/Source/Fbou/Ubou/Fhat/Uhat/Tdfunc/Avfield/EoS/
Vis*/QoI*/Fint/Fext/Output/Monitor/... and their Jacobian variants) that recover
the model ABI from common.driver_abi and forward to the explicit-ABI versions.
These bind the EXASIM_DRIVER_CALL AbiAdapter branch, so unifying each twin onto
its templated .hpp no longer needs per-file overloads -- only an include switch,
<AbiAdapter> on non-templated callers, and deleting the .cpp.

Generated mechanically from the explicit-ABI signatures (uniform: every solve-loop
driver takes commonstruct& common). The 10 init-path drivers (Init*/cpuInit*) take
no common and are handled when their files are unified. Replaces the two
hand-written QoI overloads from the pilot.

No behavior change (overloads are additive; only qoicalculation uses them so far).
Verified: build green; ctest 13/13.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Remove getuhat.cpp; getuhat.hpp is the single source (GetUhat/GetdUhat/UhatBlock/
dUhatBlock). residual.cpp now includes the .hpp; its 11 GetUhat/GetdUhat call
sites plus 3 in ldgblockjacobian.cpp and 1 in discretization.cpp now use explicit
<exasim::detail::AbiAdapter> (dropping the driver_abi argument; UbouDriver is
served by the no-driver_abi overload). Verified: ctest 13/13; coupling QoI
byte-identical.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Remove qresidual.cpp; templated qresidual.hpp is the single source (RqElem/RqFace/
dRqElem/dRqFace + blocks). residual.cpp includes the .hpp and its calls use explicit
<exasim::detail::AbiAdapter>. qresidual uses no kernel drivers. ctest 13/13; coupling
byte-identical.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Remove uresidual.cpp; templated uresidual.hpp is the single source. Includers switched to the
.hpp; non-templated callers use <exasim::detail::AbiAdapter> (kernel drivers served by
the no-driver_abi overloads). ctest 13/13; coupling byte-identical.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…d sources

The solve-path files are mutually recursive via cross-file <M> template calls
(matvec->uequation/residual, uequation->wequation, residual->wequation/qequation),
so they can't be unified independently -- landed as one atomic cluster:
remove wequation/qequation/uequation/residual/matvec/massinv .cpp; the templated
.hpp are the single source; non-templated callers (solution, discretization,
postdiscretization, ldgblockjacobian, gmres, ptcsolver, preconditioner, ...) use
<exasim::detail::AbiAdapter> with the driver_abi argument dropped.

Fixes uncovered en route:
- EXASIM_LEGACY_W_CALL now dispatches the HDG w-equation source through
  common.driver_abi for the AbiAdapter path (was a bare global call that resolved
  to a per-model kernel, wrong for runtime ABI dispatch); wequation.hpp now
  includes the dispatch header so the macro is defined.
- Reconciled twin drift: restored EnsureTemplateAllocation (lived in qequation.cpp,
  absent from .hpp) into qequation.hpp.
- massinv is a non-template helper, not a template twin: its .cpp carried an unused
  driver_abi param; callers now use massinv.hpp's signature (no driver_abi, no
  template argument).
- caller rewriter now strips qualified disc.driver_abi / this->driver_abi too.

Verified: ctest 13/13; coupling QoI byte-identical (1.806679e-01 / 3.871878e-01).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The backend class .hpp (discretization, solution, solver, preconditioner,
visualization) and their .hpp helpers (postdiscretization, postsolution,
avsolution, previoussolutions, timestepcoeff, updatesolution, updatesource,
gmres, ptcsolver, getpoly, postsolver, setsysstruct, postpreconditioner,
applymatrix, setprecondstruct) are DEAD: nothing in the cmake build includes
them (production compiles the non-template .cpp/.h classes; the live
compile-time-specialized template path is the separate include/exasim/ stack
used by the facade via <exasim/model.hpp>). Proven: the library builds with
#error probes in all five class .hpp and zero hits; every includer of these
files is itself in the dead set.

These are an abandoned earlier template attempt -- the very ABI/template twin
duplication this refactor targets -- and several don't even compile (they
define CFoo<M>:: methods against a non-template class CFoo). Removing them
eliminates the stale duplication for the class layer (the free-function layer
was unified onto its templates earlier). Production classes stay non-template
(runtime ABI dispatch), which already satisfies "CSolver/CPreconditioner are
not templated on the model".

20 files removed. No build impact: ctest 13/13; coupling QoI byte-identical.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Last two true ABI/template twins. pointlocator.hpp/pointwallmodel.hpp are dead
(in no depfile; included only by each other). Removing them completes the twin
handling. No build impact: ctest 13/13; coupling QoI byte-identical.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…monstruct

Group the mutable time-stepping runtime fields (currentstep, currentstage,
dtfactor, time) into a nested timestatestruct member, alongside C1's solverstate.
Together they isolate the simulation's mutable runtime state from the read-only
configuration in the god-struct. 413 access sites migrated common.X ->
common.timestate.X across backend/ and include/exasim/ (the live template stack,
compiled via the unified free-function headers); printinfo bare refs fixed.

First commonstruct-decomposition step on the now-unified (post-U) single codebase.
No wire/ABI change. Verified: ctest 13/13; coupling QoI byte-identical.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Group the physics/model configuration (appname, source, convStabMethod,
diffStabMethod, rotatingFrame, viscosityModel, SGSmodel, ALEflag, ncAV,
AVsmoothingIter, frozenAVflag, AVdistfunction, rampFactor, tau0) into a nested
physicsparamsstruct member. 75 access sites migrated common.X ->
common.physicsparams.X across backend/ and include/exasim/; printinfo bare refs
fixed. (read_uh deferred: it collides with an appstruct field.)

No wire/ABI change. Verified: ctest 13/13; coupling QoI byte-identical.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Group the iterative-solver configuration (linearSolver, nonlinearSolver,
linear/nonlinearSolverMaxIter, matvecOrder, gmresRestart, gmresOrthogMethod,
preconditioner, precMatrixType, ptcMatrixType, RBdim, Wdim, matvecTol,
linear/nonlinearSolverTol) into a nested solverparamsstruct member. 96 access
sites migrated common.X -> common.solverparams.X across backend/ and
include/exasim/; printinfo bare refs fixed. (Mutable per-solve counters stay in
solverstate from C1.)

No wire/ABI change. Verified: ctest 13/13; coupling QoI byte-identical.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Group the time-integration / problem-evolution configuration (temporalScheme,
torder, tstages, tsteps, dae_steps, dae_alpha/beta/gamma/epsilon, tdep, wave,
tdfunc, linearProblem, subproblem) into a nested timeparamsstruct member. 248
access sites migrated common.X -> common.timeparams.X across backend/ and
include/exasim/; printinfo bare refs fixed. (Mutable step/stage counters stay in
timestate; dt/dae_dt are appstruct raw-wire pointers, untouched.)

No wire/ABI change. Verified: ctest 13/13; coupling QoI byte-identical.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Group the QoI / visualization-output configuration (nsca, nvec, nten, nsurf,
nvqoi, saveParaview, ibs, the qoivolume/qoisurface accumulation buffers, and the
C2 qoiinstances vector) into a nested qoiparamsstruct member. 109 access sites
migrated common.X -> common.qoiparams.X across backend/ and include/exasim/; the
writeQoI* helpers + freememory() + printinfo bare refs fixed. Folds the C2
instance machinery under the same concern.

No wire/ABI change. Verified: ctest 13/13; coupling QoI byte-identical.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Group the interface-coupling configuration (ncuext, coupledinterface,
coupledcondition, coupledboundarycondition, FextCall, ncie, extUhat, extFhat,
extStab, ninterfacefaces, ndofuhatinterface, nintfaces, nvindx) into a nested
couplingparamsstruct member. 135 access sites migrated common.X ->
common.couplingparams.X across backend/ and include/exasim/; printinfo bare refs
fixed. The raw interface arrays (vindx/interfacefluxmap/intepartpts) are shared
with appstruct and left in place; wall-model and synthetic-turbulence fields are
separate concerns (left for their own groups).

No wire/ABI change. Verified: ctest 13/13; coupling QoI byte-identical.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Group the output/checkpoint/IO configuration (saveSolFreq, saveSolOpt,
saveRestart, saveSolBouFreq, timestepOffset, compudgavg, readudgavg,
saveResNorm, fileoffset, debugMode) into a nested outputparamsstruct member.
228 access sites migrated common.X -> common.outputparams.X across backend/ and
include/exasim/; printinfo bare refs fixed.

No wire/ABI change. Verified: ctest 13/13; coupling QoI byte-identical.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Two small feature concerns: wallmodelparamsstruct (nwm, szwmModelIDs,
szwmBoundaries, szwmDistances, wmModelIDs, wmBoundaries, wmDistances) and
stgparamsstruct (stgNmode, nstgib, stgib). 38 access sites migrated common.X ->
common.wallmodelparams.X / common.stgparams.X across backend/ and include/exasim/;
printinfo + freememory bare refs fixed (region-guarded so the appstruct-mirrored
raw tables are untouched).

No wire/ABI change. Verified: ctest 13/13; coupling QoI byte-identical.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Group the ~17 degree-of-freedom counts (ndof, ndofq, ndofw, ndofuhat, ndofudg,
ndofsdg, ndofodg, ndofedg, ndofbou, ndofucg, and the ndof*1 interior variants)
into a nested sizesstruct member -- the derived "sizes view": cross-products of
the base dimensions, computed once in setstructs and read throughout. 225 access
sites migrated common.X -> common.sizes.X across backend/ and include/exasim/;
printinfo bare refs fixed. Kept as STORED (not recomputed on access): ndof1*
switch between ne1 and ne by mesh branch, and ndofucg/ndofbou are not simple
products, so a naive computed view would be wrong. Base dimensions (nd,
nc-family, npe, ne, ...) stay top-level as the fundamental shape.

Completes C3: commonstruct now opens with 11 nested concern groups (solverstate,
timestate, physicsparams, solverparams, timeparams, qoiparams, couplingparams,
outputparams, wallmodelparams, stgparams, sizes).

No wire/ABI change. Verified: ctest 13/13; coupling QoI byte-identical.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The LDG block-Jacobi preconditioner computation lived in CDiscretization but is a
preconditioner responsibility. Moved the method to CPreconditioner (it takes the
discretization by reference and reads disc.sol/res/app/master/mesh/tmp/common,
exactly as before). BlockJacobianLDG/mpiBlockJacobianLDG remain in
ldgblockjacobian.cpp and stay visible via the unity build (ExasimSolver.cpp
includes discretization.cpp before preconditioner.cpp). Callers in CSolution::DIRK
updated disc.ComputeLDGPreconditioner(...) -> prec.ComputeLDGPreconditioner(disc, ...).

First C4 re-homing. (res.K is still shared scratch in resstruct -- moving the
preconditioner matrix storage into precondstruct is a deeper follow-up.)

No wire/ABI change. Verified: ctest 13/13; coupling QoI byte-identical.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Mirror the commonstruct C3 decomposition in the model templating system. The
monolithic CRTP base ModelDefaults<Self> (model.hpp) provided zero/identity
defaults for every concern in one struct; split it into a chain of per-concern
default mixins, each adding ONE concern's surface:

  ModelConstants<Self>            (disc, nco)
    -> VolumeDefaults<Self>       (source, sourcew, tdfunc, avfield)
    -> BoundaryDefaults<Self>     (ubou, fbou, fbou_hdg, fhat, uhat, stab)
    -> EoSDefaults<Self>          (eos, eos_du, eos_dw)
    -> InitDefaults<Self>         (initq, initwdg, initudg, initodg)
    -> OutputDefaults<Self>       (vis_*, qoi_*, monitor, output)
    -> HDGJacobianDefaults<Self>  (the HDG pointwise Jacobians)
  ModelDefaults<Self> : HDGJacobianDefaults<Self> {}   // composite, public API

Prerequisite: extracted the zero_fill_ member helper into a free
detail::zero_fill<N>() so the mixins can call it without depending on a shared
base (avoids two-phase lookup on a dependent base). Users still derive from
ModelDefaults<Self> and inherit every default unchanged; each default now lives
in the mixin for its concern, and a kernel/class can name just the mixin whose
surface it needs. All defaults reference constants via Self:: only, so the chain
is behavior-preserving.

No wire/ABI change. Verified: ctest 13/13 (facade consumer compiles the templated
model path); coupling QoI byte-identical.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…s concern

Propagate the model split to the kernels. Added void_t method detectors + 13
per-capability concept traits in model.hpp (is_flux/source/sourcew/tdfunc/avfield/
eos/boundary/hdg_boundary/interface/init/qoi/output/vis_model_v), each = is_model_v
&& the methods that concern's kernels actually call. Replaced the blanket
static_assert(is_model_v<M>) in every kernel (include/exasim/kernels/*.hpp) with the
narrow trait for the concern it consumes: flux->is_flux, boundary fbou/ubou->is_boundary
and the hdg_fbou variants->is_hdg_boundary, eos*->is_eos and avfield->is_avfield, etc.

Models deriving ModelDefaults satisfy all traits (defaults provide every method), so
this is behavior-preserving; the value is precise concept diagnostics for hand-rolled
models and self-documenting each kernel's required surface.

No wire/ABI change. Verified: ctest 13/13 (facade compiles templated kernels);
coupling QoI byte-identical.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
writeQoIHeader/writeQoIRow/writeQoIHeaderOnce (common.h) took the whole
commonstruct but use only common.qoiparams -> take const qoiparamsstruct&. 4 call
sites pass disc.common.qoiparams. Clean leaf narrowing (the helper's dependency is
now exactly the QoI concern).

Scoping note: the other Scan-2 candidates (MatVec, ComputeMinv, the geometry
helpers) are ORCHESTRATORS that forward common/cublasHandle to inner kernels, so
narrowing them isn't a clean leaf change (it cascades) -- the real bulk of
"stop passing common to things that don't need it" is the model-driver layer (S3),
where common is unpacked to base dims and forwarded to the Kokkos* kernels.

No wire/ABI change. Verified: ctest 13/13; coupling QoI byte-identical.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Per the directive to place intermediate structs (rather than passing the whole
commonstruct, or loose dim params, to the kernel/driver layer): group the
component counts of the DG fields (nc, ncu, ncq, ncw, nco, nch, ncx, ncs, nce,
ncm, ntau) into a nested componentsstruct member. This is the "field shape" the
kernels and *Driver() functions template their loops on, and the deferred C3
base-dimension group. 2246 access sites migrated common.X ->
common.components.X across backend/ and include/exasim/ (the hot nc/ncu/ncw/nco/ncx
fields); printinfo bare refs fixed. ncuext already lives in couplingparams.

Drivers/kernels now reach the component counts via the cohesive intermediate
struct; a follow-on can narrow driver signatures to take componentsstruct (+ a
grid-sizes struct) instead of commonstruct.

No wire/ABI change. Verified: ctest 13/13; coupling QoI byte-identical.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… struct

Second intermediate dim struct: gridstruct (nd, elemtype, nodetype, porder,
pgauss, npe, npf, nge, ngf, np1d, ng1d, curvedMesh) -- the master-element shape
the kernels loop over. 1105 access sites migrated common.X -> common.grid.X across
backend/ and include/exasim/; printinfo bare refs fixed. (ppdegree left out -- it's
a preconditioner/solver param, not a grid size; appstruct's porder pointer is a
different field, untouched.)

With componentsstruct, the dims the kernels/drivers consume now live in two
cohesive intermediate structs (common.components + common.grid).

No wire/ABI change. Verified: ctest 13/13; coupling QoI byte-identical.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Third intermediate dim struct: meshsizesstruct (maxnbc, ne, nf, nv, nfe, nbe, neb,
nbf, nfb, the nbe0/1/2 + nbf0/1 block splits, ne0/1/2, nf0) -- the mesh
element/face/block counts the kernels iterate over. 909 access sites migrated
common.X -> common.meshsizes.X across backend/ and include/exasim/; printinfo bare
refs fixed, including the eblks/fblks/nboufaces array-printing loops (bare nbe/nbf/
maxnbc as loop bounds, not caught by the single-arg printf pass). nse/nese/nfse/nnz
left out (superelement-CRS/preconditioner sizes, not mesh counts).

S3 outcome: rather than narrowing the ~200 drivers to loose dim params (verbose,
~160 risky lockstep twin signatures), the base dimensions the kernels/drivers
consume are now grouped into three cohesive INTERMEDIATE STRUCTS --
common.components (DG-field component counts), common.grid (reference-element
sizes), common.meshsizes (mesh partition counts) -- 4260 sites total. Drivers
reference these (common.components.nc, common.grid.nd, common.meshsizes.ne) instead
of loose god-struct fields.

No wire/ABI change. Verified: ctest 13/13; coupling QoI byte-identical.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The PTC convergence-monitor field is a solver-convergence artifact, not a
discretization quantity. Moved evalMonitor to CSolution (which owns disc as a
member, so it needs no disc& param -- cleaner than the preconditioner move). The
two callers in CSolution::SteadyProblem_PTC change disc.evalMonitor(...) ->
evalMonitor(...). Removed from CDiscretization (discretization.h/.cpp).

Mirrored in the postprocess twins for parity (postdiscretization -> postsolution,
no-driver_abi MonitorDriver variant). NB: the postprocess tool (postprocess.cpp)
is not a cmake build target, so those twins are not gated -- the mirror is
best-effort hygiene; the main path is verified.

No wire/ABI change. Verified (main path): ctest 13/13; coupling QoI byte-identical.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Completes S4. The output field is an I/O concern, not a discretization quantity.
Moved evalOutput (the 58-line MPI-halo udg exchange + model OutputDriver call) to
CSolution (uses the owned disc's structs; no disc& param). Callers in
CSolution::SaveOutputCG (solution.cpp:1363,1374) change disc.evalOutput(...) ->
evalOutput(...). Removed from CDiscretization (discretization.h/.cpp).

Mirrored in the postprocess twins (postdiscretization -> postsolution, 4 callers,
no-driver_abi OutputDriver variant); postprocess is not a cmake target so that
mirror is best-effort. Main path verified.

With evalMonitor (f353af6), CDiscretization is shed of its two output/monitor
concerns; both now live in CSolution next to the other save/output methods.

No wire/ABI change. Verified (main path): ctest 13/13; coupling QoI byte-identical.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…e app

Two propagation gaps the `\bcommon\.` migration missed:
- hcommon.<field> (the GPU host-copy commonstruct in setstructs/discretization/
  postdiscretization, inside #ifdef HAVE_CUDA blocks the CPU gate doesn't compile):
  the perl word-boundary skipped the `hcommon` prefix. 6 refs migrated
  hcommon.{nfe,ne,...} -> hcommon.{meshsizes,grid,components}.X. A GPU build would
  otherwise fail to compile. (hcommon touches only dim fields; verified no
  earlier-group fields remain.)
- apps/library_example/solve_square: common.nc -> common.components.nc (the migration
  only scanned backend/ + include/).

Investigation result: text2code and the frontends (Matlab/Python/Julia) are
entirely commonstruct-free -- the generator emits pointwise math (dims passed as
params) and the driver scaffolding (ModelDrivers.cpp) is committed-static and was
already migrated; the frontends write the frozen app.bin. So the C3/S3 field
decomposition required no text2code/frontend code changes beyond these two gaps.

No wire/ABI change. Verified: ctest 13/13; coupling QoI byte-identical (CPU; the
hcommon edits are GPU-only and mechanically identical to the gated migration).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
wraith1995 and others added 30 commits July 3, 2026 11:10
…k export

Per the refined PETSc-export scope, the exported surface is residual + assembled matrix +
matrix-free + block matrices, with easy PETSc wiring -- and nothing else. Remove the Vanka /
PCFIELDSPLIT / DMPlex direction entirely:

- petsc.hpp: delete element_patches / vertex_patches (PCASM Vanka patch builders) and the now-unused
  <set> include. Operator (MatShell res.H + PCShell res.K), ShellMat (wrap any block/operator apply),
  make_mass_inverse, and assemble_matrix (MATAIJ) remain -- that is the whole exported surface. Block
  matrices stay available via export.hpp's element_blocks (u/q D/F/G/C/E/Mass/Minv views).
- petsc_poisson / petsc_heat: drop the (E,F) Vanka showcases; both keep A-D (PETSc residual, L2 error,
  matrix-free ShellMat == op.mat(), assembled MATAIJ == matrix-free). Verified: both PASS.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_019dc1hMgBWTPMZcYoxPYcn4
…rd (⑤ phase 0)

First, boundary-only step of dstype/Int -> template-precision threading: establish the named
precision types and a compile-time safety guard at the EXPORTED interface, with no backend change
(dstype/Int are still global typedefs in ~692 files).

- export.hpp: exasim::floatTy / exasim::intTy aliases -- the canonical names consumers spell
  precision with, instead of the internal dstype / Int.
- petsc.hpp: static_assert(sizeof(PetscScalar)==sizeof(floatTy) && sizeof(PetscInt)==sizeof(intTy)) --
  a wrong-precision PETSc build (single vs double, 32- vs 64-bit indices) is now a compile error
  instead of a silent zero-copy ABI reinterpret. Add scalar_type/index_type members on Operator and
  ShellMat so a consumer can query the exported precision; ShellMat::Apply now spells floatTy.
- docs/internals/precision-threading.md: the phased plan for the full backend refactor (structs ->
  classes -> kernels/BLAS -> codegen -> cutover), each phase kept precision-preserving (rel_L2=0).

Verified: petsc_poisson, petsc_heat, and the robustness harness all still PASS with the guards in.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_019dc1hMgBWTPMZcYoxPYcn4
… phase 1 start)

Phase 1 of dstype->template threading: make the core data structs template<class T=::dstype,
class I=::Int> so precision/index type can eventually be chosen at the type level. Uses the proven
low-risk pattern -- two member `using dstype=T; using Int=I;` aliases shadow the globals so each
struct body is BYTE-FOR-BYTE UNCHANGED, and `using X = XT<::dstype,::Int>;` keeps the old name meaning
exactly what it did. A 3-line edit per struct; nested struct members stay the default alias for now.

Structs done this batch: sysstruct, resstruct, tempstruct, scratcharenastruct.

Verified byte-identical two ways: (1) the robustness harness compiles them header-inline and PASSes;
(2) a full Exasim rebuild + run-app-regression.sh gives the SAME rel_L2 values as before (poisson3d
9.99e-11, isoq3d 1.95e-9, poisson2d 2.29e-12, ...). docs/internals/precision-threading.md records the
pattern + remaining structs (solstruct, masterstruct, meshstruct, appstruct, commonstruct, ...).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_019dc1hMgBWTPMZcYoxPYcn4
…truct/commonstruct (⑤ phase 1 cont.)

Second Phase-1 batch of dstype->template threading: the remaining 6 core data structs get the same
shadow-typedef treatment (template<class T=::dstype, class I=::Int> + `using dstype=T; using Int=I;`
so the bodies are byte-for-byte unchanged, + `using X = XT<::dstype,::Int>;` alias). Applied via a
brace-depth script (each struct's matching `};` found by depth) to keep the edits mechanical + exact.

Structs done this batch: appstruct, masterstruct, meshstruct, solstruct, precondstruct, commonstruct
(commonstruct = the big one CDiscretization holds; its nested param-struct members stay default
aliases for now). 10 core structs now templated.

Verified byte-identical: robustness harness compiles them header-inline (ALL PASS) + full Exasim
rebuild + run-app-regression.sh gives the SAME rel_L2 (poisson3d 9.99e-11, isoq3d 1.95e-9,
poisson2d 2.29e-12). Remaining Phase-1 work (low-value int param structs + view_1d<T> + threading
T,I into nested members) tracked in docs/internals/precision-threading.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_019dc1hMgBWTPMZcYoxPYcn4
…ruct owner)

Phase 2 of dstype->template threading. CDiscretization is the struct OWNER (holds solstruct/resstruct/
commonstruct/... BY VALUE), so templating it is what actually propagates the Phase-1 <T,I> structs.
Unlike CResidual/CAssembler/CPreconditioner/CSolver (already <M> templates that only hold a REFERENCE
to it), it was a plain non-template class with out-of-line methods -- the hard case.

Now `template<class T=::dstype,class I=::Int> class CDiscretizationT`, with
`using CDiscretization = CDiscretizationT<::dstype,::Int>;`. Two mechanisms keep it precision-preserving:

- Shadow aliases inside the class (`using dstype=T; using Int=I;` + `using solstruct=solstructT<T,I>;`
  for every Phase-1 struct) so every member declaration and every out-of-line method body is
  BYTE-FOR-BYTE UNCHANGED yet resolves to the <T,I> instantiations.
- extern template (discretization.h) + explicit instantiation (discretization.cpp) for the
  backend-defined members (file ctor, dtor, finalizeConstruction, compGeometry, compMassInverse,
  DG2CG{,2,3}). This preserves the compile-once TU split: main.cpp builds CSolution<M> whose inline
  ctor/dtor construct/destroy a `CDiscretization disc` by value but never see those out-of-line
  bodies (they live only in the unity ExasimSolver.cpp). The in-memory ctors
  (discretization_inmemory.hpp, consumer-only) stay implicit -- consumer TUs see their defs. The 8
  `class CDiscretization;` forward decls became `template<class,class> class CDiscretizationT; using ...`.

Verified BOTH paths precision-preserving: consumer (in-memory ctor, pure implicit) robustness ALL
PASS; backend (file ctor via extern template -> library) full rebuild + main.cpp link + app-regression
at the SAME rel_L2 (poisson3d 9.99e-11, isoq3d 1.95e-9, poisson2d 2.29e-12). Landed first full build.
Remaining Phase 2 (CResidual/CAssembler/CPreconditioner/CSolver + the exasim shim) tracked in
docs/internals/precision-threading.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_019dc1hMgBWTPMZcYoxPYcn4
…lver (⑤ phase 2 cont.)

Extend the FEM classes to <M, T=::dstype, I=::Int> so they carry the precision alongside the model.
Much easier than CDiscretization: these are already <M> templates (header-inline via the operators.hpp
aggregation, so consumers instantiate them locally -- no extern-template/explicit-instantiation TU
split) that only hold a REFERENCE to the owner. Same shadow-alias pattern (using dstype=T; using Int=I;
using CDiscretization=CDiscretizationT<T,I>; + sysstruct/precondstruct where used) keeps every member
decl and out-of-line body byte-identical; the only explicit edits are the cross-references between
them (CAssembler<M>->CAssembler<M,T,I> etc. in the solve-surface signatures) and the 15 forward
declarations across the backend (solver.h, preconditioner.h, nonlinearsolver.h, solutionwriter.h).

Gotcha fixed: CSolver::gmres returns Int, and an out-of-line return type is looked up in NAMESPACE
scope (not the class), so it had to be spelled `I` (the template param) to match the in-class
Int(=I); parameter types after the `CSolver<M,T,I>::` are in class scope and needed no change. The
ApplyPoly<M> free helpers in gmres.cpp deliberately stay <M> (their default-precision refs match
under T=dstype).

Verified precision-preserving: consumer robustness ALL PASS + full backend rebuild + main.cpp link +
app-regression through the real solver (gmres/PTC) at the SAME rel_L2 (naca0012 2.94e-13, poisson3d
9.99e-11, isoq3d 1.95e-9, poisson2d 2.29e-12). Remaining Phase 2 = the include/exasim shim
(Operator/ShellMat/assemble_matrix take Scalar/Idx). See docs/internals/precision-threading.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_019dc1hMgBWTPMZcYoxPYcn4
…e (⑤)

Final Phase-2 piece: the include/exasim PETSc shim now carries the exported precision as template
parameters. Operator<M, Scalar=floatTy, Idx=intTy>, ShellMat<Scalar=floatTy>, and
make_mass_inverse<M,Scalar,Idx> / assemble_matrix<M,Scalar,Idx>. Operator holds
CDiscretizationT<Scalar,Idx>& + CAssembler<M,Scalar,Idx>& + CPreconditioner<M,Scalar,Idx>& +
sysstructT<Scalar,Idx>&, exposes scalar_type/index_type from the params, and the zero-copy Vec/Mat
wrapping reinterprets PetscScalar/PetscInt as Scalar/Idx.

The old GLOBAL ABI static_assert(sizeof(PetscScalar)==sizeof(floatTy)) is replaced by a
PER-INSTANTIATION static_assert(sizeof(PetscScalar)==sizeof(Scalar)) inside each wrapper -- it fires
only for the precision you actually instantiate, and including petsc.hpp no longer forces a TU-wide
precision match. floatTy IS dstype, so the default Operator<M> = Operator<M,dstype,Int> is
byte-identical; the threading makes precision an explicit, queryable property of the exported type.
Consumers spell bare `ShellMat` as `ShellMat<>` (now a template; CTAD can't deduce Scalar from the
apply lambda) -- fixed in robustness + petsc_poisson.

Verified (header-only shim -> backend app-regression unaffected): robustness ALL PASS + petsc_poisson
(SNES resid 9.08e-14, ShellMat vs matrix-free 0.0, MATAIJ vs matrix-free 2.65e-16) + petsc_heat
(PETSc TS drives the heat loop) -- all PASS, values identical to pre-shim.

Phase 2 complete: structs -> CDiscretization -> FEM classes -> PETSc shim all thread <T,I> with
dstype/Int as defaults; whole stack precision-parameterized + byte-identical under the defaults.
See docs/internals/precision-threading.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_019dc1hMgBWTPMZcYoxPYcn4
… start)

Phase 3 (kernels + BLAS) start. Introduce blas<T> -- a by-value, precision-dispatched CPU BLAS/LAPACK
trait that replaces the per-function `#ifdef USE_FLOAT S.. #else D..` branches with type dispatch
(blas<double> -> dgemm/dgetrf/dgetri/dgemv, blas<float> -> the s* symbols; primary template undefined
so an unsupported precision is a compile error). This is the pblas.h "choke point" the plan calls out:
it lets the pblas wrappers be templated on the scalar type T.

Converted this increment (templated on <class T = dstype>, CPU path via blas<T>): the GEMM node/Gauss
transforms cpuNode2Gauss / Node2Gauss / Gauss2Node, and the LAPACK inverse cpuComputeInverse. Under
the default T=dstype the trait selects exactly the routine the old #ifdef did -> byte-identical.
Callers still pass dstype* buffers, so T deduces to dstype (unchanged). The GPU (cublas/hipblas)
branches remain #ifdef USE_FLOAT-selected for now -- correct under the default, trait-ification for
non-default GPU precision is the remote-verified tail.

docs/internals/precision-threading.md now records the Phase-3 plan + THE CUT: the ExasimDriverABI
function-pointer boundary (include/driver_abi.hpp) is where the frontend-generated dstype kernels stay
frozen while everything below (solver, Discretization kernels, pblas BLAS) can thread T -- so the
Matlab/Python/Julia frontends need NO change. The T-vs-dstype-at-the-ABI decision (static_assert vs a
conversion shim) is documented and deferred.

Verified precision-preserving: robustness ALL PASS + full rebuild + app-regression at the SAME rel_L2
(naca0012 2.94e-13, poisson3d 9.99e-11, isoq3d 1.95e-9, poisson2d 2.29e-12).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_019dc1hMgBWTPMZcYoxPYcn4
…tance A)

Adopt stance "A" for the precision cut at the ExasimDriverABI seam: the frontend-generated kernels are
hard-typed dstype behind the ABI function pointers, so the AbiAdapter branch of EXASIM_DRIVER_CALL
requires the caller's scalar type == the global dstype. Implemented as a forward-activating
static_assert(std::is_same_v<dstype, ::dstype>) in that branch:

- Today (kernels not yet precision-templated) `dstype` there is the global ::dstype -> assert trivially
  true -> byte-identical build.
- Once a kernel is templated with a `using dstype = T;` shadow (rest of Phase 3), the SAME assert
  automatically becomes the T==dstype guard -- a clear diagnostic instead of a raw T*->dstype* type
  error, and the single hook to later swap in a conversion shim (stance B) or Phase-4 templated
  codegen. Mixed precision stays reachable on the concrete-model path (exasim::Name<M>), which the
  PETSc export uses and which never touches the frozen ABI.

Frontends (Matlab/Python/Julia) need NO change: they keep emitting dstype kernels bound as
void(*)(dstype*,...) in ExasimDriverABI. docs/internals/precision-threading.md records A as chosen +
the deferred B/Phase-4 upgrade path (both localized to this one macro branch, so A locks in nothing).

Verified: full backend rebuild (compiles the AbiAdapter path) + app-regression at the SAME rel_L2
(poisson3d 9.99e-11, isoq3d 1.95e-9, poisson2d 2.29e-12). static_assert emits no code -> byte-identical.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_019dc1hMgBWTPMZcYoxPYcn4
…change (⑤ phase 3)

The MPI companion to blas<T>. MPI datatypes were hardcoded two ways, one of them a live bug:
  - pblas.h reductions (PDOT/PGEMTV/PGEMTM): #ifdef USE_FLOAT MPI_FLOAT #else MPI_DOUBLE.
  - halo exchange (matvec/residual/residualeval/ldgblockjacobian/solution/postsolution/
    solutionwriter/setsysstruct): MPI_Isend/Irecv hardcoded MPI_DOUBLE even though the buffers
    (buffsend/buffrecv/tempn/tempg) are dstype* -> a single-precision (USE_FLOAT) build sent float
    buffers tagged MPI_DOUBLE, mis-typing every rank exchange. Pre-existing correctness bug.

Consolidate mpi_type<T>() into backend/Common/common.h (double->MPI_DOUBLE, float->MPI_FLOAT,
int->MPI_INT, long->MPI_LONG) -- there was already an identical copy buried late in parmetisexasim
(.hpp+.cpp), invisible to the earlier halo/reduction sites; removed the duplicate so there is one
definition, early enough for everyone. Collapsed the 3 pblas.h reduction #ifdef blocks and fixed 34
halo-exchange sites to mpi_type<dstype>(): byte-identical for the default double build, correct for
single precision, and forward-compatible (auto-becomes mpi_type<T>() once the comm helpers carry a
`using dstype=T` shadow, like the blas<T> trait + the seam static_assert). ParMETIS MPI_DOUBLE left
alone -- that is the partitioner's real_t weights, independent of dstype.

Verified: CPU-MPI (EXASIM_MPI=ON) full rebuild + multi-rank app-regression at unchanged rel_L2
(naca0012 2.94e-13, poisson3d 9.99e-11, isoq3d 1.95e-9). GPU-MPI verification (dgx-b) next.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_019dc1hMgBWTPMZcYoxPYcn4
…x-b V100)

The whole Phase 0-3 stack is now verified on GPU, not just CPU:
- petsc_poisson (EXASIM_BACKEND=2): exported shim solves HDG Poisson on a V100 (ShellMat vs
  op.mat()=0, MATAIJ 2.6e-16), matching CPU.
- petsc_poisson_mpi (mpiexec -n 1/2/4, backend=2): multi-rank halo exchange through mpi_type<dstype>()
  + ParMETIS partition, all PASS (SNES residual ~3e-12).

Records the dgx-b build/verify loop (sync.sh dgx -> build-dgx.sh -> build-petsc-gpu-consumer.sh /
build-run-petsc-gpumpi.sh) and the GPU-build-with-backend-0 unsupported-combo note. .claude/PLAN.md
refreshed to the current precision-threading state (was stale from the app-baseline task).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_019dc1hMgBWTPMZcYoxPYcn4
… 3 cont.)

Convert the remaining GEMM/GEMV pblas wrappers to template<class T=dstype> + blas<T>: Gauss2Node1
(accumulate gemm), PGEMNV (gemv), PGEMTV (transpose gemv), PGEMTM (transpose gemm) -- the hot
element-block matvec GEMMs. CPU branches now dispatch through blas<T>::gemm/gemv (alpha/beta derefed
into the by-value trait interface); the GPU cublas/hipblas branches are untouched (byte-identical on
GPU, still #ifdef USE_FLOAT-selected). Callers pass dstype* so T deduces to dstype -> identical.

Verified: robustness ALL PASS + full CPU-MPI rebuild + app-regression at the SAME rel_L2 (naca0012
2.94e-13, poisson3d 9.99e-11, isoq3d 1.95e-9, poisson2d 2.29e-12). GPU unaffected (GPU branches not
touched; last GPU/GPU-MPI verify @6f8013df stands).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_019dc1hMgBWTPMZcYoxPYcn4
…ont.)

Template the 8 dstype-typed CPU primitives in cpuimpl.h -- cpuArraySetValue, cpuArrayInsert,
cpuArrayMin/Max, cpuElemGeom, cpuGetElemNodes, cpuApplyGivensRotation, cpuBackSolve -- as
template<class T=dstype> with a `using dstype=T;` body shadow (signatures retyped to T, bodies
unchanged). These are the CPU-side compute primitives the pblas Array ops + discretization geometry
call; callers pass dstype* so T deduces to dstype -> byte-identical. Pure CPU (no #ifdef), fully
local-verifiable.

Verified: robustness ALL PASS + full CPU-MPI rebuild + app-regression at the SAME rel_L2 (cone
1.92e-12, orion 4.31e-14, poisson3d 9.99e-11, isoq3d 1.95e-9, poisson2d 2.29e-12).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_019dc1hMgBWTPMZcYoxPYcn4
…cont.)

The big one. Template all 137 dstype-signature functions in kokkosimpl.h -- the Kokkos device
compute kernels (flux/stabilization/geometry/array ops/thermo/wall-model + the 9
KOKKOS_INLINE_FUNCTION device helpers) -- as template<class Ty=dstype> with a `using dstype=Ty;`
body shadow (signatures retyped, bodies + KOKKOS_LAMBDA parallel_for bodies unchanged). Uses `Ty`
not `T` as the param name because several kernels already use `T` as an identifier (temperature:
`log(T)/T`, etc.). Callers pass dstype* so Ty deduces to dstype -> byte-identical. No functor
structs / no Kokkos::View<> (raw pointers), so the shadow-alias pattern applies uniformly.

Verified CPU: robustness ALL PASS + full rebuild + app-regression at the SAME rel_L2 (naca0012
2.94e-13, nsmach8 3.63e-15, poisson3d 9.99e-11, isoq3d 1.95e-9, poisson2d 2.29e-12). GPU verification
(dgx-b, nvcc device-compiles the templated lambdas) next.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_019dc1hMgBWTPMZcYoxPYcn4
…rified

The compute-primitive layer (pblas GEMM/GEMV via blas<T>, cpuimpl 8 primitives, kokkosimpl 137 device
kernels) now threads T/Ty. GPU-verified on dgx-b: nvcc device-compiles all 137 kokkosimpl templated
lambdas and petsc_poisson backend=2 is byte-identical (residual 9.119e-14, ShellMat 0, MATAIJ 2.62e-16).
Remaining Phase 3: the Discretization/*.hpp kernels, comm-helper using-dstype=T threading, and the
pblas tail (Array ops / dot / GPU trait methods).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_019dc1hMgBWTPMZcYoxPYcn4
…(⑤ phase 3 cont.)

Thread the precision/index template params through every orchestration free
function in backend/Discretization/*.hpp: signatures retype struct params to
XxxstructT<T,I>& and dstype*->T*, bodies open with `using dstype=T;` so bodies
are untouched (byte-identical by construction). Callers already hold
solstructT<T,I> members and call foo<M>(sol,...), so T/I deduce from the struct
args -- no call-site changes. This is what makes the stance-A seam static_assert
auto-activate into the real T==dstype guard (the shadow reaches EXASIM_DRIVER_CALL).

Files: uequation/matvec/qequation/wequation/residual/qoicalculation/massinv/
getuhat (compute kernels) + setstructs (in-memory struct setup for
CDiscretizationT<T,I> construction). IO/mesh-format helpers left alone.

Verified: full rebuild+install+app-regression byte-identical (naca0012 2.94e-13,
nsmach8 3.63e-15, poisson3d 9.99e-11, isoq3d 1.95e-9; 10 pass, 2 pre-existing
np=4 fails).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_019dc1hMgBWTPMZcYoxPYcn4
…er (dgx-b V100)

nvcc device-compiles the templated backend/Discretization/*.hpp headers into the
CUDA Exasim library and the petsc_poisson consumer; backend=2 run byte-identical
(residual 9.119e-14, ShellMat vs op.mat()=0, MATAIJ 2.622e-16).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_019dc1hMgBWTPMZcYoxPYcn4
… phase 3 cont.)

Extend the blas<T> trait with dot/copy/scal/axpy (s*/d* per specialization) and
template the remaining pblas wrappers on <class T=dstype> with a `using dstype=T;`
body shadow: Inverse, PDOT, PNORM (x2), DOT, ArrayCopy, ArrayMultiplyScalar,
ArrayAXPY, ArrayAXPBY, ArrayAX, NORM, PGEMNMStridedBached. Each CPU
`#ifdef USE_FLOAT S* #else D*` branch collapses to one blas<T>::... call
(byte-identical under default T=dstype); PDOT's MPI_Allreduce picks up
mpi_type<T>() via the shadow. GPU cublas/hipblas branches left #ifdef. Callers
pass dstype* so T deduces -- no call-site changes.

Verified: full rebuild+install+app-regression byte-identical (naca0012 2.94e-13,
nsmach8 3.63e-15, poisson3d 9.99e-11, isoq3d 1.95e-9; 10 pass, 2 pre-existing
np=4 fails).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_019dc1hMgBWTPMZcYoxPYcn4
pblas tail nvcc-compiled into CUDA lib + petsc_poisson consumer, backend=2
byte-identical (9.119e-14 / 0 / 2.622e-16). PLAN.md now records the full
compute+orchestration layer as templated and verified on CPU/CPU-MPI/GPU/GPU-MPI;
only the blas<T> GPU trait methods (non-default GPU precision) remain in phase 3.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_019dc1hMgBWTPMZcYoxPYcn4
… test (⑤ phase 4, concrete-model path)

Thread the precision template param through the concrete-model kernel path so a
CONSUMER can instantiate the model + its Kokkos kernels at float32 against the
existing double-default install -- no Exasim rebuild. This is the payoff of the
dstype->T threading: precision is a type-level choice, not a build macro.

- include/exasim/drivers.hpp, include/exasim/kernels/*.hpp: *Driver<M>/*_kernel<M>
  now <M,T=dstype,I=Int>; storage buffers + gather/scatter locals retyped dstype/
  double -> T so a float instantiation is pure-float32 (byte-identical at T=dstype).
- tests/models/poisson2d.hpp: model is now a struct template Poisson2DT<T> with
  `using Poisson2D = Poisson2DT<double>` (methods stay non-template so the
  is_flux_model_v `&M::flux` trait still detects them; double path unchanged).
- tests/consumers/model_fp32: new consumer-side test running Poisson2D flux/source/
  qoi kernels at float AND double via Kokkos (CPU + GPU). CPU PASS: max
  |float-double|/|double| = 1.6e-07 (fp32), double vs analytic 3.2e-16.

The double petsc_poisson consumer recompiles byte-identical (9.084e-14 / 0 /
2.651e-16), confirming the concrete-model path change is precision-preserving.

NOTE: a full float32 SOLVE additionally needs the export/preprocessing boundary
(Preprocessed/make_preprocessed hold double structs) templated + a float PETSc for
the zero-copy PETSc path -- that is the remaining Phase-5 cutover.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_019dc1hMgBWTPMZcYoxPYcn4
… (CPU+GPU)

- tests/consumers/model_fp32/CMakeLists.txt: add EXASIM_GPU option so the test
  builds against the CUDA install (COMPONENTS gpu) with nvcc_wrapper, same as the
  petsc consumers.
- docs: record the concrete-model path (drivers/kernels/model) as float32-verified
  on CPU (Serial) + GPU (CUDA V100), max |float-double| = 1.6e-07; and document the
  Phase-5 boundary a full float32 solve still needs (Preprocessed double structs +
  zero-copy PetscScalar).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_019dc1hMgBWTPMZcYoxPYcn4
…on <T,I> (phase 5, no conversion)

Cross the export/preprocessing boundary WITHOUT a conversion shim: precision flows
as a template param through the whole in-memory bundle path.
- buildstructs.hpp: Preprocessed -> struct template PreprocessedT<T=dstype,I=Int>
  (holds XxxstructT<T,I>); the 6 struct builders + the 2 double-array malloc helpers
  are <T,I> and cast the double Mesh/Master intermediates into T (the intermediates
  stay double -- only the materialized solve-precision structs are T).
- preprocessing.h/.hpp: CPreprocessing::take()/takeParallel() are member templates
  <T=dstype,I=Int> returning PreprocessedT<T,I> and calling buildXStruct<T,I>.
- export.hpp: make_preprocessed<M,T=floatTy,I=intTy> returns PreprocessedT<T,I>.
- discretization.h/_inmemory.hpp: in-memory ctor param Preprocessed&& -> PreprocessedT<T,I>&&.

Defaults reproduce the old dstype/Int signatures exactly -> double path byte-identical
(petsc_poisson consumer 9.084e-14 / 0 / 2.651e-16, unchanged).

REMAINING for a full float SOLVE (Phase 5 backend cutover): the in-memory ctor's
setup chain -- cpuInitInMemory + setstructs.cpp (setcommonstruct/cpuInitSetup/
setAppRuntimeContext) + connectivity buildConn -- and the CDiscretizationT backend
members (finalizeConstruction/compGeometry/compMassInverse/DG2CG) are compiled/
instantiated at dstype only in the unity build; a float disc needs them at float.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_019dc1hMgBWTPMZcYoxPYcn4
…se 5 cutover)

The consumer aggregates the backend source (operators.hpp #includes discretization.cpp
-> connectivity.cpp + setstructs.cpp), so templating those .cpp lets a consumer
instantiate CDiscretizationT<T> from source -- no lib rebuild, no float explicit
instantiation (the extern template only pins dstype).

- setstructs.cpp: setcommonstruct/cpuInitSetup/setAppRuntimeContext/cpuInit/dev*/gpuInit
  etc. -> <T=dstype,I=Int> (shadow-alias). connectivity.cpp: buildConn + 32 more.
- preprocessing.cpp: take()/takeParallel() -> member templates (mirror preprocessing.hpp),
  matching the templated CPreprocessing decl.
- discretization_inmemory.hpp: cpuInitInMemory + its cpuInitSetup/setAppRuntimeContext
  forward-decls -> templates.
- readbinaryfiles.cpp: readInput's setAppRuntimeContext forward-decl -> template.
- Fixed thread_disc regex miss: xiny2 (template<typename T>, no space) had a spurious
  2nd template head injected; removed.

Verified: full lib rebuild + app-regression 12/12 PASS (isoq/sharpb2 np=4, previously
RUN_FAIL, now build+run from the consistent source; all rel_L2 << 1e-8, within the
harness's documented ~1e-9 FP-reorder band). Double petsc_poisson consumer byte-identical
(9.084e-14 / 0 / 2.651e-16).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_019dc1hMgBWTPMZcYoxPYcn4
…s/geometry helpers)

Continue the phase-5 cutover toward a full float32 solve as a consumer:
- setsysstruct.cpp: setsysstruct + 2 -> <T=dstype,I=Int>.
- export.hpp: recover_volume / eval_qoi<M> -> <T,I> (so a float disc's volume field +
  QoI L2 error are reachable).
- connectivity.cpp: fetch_node / equal_row_tol -> template<class T> (coord helpers).
- ismeshcurved.cpp: IsMeshCurved -> <T,I>.
- discretization.cpp: BuildElementBlockBoundaryFaces / AllocateLDGBlockJacobianMemory -> <T,I>.

All byte-identical for the default dstype path: petsc_poisson consumer still PASSES
(9.084e-14 / 0 / 2.651e-16), and the full app-regression is 12/12.

SCOPE FINDING: a full float32 *solve* consumer (instantiating CDiscretizationT<float>
end-to-end) surfaces ~167 precision-mixing sites across the whole backend -- ArraySetValue
(48), PGEMNMStridedBached (34), ArrayAXPBY, the geometry (ElemGeom/FaceGeom), every model
kernel, the residual (Ru/Rq Elem/Face), the preconditioner, and the non-templated CWallModel
class. Each is a templated caller passing a double literal/constant mixed with T buffers.
This is a backend-wide cutover (not "one lib rebuild"); the preprocessing + in-memory setup
boundary is done + verified, and this is the remaining, larger phase-5 body.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_019dc1hMgBWTPMZcYoxPYcn4
…ve = ~167-site backend cutover

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_019dc1hMgBWTPMZcYoxPYcn4
… works via templates

The ~167 precision-mixing sites are resolved; a consumer now instantiates
CDiscretizationT<float> end-to-end and runs a full HDG Poisson SOLVE, choosing
precision purely via templates (no Exasim rebuild, no conversion shim). Root fixes,
by category:

- noDeduce_t<T> (common.h): scalar function params spelled noDeduce_t<T> deduce T only
  from the buffer args, so a double literal/constant (0.0, `zero`, `minusone`, a
  tolerance) passed next to a float* buffer converts instead of forcing a conflicting
  deduction. Applied to the pblas/kokkosimpl array+blas ops (ArraySetValue,
  PGEMNMStridedBached, ArrayAXPBY/AXY/AXPB, UpdateUDG, ApplyDtcoef, StgInflow*, ...).
- drivers.hpp: *_kernel<M>(...) -> <M,T>; compute_shape templated; the interface
  Fint/Fext forwarders deduce the scalar from the first buffer arg and compile out for
  non-default precision (coupling is double-only; single-model solves never hit an
  interface face).
- kernels/*.hpp: retype the gather/scatter locals double->T in eos/sourcew/tdfunc/
  interface/output/visualization; init.hpp InitFn -> InitFn_t<T> (+ init_dispatch reorder).
- model.hpp: the ModelDefaults hierarchy (8 mixins) + zero_fill templated on scalar, so
  a float model's inherited defaults (initudg/sourcew/...) are float-typed; poisson2d.hpp
  inherits ModelDefaults<Poisson2DT<T>, T>.
- geometry.hpp (the .hpp is the live one), uresidual/qresidual.hpp, matvec.hpp
  (diag_blocks), connectivity.cpp (make_cell_key/xiny/fetch_node/equal_row_tol/permindex
  coord locals), discretization.cpp (crs_init, Build*/Allocate* helpers; wall-model branch
  if-constexpr-guarded to dstype), setprecondstruct.cpp -> all <T,I>.
- common.h: qoiparamsstruct/wallmodelparamsstruct -> templated sub-structs.

tests/consumers/solve_fp32: builds the in-memory HDG Poisson stack at double AND float,
condenses to H*uh=b, LAPACK-solves, and compares the float trace solution to the double
one. PASS: ||uh_f32 - uh_f64|| / ||uh_f64|| = 1.01e-4 (fp32 dense-LU accuracy).

Double path fully intact: full lib rebuild + app-regression 12/12 PASS (unchanged
rel_L2 << 1e-8); petsc_poisson consumer byte-identical.

KNOWN FOLLOW-UP: the float32 eval_qoi volume-integral readback underflows to ~0
(qoi[1]=1e-19 vs 0.405) -- a post-processing quadrature-precision quirk; the SOLVE
itself is correct (float ||uh|| matches double to fp32). Tracked separately.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_019dc1hMgBWTPMZcYoxPYcn4
…known qoi follow-up)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_019dc1hMgBWTPMZcYoxPYcn4
…OT read as garbage)

blas<float>::dot called the BLAS sdot_, but on several platforms (macOS Accelerate,
f2c/CLAPACK) the F77 single-precision REAL function sdot_ returns a *double* in the
ABI while common.h declares `float SDOT(...)`, so the return was misread as garbage
(~1e-19). This silently zeroed every single-precision PDOT/DOT/NORM -> the float32
eval_qoi volume integral read ~0 (qoi[1]=1e-19 instead of 0.405), even though the
solve itself was correct.

Fix: compute blas<float>::dot with a manual loop (accumulate in double for a
well-conditioned reduction), sidestepping the sdot_ ABI landmine. DDOT (double) is
untouched, so the default path is byte-identical.

Now the float32 solve_fp32 QoI is correct: qoi[1]=0.405249 (vs double 0.405285,
fp32 precision), qoi[0]=2.45e-9. Consumer still PASS (uh vs double = 1.01e-4).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_019dc1hMgBWTPMZcYoxPYcn4
…one, sdot_ ABI bug fixed

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_019dc1hMgBWTPMZcYoxPYcn4
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant