Teoc properly separate out#28
Draft
wraith1995 wants to merge 127 commits into
Draft
Conversation
…tatestruct Carve the iterative-solver and reduced-basis preconditioner transient state out of the commonstruct god-struct into a named, nested solverstatestruct member. Eight mutable fields move: RBcurrentdim, RBremovedind, Wcurrentdim, linearSolverIter, nonlinearSolverIter, linearSolverTolFactor, linearSolverRelError, PTCparam. 142 access sites migrated common.X -> common.solverstate.X across both the ABI (.cpp) and template (.hpp) implementations; commonstruct::printinfo bare references fixed. This names and isolates the solver's mutable runtime state from the read-only configuration it used to ride inside (first cut of the concern-separation refactor; the pervasive time-stepping state time/currentstep/currentstage/ dtfactor is deferred to a TimeState group in C3). No wire-format or driver-ABI change. Verified: frontend+consumer ctest 13/13; poisson coupling QoI byte-identical (1.806679e-01 3.871878e-01). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
First step of the "QoI template, many instances" model. Add qoiinstancestruct (name, kind, boundary, offset, ncomp) and a commonstruct.qoiinstances list; setstructs synthesizes the historical default (one Domain instance over all volume QoIs, one Boundary instance over all surface QoIs on common.ibs). The QoI column headers and value rows are now written by shared instance-driven helpers writeQoIHeader/writeQoIRow, replacing the six hand-rolled nvqoi/nsurf loops across both the ABI (.cpp) and template (.hpp) solution/postsolution paths. Compute (qoiElement/qoiFace) is unchanged here; this is the data-model + output seam that lets a program register multiple named QoI instances. Default output is byte-identical. Verified: ctest 13/13; poisson coupling QoI byte-identical (Domain_QoI1=1.806679e-01, Domain_QoI2=3.871878e-01, Boundary_QoI1=0). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Complete the "QoI template, many instances" model with a user-facing handle: ExasimSolver::AddQoI(modelindex, name, kind, boundary, offset, ncomp) and ClearQoI(modelindex), callable after Initialize and before Solve. A program can now register several independently-named QoIs, each owning a slice of the model QoI-kernel output; out-of-range slices are rejected so blind registration is safe. The QoI header line is now written lazily on the first SaveQoI row (writeQoIHeaderOnce) instead of at file-open, so instances registered after model init are reflected in the columns (byte-identical for the default instances). The builtin consumer gains an EXASIM_QOI_DEMO path that registers three named instances (KineticEnergy, Dissipation, WallFlux) purely through the public API; output header becomes "KineticEnergy1 Dissipation1 WallFlux1" vs the default "Domain_QoI1 Domain_QoI2 Boundary_QoI1", same values. Deferred (tracked): per-instance multi-boundary compute (needs multi-ib QoI kernels from the frontend) and MPI_Allreduce of QoI values (collective-safety review). Verified: ctest 13/13; coupling QoI byte-identical; demo confirmed. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
First twin unified. The non-templated ABI qoicalculation.cpp is removed; the templated qoicalculation.hpp is now the single source, instantiated with exasim::detail::AbiAdapter on the production path and with a concrete model M on the compile-time-specialized path (facade consumer). Mechanism (reusable for the remaining 24 twins): - commonstruct gains an ExasimDriverABI* driver_abi, set by CDiscretization, so the templated FEM code can reach the runtime ABI without threading it. - model_drivers_abi.cpp gains no-driver_abi overloads of QoIvolumeDriver/ QoIboundaryDriver that recover the ABI from common.driver_abi and forward to the explicit-ABI versions; the EXASIM_DRIVER_CALL AbiAdapter branch binds to these. - non-templated callers (solution.cpp, postsolution.cpp) call qoiElement/qoiFace with explicit <AbiAdapter>. - install/CMakeLists.txt puts the repo root on the in-tree include path so the <exasim/...> umbrella headers resolve (matching out-of-tree consumers). Verified: build green; ctest 13/13 (incl. facade template-path consumer); coupling QoI byte-identical (1.806679e-01 / 3.871878e-01). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…vers Scaffolding for the twin sweep. Adds inline no-driver_abi overloads for the 36 solve-loop kernel drivers (Flux/Source/Fbou/Ubou/Fhat/Uhat/Tdfunc/Avfield/EoS/ Vis*/QoI*/Fint/Fext/Output/Monitor/... and their Jacobian variants) that recover the model ABI from common.driver_abi and forward to the explicit-ABI versions. These bind the EXASIM_DRIVER_CALL AbiAdapter branch, so unifying each twin onto its templated .hpp no longer needs per-file overloads -- only an include switch, <AbiAdapter> on non-templated callers, and deleting the .cpp. Generated mechanically from the explicit-ABI signatures (uniform: every solve-loop driver takes commonstruct& common). The 10 init-path drivers (Init*/cpuInit*) take no common and are handled when their files are unified. Replaces the two hand-written QoI overloads from the pilot. No behavior change (overloads are additive; only qoicalculation uses them so far). Verified: build green; ctest 13/13. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Remove getuhat.cpp; getuhat.hpp is the single source (GetUhat/GetdUhat/UhatBlock/ dUhatBlock). residual.cpp now includes the .hpp; its 11 GetUhat/GetdUhat call sites plus 3 in ldgblockjacobian.cpp and 1 in discretization.cpp now use explicit <exasim::detail::AbiAdapter> (dropping the driver_abi argument; UbouDriver is served by the no-driver_abi overload). Verified: ctest 13/13; coupling QoI byte-identical. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Remove qresidual.cpp; templated qresidual.hpp is the single source (RqElem/RqFace/ dRqElem/dRqFace + blocks). residual.cpp includes the .hpp and its calls use explicit <exasim::detail::AbiAdapter>. qresidual uses no kernel drivers. ctest 13/13; coupling byte-identical. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Remove uresidual.cpp; templated uresidual.hpp is the single source. Includers switched to the .hpp; non-templated callers use <exasim::detail::AbiAdapter> (kernel drivers served by the no-driver_abi overloads). ctest 13/13; coupling byte-identical. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…d sources The solve-path files are mutually recursive via cross-file <M> template calls (matvec->uequation/residual, uequation->wequation, residual->wequation/qequation), so they can't be unified independently -- landed as one atomic cluster: remove wequation/qequation/uequation/residual/matvec/massinv .cpp; the templated .hpp are the single source; non-templated callers (solution, discretization, postdiscretization, ldgblockjacobian, gmres, ptcsolver, preconditioner, ...) use <exasim::detail::AbiAdapter> with the driver_abi argument dropped. Fixes uncovered en route: - EXASIM_LEGACY_W_CALL now dispatches the HDG w-equation source through common.driver_abi for the AbiAdapter path (was a bare global call that resolved to a per-model kernel, wrong for runtime ABI dispatch); wequation.hpp now includes the dispatch header so the macro is defined. - Reconciled twin drift: restored EnsureTemplateAllocation (lived in qequation.cpp, absent from .hpp) into qequation.hpp. - massinv is a non-template helper, not a template twin: its .cpp carried an unused driver_abi param; callers now use massinv.hpp's signature (no driver_abi, no template argument). - caller rewriter now strips qualified disc.driver_abi / this->driver_abi too. Verified: ctest 13/13; coupling QoI byte-identical (1.806679e-01 / 3.871878e-01). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The backend class .hpp (discretization, solution, solver, preconditioner, visualization) and their .hpp helpers (postdiscretization, postsolution, avsolution, previoussolutions, timestepcoeff, updatesolution, updatesource, gmres, ptcsolver, getpoly, postsolver, setsysstruct, postpreconditioner, applymatrix, setprecondstruct) are DEAD: nothing in the cmake build includes them (production compiles the non-template .cpp/.h classes; the live compile-time-specialized template path is the separate include/exasim/ stack used by the facade via <exasim/model.hpp>). Proven: the library builds with #error probes in all five class .hpp and zero hits; every includer of these files is itself in the dead set. These are an abandoned earlier template attempt -- the very ABI/template twin duplication this refactor targets -- and several don't even compile (they define CFoo<M>:: methods against a non-template class CFoo). Removing them eliminates the stale duplication for the class layer (the free-function layer was unified onto its templates earlier). Production classes stay non-template (runtime ABI dispatch), which already satisfies "CSolver/CPreconditioner are not templated on the model". 20 files removed. No build impact: ctest 13/13; coupling QoI byte-identical. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Last two true ABI/template twins. pointlocator.hpp/pointwallmodel.hpp are dead (in no depfile; included only by each other). Removing them completes the twin handling. No build impact: ctest 13/13; coupling QoI byte-identical. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…monstruct Group the mutable time-stepping runtime fields (currentstep, currentstage, dtfactor, time) into a nested timestatestruct member, alongside C1's solverstate. Together they isolate the simulation's mutable runtime state from the read-only configuration in the god-struct. 413 access sites migrated common.X -> common.timestate.X across backend/ and include/exasim/ (the live template stack, compiled via the unified free-function headers); printinfo bare refs fixed. First commonstruct-decomposition step on the now-unified (post-U) single codebase. No wire/ABI change. Verified: ctest 13/13; coupling QoI byte-identical. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Group the physics/model configuration (appname, source, convStabMethod, diffStabMethod, rotatingFrame, viscosityModel, SGSmodel, ALEflag, ncAV, AVsmoothingIter, frozenAVflag, AVdistfunction, rampFactor, tau0) into a nested physicsparamsstruct member. 75 access sites migrated common.X -> common.physicsparams.X across backend/ and include/exasim/; printinfo bare refs fixed. (read_uh deferred: it collides with an appstruct field.) No wire/ABI change. Verified: ctest 13/13; coupling QoI byte-identical. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Group the iterative-solver configuration (linearSolver, nonlinearSolver, linear/nonlinearSolverMaxIter, matvecOrder, gmresRestart, gmresOrthogMethod, preconditioner, precMatrixType, ptcMatrixType, RBdim, Wdim, matvecTol, linear/nonlinearSolverTol) into a nested solverparamsstruct member. 96 access sites migrated common.X -> common.solverparams.X across backend/ and include/exasim/; printinfo bare refs fixed. (Mutable per-solve counters stay in solverstate from C1.) No wire/ABI change. Verified: ctest 13/13; coupling QoI byte-identical. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Group the time-integration / problem-evolution configuration (temporalScheme, torder, tstages, tsteps, dae_steps, dae_alpha/beta/gamma/epsilon, tdep, wave, tdfunc, linearProblem, subproblem) into a nested timeparamsstruct member. 248 access sites migrated common.X -> common.timeparams.X across backend/ and include/exasim/; printinfo bare refs fixed. (Mutable step/stage counters stay in timestate; dt/dae_dt are appstruct raw-wire pointers, untouched.) No wire/ABI change. Verified: ctest 13/13; coupling QoI byte-identical. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Group the QoI / visualization-output configuration (nsca, nvec, nten, nsurf, nvqoi, saveParaview, ibs, the qoivolume/qoisurface accumulation buffers, and the C2 qoiinstances vector) into a nested qoiparamsstruct member. 109 access sites migrated common.X -> common.qoiparams.X across backend/ and include/exasim/; the writeQoI* helpers + freememory() + printinfo bare refs fixed. Folds the C2 instance machinery under the same concern. No wire/ABI change. Verified: ctest 13/13; coupling QoI byte-identical. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Group the interface-coupling configuration (ncuext, coupledinterface, coupledcondition, coupledboundarycondition, FextCall, ncie, extUhat, extFhat, extStab, ninterfacefaces, ndofuhatinterface, nintfaces, nvindx) into a nested couplingparamsstruct member. 135 access sites migrated common.X -> common.couplingparams.X across backend/ and include/exasim/; printinfo bare refs fixed. The raw interface arrays (vindx/interfacefluxmap/intepartpts) are shared with appstruct and left in place; wall-model and synthetic-turbulence fields are separate concerns (left for their own groups). No wire/ABI change. Verified: ctest 13/13; coupling QoI byte-identical. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Group the output/checkpoint/IO configuration (saveSolFreq, saveSolOpt, saveRestart, saveSolBouFreq, timestepOffset, compudgavg, readudgavg, saveResNorm, fileoffset, debugMode) into a nested outputparamsstruct member. 228 access sites migrated common.X -> common.outputparams.X across backend/ and include/exasim/; printinfo bare refs fixed. No wire/ABI change. Verified: ctest 13/13; coupling QoI byte-identical. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Two small feature concerns: wallmodelparamsstruct (nwm, szwmModelIDs, szwmBoundaries, szwmDistances, wmModelIDs, wmBoundaries, wmDistances) and stgparamsstruct (stgNmode, nstgib, stgib). 38 access sites migrated common.X -> common.wallmodelparams.X / common.stgparams.X across backend/ and include/exasim/; printinfo + freememory bare refs fixed (region-guarded so the appstruct-mirrored raw tables are untouched). No wire/ABI change. Verified: ctest 13/13; coupling QoI byte-identical. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Group the ~17 degree-of-freedom counts (ndof, ndofq, ndofw, ndofuhat, ndofudg, ndofsdg, ndofodg, ndofedg, ndofbou, ndofucg, and the ndof*1 interior variants) into a nested sizesstruct member -- the derived "sizes view": cross-products of the base dimensions, computed once in setstructs and read throughout. 225 access sites migrated common.X -> common.sizes.X across backend/ and include/exasim/; printinfo bare refs fixed. Kept as STORED (not recomputed on access): ndof1* switch between ne1 and ne by mesh branch, and ndofucg/ndofbou are not simple products, so a naive computed view would be wrong. Base dimensions (nd, nc-family, npe, ne, ...) stay top-level as the fundamental shape. Completes C3: commonstruct now opens with 11 nested concern groups (solverstate, timestate, physicsparams, solverparams, timeparams, qoiparams, couplingparams, outputparams, wallmodelparams, stgparams, sizes). No wire/ABI change. Verified: ctest 13/13; coupling QoI byte-identical. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The LDG block-Jacobi preconditioner computation lived in CDiscretization but is a preconditioner responsibility. Moved the method to CPreconditioner (it takes the discretization by reference and reads disc.sol/res/app/master/mesh/tmp/common, exactly as before). BlockJacobianLDG/mpiBlockJacobianLDG remain in ldgblockjacobian.cpp and stay visible via the unity build (ExasimSolver.cpp includes discretization.cpp before preconditioner.cpp). Callers in CSolution::DIRK updated disc.ComputeLDGPreconditioner(...) -> prec.ComputeLDGPreconditioner(disc, ...). First C4 re-homing. (res.K is still shared scratch in resstruct -- moving the preconditioner matrix storage into precondstruct is a deeper follow-up.) No wire/ABI change. Verified: ctest 13/13; coupling QoI byte-identical. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Mirror the commonstruct C3 decomposition in the model templating system. The
monolithic CRTP base ModelDefaults<Self> (model.hpp) provided zero/identity
defaults for every concern in one struct; split it into a chain of per-concern
default mixins, each adding ONE concern's surface:
ModelConstants<Self> (disc, nco)
-> VolumeDefaults<Self> (source, sourcew, tdfunc, avfield)
-> BoundaryDefaults<Self> (ubou, fbou, fbou_hdg, fhat, uhat, stab)
-> EoSDefaults<Self> (eos, eos_du, eos_dw)
-> InitDefaults<Self> (initq, initwdg, initudg, initodg)
-> OutputDefaults<Self> (vis_*, qoi_*, monitor, output)
-> HDGJacobianDefaults<Self> (the HDG pointwise Jacobians)
ModelDefaults<Self> : HDGJacobianDefaults<Self> {} // composite, public API
Prerequisite: extracted the zero_fill_ member helper into a free
detail::zero_fill<N>() so the mixins can call it without depending on a shared
base (avoids two-phase lookup on a dependent base). Users still derive from
ModelDefaults<Self> and inherit every default unchanged; each default now lives
in the mixin for its concern, and a kernel/class can name just the mixin whose
surface it needs. All defaults reference constants via Self:: only, so the chain
is behavior-preserving.
No wire/ABI change. Verified: ctest 13/13 (facade consumer compiles the templated
model path); coupling QoI byte-identical.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…s concern Propagate the model split to the kernels. Added void_t method detectors + 13 per-capability concept traits in model.hpp (is_flux/source/sourcew/tdfunc/avfield/ eos/boundary/hdg_boundary/interface/init/qoi/output/vis_model_v), each = is_model_v && the methods that concern's kernels actually call. Replaced the blanket static_assert(is_model_v<M>) in every kernel (include/exasim/kernels/*.hpp) with the narrow trait for the concern it consumes: flux->is_flux, boundary fbou/ubou->is_boundary and the hdg_fbou variants->is_hdg_boundary, eos*->is_eos and avfield->is_avfield, etc. Models deriving ModelDefaults satisfy all traits (defaults provide every method), so this is behavior-preserving; the value is precise concept diagnostics for hand-rolled models and self-documenting each kernel's required surface. No wire/ABI change. Verified: ctest 13/13 (facade compiles templated kernels); coupling QoI byte-identical. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
writeQoIHeader/writeQoIRow/writeQoIHeaderOnce (common.h) took the whole commonstruct but use only common.qoiparams -> take const qoiparamsstruct&. 4 call sites pass disc.common.qoiparams. Clean leaf narrowing (the helper's dependency is now exactly the QoI concern). Scoping note: the other Scan-2 candidates (MatVec, ComputeMinv, the geometry helpers) are ORCHESTRATORS that forward common/cublasHandle to inner kernels, so narrowing them isn't a clean leaf change (it cascades) -- the real bulk of "stop passing common to things that don't need it" is the model-driver layer (S3), where common is unpacked to base dims and forwarded to the Kokkos* kernels. No wire/ABI change. Verified: ctest 13/13; coupling QoI byte-identical. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Per the directive to place intermediate structs (rather than passing the whole commonstruct, or loose dim params, to the kernel/driver layer): group the component counts of the DG fields (nc, ncu, ncq, ncw, nco, nch, ncx, ncs, nce, ncm, ntau) into a nested componentsstruct member. This is the "field shape" the kernels and *Driver() functions template their loops on, and the deferred C3 base-dimension group. 2246 access sites migrated common.X -> common.components.X across backend/ and include/exasim/ (the hot nc/ncu/ncw/nco/ncx fields); printinfo bare refs fixed. ncuext already lives in couplingparams. Drivers/kernels now reach the component counts via the cohesive intermediate struct; a follow-on can narrow driver signatures to take componentsstruct (+ a grid-sizes struct) instead of commonstruct. No wire/ABI change. Verified: ctest 13/13; coupling QoI byte-identical. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… struct Second intermediate dim struct: gridstruct (nd, elemtype, nodetype, porder, pgauss, npe, npf, nge, ngf, np1d, ng1d, curvedMesh) -- the master-element shape the kernels loop over. 1105 access sites migrated common.X -> common.grid.X across backend/ and include/exasim/; printinfo bare refs fixed. (ppdegree left out -- it's a preconditioner/solver param, not a grid size; appstruct's porder pointer is a different field, untouched.) With componentsstruct, the dims the kernels/drivers consume now live in two cohesive intermediate structs (common.components + common.grid). No wire/ABI change. Verified: ctest 13/13; coupling QoI byte-identical. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Third intermediate dim struct: meshsizesstruct (maxnbc, ne, nf, nv, nfe, nbe, neb, nbf, nfb, the nbe0/1/2 + nbf0/1 block splits, ne0/1/2, nf0) -- the mesh element/face/block counts the kernels iterate over. 909 access sites migrated common.X -> common.meshsizes.X across backend/ and include/exasim/; printinfo bare refs fixed, including the eblks/fblks/nboufaces array-printing loops (bare nbe/nbf/ maxnbc as loop bounds, not caught by the single-arg printf pass). nse/nese/nfse/nnz left out (superelement-CRS/preconditioner sizes, not mesh counts). S3 outcome: rather than narrowing the ~200 drivers to loose dim params (verbose, ~160 risky lockstep twin signatures), the base dimensions the kernels/drivers consume are now grouped into three cohesive INTERMEDIATE STRUCTS -- common.components (DG-field component counts), common.grid (reference-element sizes), common.meshsizes (mesh partition counts) -- 4260 sites total. Drivers reference these (common.components.nc, common.grid.nd, common.meshsizes.ne) instead of loose god-struct fields. No wire/ABI change. Verified: ctest 13/13; coupling QoI byte-identical. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The PTC convergence-monitor field is a solver-convergence artifact, not a discretization quantity. Moved evalMonitor to CSolution (which owns disc as a member, so it needs no disc& param -- cleaner than the preconditioner move). The two callers in CSolution::SteadyProblem_PTC change disc.evalMonitor(...) -> evalMonitor(...). Removed from CDiscretization (discretization.h/.cpp). Mirrored in the postprocess twins for parity (postdiscretization -> postsolution, no-driver_abi MonitorDriver variant). NB: the postprocess tool (postprocess.cpp) is not a cmake build target, so those twins are not gated -- the mirror is best-effort hygiene; the main path is verified. No wire/ABI change. Verified (main path): ctest 13/13; coupling QoI byte-identical. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Completes S4. The output field is an I/O concern, not a discretization quantity. Moved evalOutput (the 58-line MPI-halo udg exchange + model OutputDriver call) to CSolution (uses the owned disc's structs; no disc& param). Callers in CSolution::SaveOutputCG (solution.cpp:1363,1374) change disc.evalOutput(...) -> evalOutput(...). Removed from CDiscretization (discretization.h/.cpp). Mirrored in the postprocess twins (postdiscretization -> postsolution, 4 callers, no-driver_abi OutputDriver variant); postprocess is not a cmake target so that mirror is best-effort. Main path verified. With evalMonitor (f353af6), CDiscretization is shed of its two output/monitor concerns; both now live in CSolution next to the other save/output methods. No wire/ABI change. Verified (main path): ctest 13/13; coupling QoI byte-identical. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…e app
Two propagation gaps the `\bcommon\.` migration missed:
- hcommon.<field> (the GPU host-copy commonstruct in setstructs/discretization/
postdiscretization, inside #ifdef HAVE_CUDA blocks the CPU gate doesn't compile):
the perl word-boundary skipped the `hcommon` prefix. 6 refs migrated
hcommon.{nfe,ne,...} -> hcommon.{meshsizes,grid,components}.X. A GPU build would
otherwise fail to compile. (hcommon touches only dim fields; verified no
earlier-group fields remain.)
- apps/library_example/solve_square: common.nc -> common.components.nc (the migration
only scanned backend/ + include/).
Investigation result: text2code and the frontends (Matlab/Python/Julia) are
entirely commonstruct-free -- the generator emits pointwise math (dims passed as
params) and the driver scaffolding (ModelDrivers.cpp) is committed-static and was
already migrated; the frontends write the frozen app.bin. So the C3/S3 field
decomposition required no text2code/frontend code changes beyond these two gaps.
No wire/ABI change. Verified: ctest 13/13; coupling QoI byte-identical (CPU; the
hcommon edits are GPU-only and mechanically identical to the gated migration).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…k export Per the refined PETSc-export scope, the exported surface is residual + assembled matrix + matrix-free + block matrices, with easy PETSc wiring -- and nothing else. Remove the Vanka / PCFIELDSPLIT / DMPlex direction entirely: - petsc.hpp: delete element_patches / vertex_patches (PCASM Vanka patch builders) and the now-unused <set> include. Operator (MatShell res.H + PCShell res.K), ShellMat (wrap any block/operator apply), make_mass_inverse, and assemble_matrix (MATAIJ) remain -- that is the whole exported surface. Block matrices stay available via export.hpp's element_blocks (u/q D/F/G/C/E/Mass/Minv views). - petsc_poisson / petsc_heat: drop the (E,F) Vanka showcases; both keep A-D (PETSc residual, L2 error, matrix-free ShellMat == op.mat(), assembled MATAIJ == matrix-free). Verified: both PASS. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_019dc1hMgBWTPMZcYoxPYcn4
…rd (⑤ phase 0) First, boundary-only step of dstype/Int -> template-precision threading: establish the named precision types and a compile-time safety guard at the EXPORTED interface, with no backend change (dstype/Int are still global typedefs in ~692 files). - export.hpp: exasim::floatTy / exasim::intTy aliases -- the canonical names consumers spell precision with, instead of the internal dstype / Int. - petsc.hpp: static_assert(sizeof(PetscScalar)==sizeof(floatTy) && sizeof(PetscInt)==sizeof(intTy)) -- a wrong-precision PETSc build (single vs double, 32- vs 64-bit indices) is now a compile error instead of a silent zero-copy ABI reinterpret. Add scalar_type/index_type members on Operator and ShellMat so a consumer can query the exported precision; ShellMat::Apply now spells floatTy. - docs/internals/precision-threading.md: the phased plan for the full backend refactor (structs -> classes -> kernels/BLAS -> codegen -> cutover), each phase kept precision-preserving (rel_L2=0). Verified: petsc_poisson, petsc_heat, and the robustness harness all still PASS with the guards in. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_019dc1hMgBWTPMZcYoxPYcn4
… phase 1 start) Phase 1 of dstype->template threading: make the core data structs template<class T=::dstype, class I=::Int> so precision/index type can eventually be chosen at the type level. Uses the proven low-risk pattern -- two member `using dstype=T; using Int=I;` aliases shadow the globals so each struct body is BYTE-FOR-BYTE UNCHANGED, and `using X = XT<::dstype,::Int>;` keeps the old name meaning exactly what it did. A 3-line edit per struct; nested struct members stay the default alias for now. Structs done this batch: sysstruct, resstruct, tempstruct, scratcharenastruct. Verified byte-identical two ways: (1) the robustness harness compiles them header-inline and PASSes; (2) a full Exasim rebuild + run-app-regression.sh gives the SAME rel_L2 values as before (poisson3d 9.99e-11, isoq3d 1.95e-9, poisson2d 2.29e-12, ...). docs/internals/precision-threading.md records the pattern + remaining structs (solstruct, masterstruct, meshstruct, appstruct, commonstruct, ...). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_019dc1hMgBWTPMZcYoxPYcn4
…truct/commonstruct (⑤ phase 1 cont.) Second Phase-1 batch of dstype->template threading: the remaining 6 core data structs get the same shadow-typedef treatment (template<class T=::dstype, class I=::Int> + `using dstype=T; using Int=I;` so the bodies are byte-for-byte unchanged, + `using X = XT<::dstype,::Int>;` alias). Applied via a brace-depth script (each struct's matching `};` found by depth) to keep the edits mechanical + exact. Structs done this batch: appstruct, masterstruct, meshstruct, solstruct, precondstruct, commonstruct (commonstruct = the big one CDiscretization holds; its nested param-struct members stay default aliases for now). 10 core structs now templated. Verified byte-identical: robustness harness compiles them header-inline (ALL PASS) + full Exasim rebuild + run-app-regression.sh gives the SAME rel_L2 (poisson3d 9.99e-11, isoq3d 1.95e-9, poisson2d 2.29e-12). Remaining Phase-1 work (low-value int param structs + view_1d<T> + threading T,I into nested members) tracked in docs/internals/precision-threading.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_019dc1hMgBWTPMZcYoxPYcn4
…ruct owner)
Phase 2 of dstype->template threading. CDiscretization is the struct OWNER (holds solstruct/resstruct/
commonstruct/... BY VALUE), so templating it is what actually propagates the Phase-1 <T,I> structs.
Unlike CResidual/CAssembler/CPreconditioner/CSolver (already <M> templates that only hold a REFERENCE
to it), it was a plain non-template class with out-of-line methods -- the hard case.
Now `template<class T=::dstype,class I=::Int> class CDiscretizationT`, with
`using CDiscretization = CDiscretizationT<::dstype,::Int>;`. Two mechanisms keep it precision-preserving:
- Shadow aliases inside the class (`using dstype=T; using Int=I;` + `using solstruct=solstructT<T,I>;`
for every Phase-1 struct) so every member declaration and every out-of-line method body is
BYTE-FOR-BYTE UNCHANGED yet resolves to the <T,I> instantiations.
- extern template (discretization.h) + explicit instantiation (discretization.cpp) for the
backend-defined members (file ctor, dtor, finalizeConstruction, compGeometry, compMassInverse,
DG2CG{,2,3}). This preserves the compile-once TU split: main.cpp builds CSolution<M> whose inline
ctor/dtor construct/destroy a `CDiscretization disc` by value but never see those out-of-line
bodies (they live only in the unity ExasimSolver.cpp). The in-memory ctors
(discretization_inmemory.hpp, consumer-only) stay implicit -- consumer TUs see their defs. The 8
`class CDiscretization;` forward decls became `template<class,class> class CDiscretizationT; using ...`.
Verified BOTH paths precision-preserving: consumer (in-memory ctor, pure implicit) robustness ALL
PASS; backend (file ctor via extern template -> library) full rebuild + main.cpp link + app-regression
at the SAME rel_L2 (poisson3d 9.99e-11, isoq3d 1.95e-9, poisson2d 2.29e-12). Landed first full build.
Remaining Phase 2 (CResidual/CAssembler/CPreconditioner/CSolver + the exasim shim) tracked in
docs/internals/precision-threading.md.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_019dc1hMgBWTPMZcYoxPYcn4
…lver (⑤ phase 2 cont.) Extend the FEM classes to <M, T=::dstype, I=::Int> so they carry the precision alongside the model. Much easier than CDiscretization: these are already <M> templates (header-inline via the operators.hpp aggregation, so consumers instantiate them locally -- no extern-template/explicit-instantiation TU split) that only hold a REFERENCE to the owner. Same shadow-alias pattern (using dstype=T; using Int=I; using CDiscretization=CDiscretizationT<T,I>; + sysstruct/precondstruct where used) keeps every member decl and out-of-line body byte-identical; the only explicit edits are the cross-references between them (CAssembler<M>->CAssembler<M,T,I> etc. in the solve-surface signatures) and the 15 forward declarations across the backend (solver.h, preconditioner.h, nonlinearsolver.h, solutionwriter.h). Gotcha fixed: CSolver::gmres returns Int, and an out-of-line return type is looked up in NAMESPACE scope (not the class), so it had to be spelled `I` (the template param) to match the in-class Int(=I); parameter types after the `CSolver<M,T,I>::` are in class scope and needed no change. The ApplyPoly<M> free helpers in gmres.cpp deliberately stay <M> (their default-precision refs match under T=dstype). Verified precision-preserving: consumer robustness ALL PASS + full backend rebuild + main.cpp link + app-regression through the real solver (gmres/PTC) at the SAME rel_L2 (naca0012 2.94e-13, poisson3d 9.99e-11, isoq3d 1.95e-9, poisson2d 2.29e-12). Remaining Phase 2 = the include/exasim shim (Operator/ShellMat/assemble_matrix take Scalar/Idx). See docs/internals/precision-threading.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_019dc1hMgBWTPMZcYoxPYcn4
…e (⑤) Final Phase-2 piece: the include/exasim PETSc shim now carries the exported precision as template parameters. Operator<M, Scalar=floatTy, Idx=intTy>, ShellMat<Scalar=floatTy>, and make_mass_inverse<M,Scalar,Idx> / assemble_matrix<M,Scalar,Idx>. Operator holds CDiscretizationT<Scalar,Idx>& + CAssembler<M,Scalar,Idx>& + CPreconditioner<M,Scalar,Idx>& + sysstructT<Scalar,Idx>&, exposes scalar_type/index_type from the params, and the zero-copy Vec/Mat wrapping reinterprets PetscScalar/PetscInt as Scalar/Idx. The old GLOBAL ABI static_assert(sizeof(PetscScalar)==sizeof(floatTy)) is replaced by a PER-INSTANTIATION static_assert(sizeof(PetscScalar)==sizeof(Scalar)) inside each wrapper -- it fires only for the precision you actually instantiate, and including petsc.hpp no longer forces a TU-wide precision match. floatTy IS dstype, so the default Operator<M> = Operator<M,dstype,Int> is byte-identical; the threading makes precision an explicit, queryable property of the exported type. Consumers spell bare `ShellMat` as `ShellMat<>` (now a template; CTAD can't deduce Scalar from the apply lambda) -- fixed in robustness + petsc_poisson. Verified (header-only shim -> backend app-regression unaffected): robustness ALL PASS + petsc_poisson (SNES resid 9.08e-14, ShellMat vs matrix-free 0.0, MATAIJ vs matrix-free 2.65e-16) + petsc_heat (PETSc TS drives the heat loop) -- all PASS, values identical to pre-shim. Phase 2 complete: structs -> CDiscretization -> FEM classes -> PETSc shim all thread <T,I> with dstype/Int as defaults; whole stack precision-parameterized + byte-identical under the defaults. See docs/internals/precision-threading.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_019dc1hMgBWTPMZcYoxPYcn4
… start) Phase 3 (kernels + BLAS) start. Introduce blas<T> -- a by-value, precision-dispatched CPU BLAS/LAPACK trait that replaces the per-function `#ifdef USE_FLOAT S.. #else D..` branches with type dispatch (blas<double> -> dgemm/dgetrf/dgetri/dgemv, blas<float> -> the s* symbols; primary template undefined so an unsupported precision is a compile error). This is the pblas.h "choke point" the plan calls out: it lets the pblas wrappers be templated on the scalar type T. Converted this increment (templated on <class T = dstype>, CPU path via blas<T>): the GEMM node/Gauss transforms cpuNode2Gauss / Node2Gauss / Gauss2Node, and the LAPACK inverse cpuComputeInverse. Under the default T=dstype the trait selects exactly the routine the old #ifdef did -> byte-identical. Callers still pass dstype* buffers, so T deduces to dstype (unchanged). The GPU (cublas/hipblas) branches remain #ifdef USE_FLOAT-selected for now -- correct under the default, trait-ification for non-default GPU precision is the remote-verified tail. docs/internals/precision-threading.md now records the Phase-3 plan + THE CUT: the ExasimDriverABI function-pointer boundary (include/driver_abi.hpp) is where the frontend-generated dstype kernels stay frozen while everything below (solver, Discretization kernels, pblas BLAS) can thread T -- so the Matlab/Python/Julia frontends need NO change. The T-vs-dstype-at-the-ABI decision (static_assert vs a conversion shim) is documented and deferred. Verified precision-preserving: robustness ALL PASS + full rebuild + app-regression at the SAME rel_L2 (naca0012 2.94e-13, poisson3d 9.99e-11, isoq3d 1.95e-9, poisson2d 2.29e-12). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_019dc1hMgBWTPMZcYoxPYcn4
…tance A) Adopt stance "A" for the precision cut at the ExasimDriverABI seam: the frontend-generated kernels are hard-typed dstype behind the ABI function pointers, so the AbiAdapter branch of EXASIM_DRIVER_CALL requires the caller's scalar type == the global dstype. Implemented as a forward-activating static_assert(std::is_same_v<dstype, ::dstype>) in that branch: - Today (kernels not yet precision-templated) `dstype` there is the global ::dstype -> assert trivially true -> byte-identical build. - Once a kernel is templated with a `using dstype = T;` shadow (rest of Phase 3), the SAME assert automatically becomes the T==dstype guard -- a clear diagnostic instead of a raw T*->dstype* type error, and the single hook to later swap in a conversion shim (stance B) or Phase-4 templated codegen. Mixed precision stays reachable on the concrete-model path (exasim::Name<M>), which the PETSc export uses and which never touches the frozen ABI. Frontends (Matlab/Python/Julia) need NO change: they keep emitting dstype kernels bound as void(*)(dstype*,...) in ExasimDriverABI. docs/internals/precision-threading.md records A as chosen + the deferred B/Phase-4 upgrade path (both localized to this one macro branch, so A locks in nothing). Verified: full backend rebuild (compiles the AbiAdapter path) + app-regression at the SAME rel_L2 (poisson3d 9.99e-11, isoq3d 1.95e-9, poisson2d 2.29e-12). static_assert emits no code -> byte-identical. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_019dc1hMgBWTPMZcYoxPYcn4
…change (⑤ phase 3)
The MPI companion to blas<T>. MPI datatypes were hardcoded two ways, one of them a live bug:
- pblas.h reductions (PDOT/PGEMTV/PGEMTM): #ifdef USE_FLOAT MPI_FLOAT #else MPI_DOUBLE.
- halo exchange (matvec/residual/residualeval/ldgblockjacobian/solution/postsolution/
solutionwriter/setsysstruct): MPI_Isend/Irecv hardcoded MPI_DOUBLE even though the buffers
(buffsend/buffrecv/tempn/tempg) are dstype* -> a single-precision (USE_FLOAT) build sent float
buffers tagged MPI_DOUBLE, mis-typing every rank exchange. Pre-existing correctness bug.
Consolidate mpi_type<T>() into backend/Common/common.h (double->MPI_DOUBLE, float->MPI_FLOAT,
int->MPI_INT, long->MPI_LONG) -- there was already an identical copy buried late in parmetisexasim
(.hpp+.cpp), invisible to the earlier halo/reduction sites; removed the duplicate so there is one
definition, early enough for everyone. Collapsed the 3 pblas.h reduction #ifdef blocks and fixed 34
halo-exchange sites to mpi_type<dstype>(): byte-identical for the default double build, correct for
single precision, and forward-compatible (auto-becomes mpi_type<T>() once the comm helpers carry a
`using dstype=T` shadow, like the blas<T> trait + the seam static_assert). ParMETIS MPI_DOUBLE left
alone -- that is the partitioner's real_t weights, independent of dstype.
Verified: CPU-MPI (EXASIM_MPI=ON) full rebuild + multi-rank app-regression at unchanged rel_L2
(naca0012 2.94e-13, poisson3d 9.99e-11, isoq3d 1.95e-9). GPU-MPI verification (dgx-b) next.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_019dc1hMgBWTPMZcYoxPYcn4
…x-b V100) The whole Phase 0-3 stack is now verified on GPU, not just CPU: - petsc_poisson (EXASIM_BACKEND=2): exported shim solves HDG Poisson on a V100 (ShellMat vs op.mat()=0, MATAIJ 2.6e-16), matching CPU. - petsc_poisson_mpi (mpiexec -n 1/2/4, backend=2): multi-rank halo exchange through mpi_type<dstype>() + ParMETIS partition, all PASS (SNES residual ~3e-12). Records the dgx-b build/verify loop (sync.sh dgx -> build-dgx.sh -> build-petsc-gpu-consumer.sh / build-run-petsc-gpumpi.sh) and the GPU-build-with-backend-0 unsupported-combo note. .claude/PLAN.md refreshed to the current precision-threading state (was stale from the app-baseline task). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_019dc1hMgBWTPMZcYoxPYcn4
… 3 cont.) Convert the remaining GEMM/GEMV pblas wrappers to template<class T=dstype> + blas<T>: Gauss2Node1 (accumulate gemm), PGEMNV (gemv), PGEMTV (transpose gemv), PGEMTM (transpose gemm) -- the hot element-block matvec GEMMs. CPU branches now dispatch through blas<T>::gemm/gemv (alpha/beta derefed into the by-value trait interface); the GPU cublas/hipblas branches are untouched (byte-identical on GPU, still #ifdef USE_FLOAT-selected). Callers pass dstype* so T deduces to dstype -> identical. Verified: robustness ALL PASS + full CPU-MPI rebuild + app-regression at the SAME rel_L2 (naca0012 2.94e-13, poisson3d 9.99e-11, isoq3d 1.95e-9, poisson2d 2.29e-12). GPU unaffected (GPU branches not touched; last GPU/GPU-MPI verify @6f8013df stands). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_019dc1hMgBWTPMZcYoxPYcn4
…ont.) Template the 8 dstype-typed CPU primitives in cpuimpl.h -- cpuArraySetValue, cpuArrayInsert, cpuArrayMin/Max, cpuElemGeom, cpuGetElemNodes, cpuApplyGivensRotation, cpuBackSolve -- as template<class T=dstype> with a `using dstype=T;` body shadow (signatures retyped to T, bodies unchanged). These are the CPU-side compute primitives the pblas Array ops + discretization geometry call; callers pass dstype* so T deduces to dstype -> byte-identical. Pure CPU (no #ifdef), fully local-verifiable. Verified: robustness ALL PASS + full CPU-MPI rebuild + app-regression at the SAME rel_L2 (cone 1.92e-12, orion 4.31e-14, poisson3d 9.99e-11, isoq3d 1.95e-9, poisson2d 2.29e-12). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_019dc1hMgBWTPMZcYoxPYcn4
…cont.) The big one. Template all 137 dstype-signature functions in kokkosimpl.h -- the Kokkos device compute kernels (flux/stabilization/geometry/array ops/thermo/wall-model + the 9 KOKKOS_INLINE_FUNCTION device helpers) -- as template<class Ty=dstype> with a `using dstype=Ty;` body shadow (signatures retyped, bodies + KOKKOS_LAMBDA parallel_for bodies unchanged). Uses `Ty` not `T` as the param name because several kernels already use `T` as an identifier (temperature: `log(T)/T`, etc.). Callers pass dstype* so Ty deduces to dstype -> byte-identical. No functor structs / no Kokkos::View<> (raw pointers), so the shadow-alias pattern applies uniformly. Verified CPU: robustness ALL PASS + full rebuild + app-regression at the SAME rel_L2 (naca0012 2.94e-13, nsmach8 3.63e-15, poisson3d 9.99e-11, isoq3d 1.95e-9, poisson2d 2.29e-12). GPU verification (dgx-b, nvcc device-compiles the templated lambdas) next. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_019dc1hMgBWTPMZcYoxPYcn4
…rified The compute-primitive layer (pblas GEMM/GEMV via blas<T>, cpuimpl 8 primitives, kokkosimpl 137 device kernels) now threads T/Ty. GPU-verified on dgx-b: nvcc device-compiles all 137 kokkosimpl templated lambdas and petsc_poisson backend=2 is byte-identical (residual 9.119e-14, ShellMat 0, MATAIJ 2.62e-16). Remaining Phase 3: the Discretization/*.hpp kernels, comm-helper using-dstype=T threading, and the pblas tail (Array ops / dot / GPU trait methods). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_019dc1hMgBWTPMZcYoxPYcn4
…(⑤ phase 3 cont.) Thread the precision/index template params through every orchestration free function in backend/Discretization/*.hpp: signatures retype struct params to XxxstructT<T,I>& and dstype*->T*, bodies open with `using dstype=T;` so bodies are untouched (byte-identical by construction). Callers already hold solstructT<T,I> members and call foo<M>(sol,...), so T/I deduce from the struct args -- no call-site changes. This is what makes the stance-A seam static_assert auto-activate into the real T==dstype guard (the shadow reaches EXASIM_DRIVER_CALL). Files: uequation/matvec/qequation/wequation/residual/qoicalculation/massinv/ getuhat (compute kernels) + setstructs (in-memory struct setup for CDiscretizationT<T,I> construction). IO/mesh-format helpers left alone. Verified: full rebuild+install+app-regression byte-identical (naca0012 2.94e-13, nsmach8 3.63e-15, poisson3d 9.99e-11, isoq3d 1.95e-9; 10 pass, 2 pre-existing np=4 fails). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_019dc1hMgBWTPMZcYoxPYcn4
…er (dgx-b V100) nvcc device-compiles the templated backend/Discretization/*.hpp headers into the CUDA Exasim library and the petsc_poisson consumer; backend=2 run byte-identical (residual 9.119e-14, ShellMat vs op.mat()=0, MATAIJ 2.622e-16). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_019dc1hMgBWTPMZcYoxPYcn4
… phase 3 cont.) Extend the blas<T> trait with dot/copy/scal/axpy (s*/d* per specialization) and template the remaining pblas wrappers on <class T=dstype> with a `using dstype=T;` body shadow: Inverse, PDOT, PNORM (x2), DOT, ArrayCopy, ArrayMultiplyScalar, ArrayAXPY, ArrayAXPBY, ArrayAX, NORM, PGEMNMStridedBached. Each CPU `#ifdef USE_FLOAT S* #else D*` branch collapses to one blas<T>::... call (byte-identical under default T=dstype); PDOT's MPI_Allreduce picks up mpi_type<T>() via the shadow. GPU cublas/hipblas branches left #ifdef. Callers pass dstype* so T deduces -- no call-site changes. Verified: full rebuild+install+app-regression byte-identical (naca0012 2.94e-13, nsmach8 3.63e-15, poisson3d 9.99e-11, isoq3d 1.95e-9; 10 pass, 2 pre-existing np=4 fails). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_019dc1hMgBWTPMZcYoxPYcn4
pblas tail nvcc-compiled into CUDA lib + petsc_poisson consumer, backend=2 byte-identical (9.119e-14 / 0 / 2.622e-16). PLAN.md now records the full compute+orchestration layer as templated and verified on CPU/CPU-MPI/GPU/GPU-MPI; only the blas<T> GPU trait methods (non-default GPU precision) remain in phase 3. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_019dc1hMgBWTPMZcYoxPYcn4
… test (⑤ phase 4, concrete-model path) Thread the precision template param through the concrete-model kernel path so a CONSUMER can instantiate the model + its Kokkos kernels at float32 against the existing double-default install -- no Exasim rebuild. This is the payoff of the dstype->T threading: precision is a type-level choice, not a build macro. - include/exasim/drivers.hpp, include/exasim/kernels/*.hpp: *Driver<M>/*_kernel<M> now <M,T=dstype,I=Int>; storage buffers + gather/scatter locals retyped dstype/ double -> T so a float instantiation is pure-float32 (byte-identical at T=dstype). - tests/models/poisson2d.hpp: model is now a struct template Poisson2DT<T> with `using Poisson2D = Poisson2DT<double>` (methods stay non-template so the is_flux_model_v `&M::flux` trait still detects them; double path unchanged). - tests/consumers/model_fp32: new consumer-side test running Poisson2D flux/source/ qoi kernels at float AND double via Kokkos (CPU + GPU). CPU PASS: max |float-double|/|double| = 1.6e-07 (fp32), double vs analytic 3.2e-16. The double petsc_poisson consumer recompiles byte-identical (9.084e-14 / 0 / 2.651e-16), confirming the concrete-model path change is precision-preserving. NOTE: a full float32 SOLVE additionally needs the export/preprocessing boundary (Preprocessed/make_preprocessed hold double structs) templated + a float PETSc for the zero-copy PETSc path -- that is the remaining Phase-5 cutover. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_019dc1hMgBWTPMZcYoxPYcn4
… (CPU+GPU) - tests/consumers/model_fp32/CMakeLists.txt: add EXASIM_GPU option so the test builds against the CUDA install (COMPONENTS gpu) with nvcc_wrapper, same as the petsc consumers. - docs: record the concrete-model path (drivers/kernels/model) as float32-verified on CPU (Serial) + GPU (CUDA V100), max |float-double| = 1.6e-07; and document the Phase-5 boundary a full float32 solve still needs (Preprocessed double structs + zero-copy PetscScalar). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_019dc1hMgBWTPMZcYoxPYcn4
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_019dc1hMgBWTPMZcYoxPYcn4
…on <T,I> (phase 5, no conversion) Cross the export/preprocessing boundary WITHOUT a conversion shim: precision flows as a template param through the whole in-memory bundle path. - buildstructs.hpp: Preprocessed -> struct template PreprocessedT<T=dstype,I=Int> (holds XxxstructT<T,I>); the 6 struct builders + the 2 double-array malloc helpers are <T,I> and cast the double Mesh/Master intermediates into T (the intermediates stay double -- only the materialized solve-precision structs are T). - preprocessing.h/.hpp: CPreprocessing::take()/takeParallel() are member templates <T=dstype,I=Int> returning PreprocessedT<T,I> and calling buildXStruct<T,I>. - export.hpp: make_preprocessed<M,T=floatTy,I=intTy> returns PreprocessedT<T,I>. - discretization.h/_inmemory.hpp: in-memory ctor param Preprocessed&& -> PreprocessedT<T,I>&&. Defaults reproduce the old dstype/Int signatures exactly -> double path byte-identical (petsc_poisson consumer 9.084e-14 / 0 / 2.651e-16, unchanged). REMAINING for a full float SOLVE (Phase 5 backend cutover): the in-memory ctor's setup chain -- cpuInitInMemory + setstructs.cpp (setcommonstruct/cpuInitSetup/ setAppRuntimeContext) + connectivity buildConn -- and the CDiscretizationT backend members (finalizeConstruction/compGeometry/compMassInverse/DG2CG) are compiled/ instantiated at dstype only in the unity build; a float disc needs them at float. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_019dc1hMgBWTPMZcYoxPYcn4
…se 5 cutover) The consumer aggregates the backend source (operators.hpp #includes discretization.cpp -> connectivity.cpp + setstructs.cpp), so templating those .cpp lets a consumer instantiate CDiscretizationT<T> from source -- no lib rebuild, no float explicit instantiation (the extern template only pins dstype). - setstructs.cpp: setcommonstruct/cpuInitSetup/setAppRuntimeContext/cpuInit/dev*/gpuInit etc. -> <T=dstype,I=Int> (shadow-alias). connectivity.cpp: buildConn + 32 more. - preprocessing.cpp: take()/takeParallel() -> member templates (mirror preprocessing.hpp), matching the templated CPreprocessing decl. - discretization_inmemory.hpp: cpuInitInMemory + its cpuInitSetup/setAppRuntimeContext forward-decls -> templates. - readbinaryfiles.cpp: readInput's setAppRuntimeContext forward-decl -> template. - Fixed thread_disc regex miss: xiny2 (template<typename T>, no space) had a spurious 2nd template head injected; removed. Verified: full lib rebuild + app-regression 12/12 PASS (isoq/sharpb2 np=4, previously RUN_FAIL, now build+run from the consistent source; all rel_L2 << 1e-8, within the harness's documented ~1e-9 FP-reorder band). Double petsc_poisson consumer byte-identical (9.084e-14 / 0 / 2.651e-16). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_019dc1hMgBWTPMZcYoxPYcn4
…s/geometry helpers) Continue the phase-5 cutover toward a full float32 solve as a consumer: - setsysstruct.cpp: setsysstruct + 2 -> <T=dstype,I=Int>. - export.hpp: recover_volume / eval_qoi<M> -> <T,I> (so a float disc's volume field + QoI L2 error are reachable). - connectivity.cpp: fetch_node / equal_row_tol -> template<class T> (coord helpers). - ismeshcurved.cpp: IsMeshCurved -> <T,I>. - discretization.cpp: BuildElementBlockBoundaryFaces / AllocateLDGBlockJacobianMemory -> <T,I>. All byte-identical for the default dstype path: petsc_poisson consumer still PASSES (9.084e-14 / 0 / 2.651e-16), and the full app-regression is 12/12. SCOPE FINDING: a full float32 *solve* consumer (instantiating CDiscretizationT<float> end-to-end) surfaces ~167 precision-mixing sites across the whole backend -- ArraySetValue (48), PGEMNMStridedBached (34), ArrayAXPBY, the geometry (ElemGeom/FaceGeom), every model kernel, the residual (Ru/Rq Elem/Face), the preconditioner, and the non-templated CWallModel class. Each is a templated caller passing a double literal/constant mixed with T buffers. This is a backend-wide cutover (not "one lib rebuild"); the preprocessing + in-memory setup boundary is done + verified, and this is the remaining, larger phase-5 body. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_019dc1hMgBWTPMZcYoxPYcn4
…ve = ~167-site backend cutover Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_019dc1hMgBWTPMZcYoxPYcn4
… works via templates The ~167 precision-mixing sites are resolved; a consumer now instantiates CDiscretizationT<float> end-to-end and runs a full HDG Poisson SOLVE, choosing precision purely via templates (no Exasim rebuild, no conversion shim). Root fixes, by category: - noDeduce_t<T> (common.h): scalar function params spelled noDeduce_t<T> deduce T only from the buffer args, so a double literal/constant (0.0, `zero`, `minusone`, a tolerance) passed next to a float* buffer converts instead of forcing a conflicting deduction. Applied to the pblas/kokkosimpl array+blas ops (ArraySetValue, PGEMNMStridedBached, ArrayAXPBY/AXY/AXPB, UpdateUDG, ApplyDtcoef, StgInflow*, ...). - drivers.hpp: *_kernel<M>(...) -> <M,T>; compute_shape templated; the interface Fint/Fext forwarders deduce the scalar from the first buffer arg and compile out for non-default precision (coupling is double-only; single-model solves never hit an interface face). - kernels/*.hpp: retype the gather/scatter locals double->T in eos/sourcew/tdfunc/ interface/output/visualization; init.hpp InitFn -> InitFn_t<T> (+ init_dispatch reorder). - model.hpp: the ModelDefaults hierarchy (8 mixins) + zero_fill templated on scalar, so a float model's inherited defaults (initudg/sourcew/...) are float-typed; poisson2d.hpp inherits ModelDefaults<Poisson2DT<T>, T>. - geometry.hpp (the .hpp is the live one), uresidual/qresidual.hpp, matvec.hpp (diag_blocks), connectivity.cpp (make_cell_key/xiny/fetch_node/equal_row_tol/permindex coord locals), discretization.cpp (crs_init, Build*/Allocate* helpers; wall-model branch if-constexpr-guarded to dstype), setprecondstruct.cpp -> all <T,I>. - common.h: qoiparamsstruct/wallmodelparamsstruct -> templated sub-structs. tests/consumers/solve_fp32: builds the in-memory HDG Poisson stack at double AND float, condenses to H*uh=b, LAPACK-solves, and compares the float trace solution to the double one. PASS: ||uh_f32 - uh_f64|| / ||uh_f64|| = 1.01e-4 (fp32 dense-LU accuracy). Double path fully intact: full lib rebuild + app-regression 12/12 PASS (unchanged rel_L2 << 1e-8); petsc_poisson consumer byte-identical. KNOWN FOLLOW-UP: the float32 eval_qoi volume-integral readback underflows to ~0 (qoi[1]=1e-19 vs 0.405) -- a post-processing quadrature-precision quirk; the SOLVE itself is correct (float ||uh|| matches double to fp32). Tracked separately. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_019dc1hMgBWTPMZcYoxPYcn4
…known qoi follow-up) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_019dc1hMgBWTPMZcYoxPYcn4
…OT read as garbage) blas<float>::dot called the BLAS sdot_, but on several platforms (macOS Accelerate, f2c/CLAPACK) the F77 single-precision REAL function sdot_ returns a *double* in the ABI while common.h declares `float SDOT(...)`, so the return was misread as garbage (~1e-19). This silently zeroed every single-precision PDOT/DOT/NORM -> the float32 eval_qoi volume integral read ~0 (qoi[1]=1e-19 instead of 0.405), even though the solve itself was correct. Fix: compute blas<float>::dot with a manual loop (accumulate in double for a well-conditioned reduction), sidestepping the sdot_ ABI landmine. DDOT (double) is untouched, so the default path is byte-identical. Now the float32 solve_fp32 QoI is correct: qoi[1]=0.405249 (vs double 0.405285, fp32 precision), qoi[0]=2.45e-9. Consumer still PASS (uh vs double = 1.01e-4). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_019dc1hMgBWTPMZcYoxPYcn4
…one, sdot_ ABI bug fixed Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_019dc1hMgBWTPMZcYoxPYcn4
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.