Benchmark harness for the gnucobol-rs hot
path — the cob_move DISPLAY ↔ COMP-3 conversions that dominate legacy-file ingestion — with a
parity re-check after every run. The doctrine: performance work never alters sealed semantics.
A throughput number is never reported without re-confirming the byte-exact result.
Measured by kobold-throughput on one developer machine (x86-64, single thread, --release,
overflow-checks = true). Your numbers will differ; reproduce with cargo run --release --bin kobold-throughput -- 5000000.
| Conversion | Throughput | Bytes | Parity |
|---|---|---|---|
DISPLAY S9(7)V99 → COMP-3 |
~95 M records/sec | ~860 MB/sec (source) | re-checked byte-exact |
For scale intuition: a 100-million-record nightly batch with this single field is ~1 second of
CPU for the decimal conversion, single-threaded — before any parallel/SIMD feature. The kernel is
allocation-free per conversion (fixed stack buffers), which is what makes it Lambda/Glue-friendly.
These are baseline numbers on purpose. The point is a provably correct primitive that is already fast; optional acceleration is gravy, gated, and re-proven.
cargo run --release --bin kobold-throughput -- 10000000 # quick throughput probe + parity re-check
cargo bench # criterion micro-benchmarks (per-direction)cargo bench runs benches/cob_move.rs (Criterion) for display_to_packed and packed_to_display
with Throughput::Elements, reporting time/conversion and elements/sec.
- Synthetic, reproducible batches — a deterministic LCG (
gen_display_batch) so the input is identical across machines and runs; mixed signs (overpunch) to exercise the sign path. - Parity re-check is mandatory —
parity_holdsround-trips a sample (DISPLAY → COMP-3 → DISPLAY) and asserts the decoded value is unchanged.kobold-throughputexits non-zero if it fails. No number ships without this. - Honest accounting — single-threaded, one field; multi-field records and end-to-end shim decode (with copybook layout) are heavier. This harness measures the kernel conversion, the part that must be both correct and fast.
Performance features must be strictly optional, gated, and semantics-preserving. The intended
shape (tracked, with each path re-running the full gnucobol-rs differential sweep + Kani suite in
CI before it can be claimed):
parallel(Rayon). Batch-levelpar_iterover independent records — embarrassingly parallel, near-linear on the 8–64 vCPU instances common on AWS. No change to per-record bytes.simd. Vectorized nibble pack/unpack and overpunch handling for the COMP-3 inner loop, in an isolatedunsafemodule with a scalar fallback and runtime CPU-feature detection. The default build stays#![forbid(unsafe_code)].
Each, if added, is labelled "accelerated (feature-enabled)" vs the "baseline (correctness
mode)" here, reported in compat_profile, and listed as not part of the sealed courts.
Lambda/Glue bill on duration. A correct-and-fast kernel turns a decimal-heavy batch from a
multi-hour job into minutes, cutting compute spend and shortening reconciliation windows. Pair these
numbers with the kobold-data-shim parity
receipts and the kobold-lambda-layer
packaging for a full S3 → verified-records reference architecture.
Apache-2.0 (LICENSE). Links gnucobol-rs (LGPL-3.0-or-later) — see NOTICE for the
binary-distribution obligations.
kobold-bench2 measures the full shim reconciliation pipeline (FILE.1 ingest → decode → LEVEL-88 →
audit) over a synthetic happy corpus, in scalar mode (no Rayon, no SIMD, no fast mode). Timing is
admitted only after the output/audit hash matches the pinned baseline — a benchmark must never alter or
outrun sealed semantics. A mismatch aborts with PARITY FAIL and no timing.
cargo run --release --bin kobold-bench2 -- 50000 # records
# -> reports/BENCH-2-receipt.json (records/sec, µs/record, decode-only vs full split, host/profile)It records the output sha256, the decode-only vs full-pipeline split (so ingest+audit overhead is
visible), and host (cpu/arch/profile, rayon:false, simd:false). The hostile fixtures the courts
fail-close on live in kobold-data-shim's KOBOLD.CORPUS.2. No production, AWS, parallel, or
customer-workload throughput is claimed.
kobold-bench2 --features rayon adds record-level Rayon — admitted only when its output hash is
byte-identical to the scalar baseline over a fixed reference corpus (perf1 parity: rayon == scalar); a
mismatch aborts with RAYON PARITY FAIL and no timing. The receipt reports both modes + the speedup +
custody_us_per_record (POSTING.1/EXTRACT.PROFILE.1). Same evidence, faster — never weaker evidence,
faster. No production/AWS/SIMD/parallel-throughput claim; parallelism changes no emitted artifact.
kobold-scale [100m|1g|5g|10g] generates a declared synthetic mixed fixed-record corpus to a temp file
and streams it through the sealed reconcile pipeline in fixed reconcile-blocks — so memory stays
bounded even at multi-GB (1 GB corpus ≈ 57 MB peak RSS here). Scalar and Rayon use the same block unit,
so their output hashes are byte-identical by construction; **Rayon timing is admitted only after that match
- a pinned per-size baseline** (else
SCALE PARITY FAIL). The receipt records wall time, throughput, peak RSS, temp disk, and the POSTING.1 hash chain.
cargo run --release --bin kobold-scale --features rayon -- 1g
# -> reports/SCALE-1-receipt-1g.json (admitted) + SCALE-1-baseline-1g.json (pinned)Caution
No production SLA · no AWS cost · no mainframe equivalence · no universal throughput · no customer-workload representativeness. A synthetic-corpus number on one host is exactly that.
kobold-bench2 now reports a perf2_stage_profile — the reconcile pipeline's three stages (parse /
per-record / aggregate) with the bottleneck named. On this host the per-record stage dominates (~75%,
the part PERF.1's Rayon parallelizes byte-identically), aggregation ~25% (serial/ordered), parse ~0.5%.
Profiling never changes the emitted bytes; parallelism only touches record-local work.
