kobold-bench

Benchmark harness for the gnucobol-rs hot path — the cob_move DISPLAY ↔ COMP-3 conversions that dominate legacy-file ingestion — with a parity re-check after every run. The doctrine: performance work never alters sealed semantics. A throughput number is never reported without re-confirming the byte-exact result.

Baseline (correctness mode — no SIMD, no parallelism)

Measured by kobold-throughput on one developer machine (x86-64, single thread, --release, overflow-checks = true). Your numbers will differ; reproduce with cargo run --release --bin kobold-throughput -- 5000000.

Conversion	Throughput	Source bandwidth	Parity
DISPLAY `S9(7)V99` → COMP-3	~95 M records/sec	~860 MB/sec	re-checked byte-exact

For scale intuition: a 100-million-record nightly batch with this single field is ~1 second of CPU for the decimal conversion, single-threaded — before any parallel/SIMD feature. The kernel is allocation-free per conversion (fixed stack buffers), which is what makes it Lambda/Glue-friendly.

These are baseline numbers on purpose. The point is a provably correct primitive that is already fast; optional acceleration is gravy, gated, and re-proven.
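To make "allocation-free per conversion" concrete, here is a minimal sketch of a DISPLAY `S9(7)V99` → COMP-3 pack that uses only fixed-size stack arrays. It is not the gnucobol-rs kernel: the function name, the error type, and the sign convention are assumptions. In particular it assumes an ASCII trailing overpunch where a plain digit means positive and `p`..`y` encode a negative final digit 0..9; the sealed implementation may use a different table.

```rust
// Sketch only: names and the overpunch table are assumptions, not the gnucobol-rs API.
const DIGITS: usize = 9; // S9(7)V99 has 9 digit positions
const PACKED_LEN: usize = DIGITS / 2 + 1; // 9 digits + 1 sign nibble = 10 nibbles = 5 bytes

#[derive(Debug, PartialEq)]
pub enum PackError {
    /// The byte at this index is not a valid digit (or signed final digit).
    BadByte(usize),
}

/// Pack a zoned-decimal field into COMP-3 without touching the heap.
pub fn display_to_packed(src: &[u8; DIGITS]) -> Result<[u8; PACKED_LEN], PackError> {
    // Fixed-size stack scratch: one nibble per digit plus the trailing sign nibble.
    let mut nibbles = [0u8; DIGITS + 1];

    // All positions except the last must be plain ASCII digits.
    for (i, &b) in src[..DIGITS - 1].iter().enumerate() {
        if !b.is_ascii_digit() {
            return Err(PackError::BadByte(i));
        }
        nibbles[i] = b - b'0';
    }

    // The last position carries the sign (assumed ASCII overpunch, see lead-in).
    let last = src[DIGITS - 1];
    let (digit, sign) = match last {
        b'0'..=b'9' => (last - b'0', 0x0C), // 0xC = positive sign nibble
        b'p'..=b'y' => (last - b'p', 0x0D), // 0xD = negative sign nibble
        _ => return Err(PackError::BadByte(DIGITS - 1)),
    };
    nibbles[DIGITS - 1] = digit;
    nibbles[DIGITS] = sign;

    // Two nibbles per output byte, high nibble first.
    let mut out = [0u8; PACKED_LEN];
    for (i, byte) in out.iter_mut().enumerate() {
        *byte = (nibbles[2 * i] << 4) | nibbles[2 * i + 1];
    }
    Ok(out)
}
```

Under these assumptions, `display_to_packed(b"001234567")` yields the bytes `00 12 34 56 7C` (that is, +12345.67), and the implied decimal point never appears in either encoding.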

Run it

cargo run --release --bin kobold-throughput -- 10000000   # quick throughput probe + parity re-check
cargo bench                                                # criterion micro-benchmarks (per-direction)

cargo bench runs benches/cob_move.rs (Criterion) for display_to_packed and packed_to_display with Throughput::Elements, reporting time/conversion and elements/sec.

Methodology

Synthetic, reproducible batches — a deterministic LCG (gen_display_batch) so the input is identical across machines and runs; mixed signs (overpunch) to exercise the sign path.
Parity re-check is mandatory — parity_holds round-trips a sample (DISPLAY → COMP-3 → DISPLAY) and asserts the decoded value is unchanged. kobold-throughput exits non-zero if it fails. No number ships without this.
Honest accounting — single-threaded, one field; multi-field records and end-to-end shim decode (with copybook layout) are heavier. This harness measures the kernel conversion, the part that must be both correct and fast.
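The methodology above can be sketched end to end as follows. The function names `gen_display_batch` and `parity_holds` are borrowed from the harness for readability, but their signatures here are guesses, and the LCG constants (Knuth's MMIX multiplier and increment), the `pack`/`unpack` helpers, and the ASCII overpunch convention (`p`..`y` for a negative final digit) are all assumptions made for illustration.

```rust
// Sketch only: signatures, LCG constants, and sign encoding are assumptions.
const DIGITS: usize = 9; // S9(7)V99
const PACKED_LEN: usize = 5; // 10 nibbles

/// Deterministic 64-bit LCG, so every machine generates the same batch.
struct Lcg(u64);

impl Lcg {
    fn next_u31(&mut self) -> u64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.0 >> 33 // keep the better-mixed high bits
    }
}

/// Generate `n` DISPLAY records with mixed signs from a fixed seed.
fn gen_display_batch(seed: u64, n: usize) -> Vec<[u8; DIGITS]> {
    let mut rng = Lcg(seed);
    (0..n)
        .map(|_| {
            let mut value = rng.next_u31() % 1_000_000_000; // at most 9 digits
            let negative = rng.next_u31() & 1 == 1;
            let mut rec = [b'0'; DIGITS];
            for slot in rec.iter_mut().rev() {
                *slot = b'0' + (value % 10) as u8;
                value /= 10;
            }
            if negative {
                rec[DIGITS - 1] += 0x40; // '0'..'9' becomes 'p'..'y'
            }
            rec
        })
        .collect()
}

fn pack(src: &[u8; DIGITS]) -> Option<[u8; PACKED_LEN]> {
    let mut nib = [0u8; DIGITS + 1];
    nib[DIGITS] = 0x0C; // positive unless the last byte says otherwise
    for (i, &b) in src.iter().enumerate() {
        nib[i] = match b {
            b'0'..=b'9' => b - b'0',
            b'p'..=b'y' if i == DIGITS - 1 => {
                nib[DIGITS] = 0x0D;
                b - b'p'
            }
            _ => return None,
        };
    }
    let mut out = [0u8; PACKED_LEN];
    for (i, o) in out.iter_mut().enumerate() {
        *o = (nib[2 * i] << 4) | nib[2 * i + 1];
    }
    Some(out)
}

fn unpack(packed: &[u8; PACKED_LEN]) -> Option<[u8; DIGITS]> {
    let mut nib = [0u8; DIGITS + 1];
    for (i, &b) in packed.iter().enumerate() {
        nib[2 * i] = b >> 4;
        nib[2 * i + 1] = b & 0x0F;
    }
    let negative = match nib[DIGITS] {
        0x0C => false,
        0x0D => true,
        _ => return None, // this sketch rejects any other sign nibble
    };
    let mut out = [0u8; DIGITS];
    for i in 0..DIGITS {
        if nib[i] > 9 {
            return None;
        }
        out[i] = b'0' + nib[i];
    }
    if negative {
        out[DIGITS - 1] += 0x40;
    }
    Some(out)
}

/// Round-trip DISPLAY → COMP-3 → DISPLAY and require byte-identical records.
fn parity_holds(batch: &[[u8; DIGITS]]) -> bool {
    batch
        .iter()
        .all(|rec| pack(rec).and_then(|p| unpack(&p)) == Some(*rec))
}
```

A caller in the spirit of kobold-throughput would generate the batch once, time the conversion loop, and then exit non-zero (for example via `std::process::exit(1)`) whenever `parity_holds` returns `false`, so that no timing is printed for a failing run.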

Gated acceleration — the plan (not yet implemented, never default)

Performance features must be strictly optional, gated, and semantics-preserving. The intended shape (tracked, with each path re-running the full gnucobol-rs differential sweep + Kani suite in CI before it can be claimed):

parallel (Rayon). Batch-level par_iter over independent records — embarrassingly parallel, near-linear on the 8–64 vCPU instances common on AWS. No change to per-record bytes.
simd. Vectorized nibble pack/unpack and overpunch handling for the COMP-3 inner loop, in an isolated unsafe module with a scalar fallback and runtime CPU-feature detection. The default build stays #![forbid(unsafe_code)].

Each, if added, is labelled "accelerated (feature-enabled)" vs the "baseline (correctness mode)" here, reported in compat_profile, and listed as not part of the sealed courts.

AWS cost framing

Lambda/Glue bill on duration. A correct-and-fast kernel can turn a decimal-heavy batch from a multi-hour job into minutes, cutting compute spend and shortening reconciliation windows. Pair these numbers with the kobold-data-shim parity receipts and the kobold-lambda-layer packaging for a full S3 → verified-records reference architecture.

License

Apache-2.0 (LICENSE). Links gnucobol-rs (LGPL-3.0-or-later) — see NOTICE for the binary-distribution obligations.

End-to-end scalar benchmark (`KOBOLD.BENCH.2`)

kobold-bench2 measures the full shim reconciliation pipeline (FILE.1 ingest → decode → LEVEL-88 → audit) over a synthetic happy corpus, in scalar mode (no Rayon, no SIMD, no fast mode). Timing is admitted only after the output/audit hash matches the pinned baseline — a benchmark must never alter or outrun sealed semantics. A mismatch aborts with PARITY FAIL and no timing.

cargo run --release --bin kobold-bench2 -- 50000   # records
# -> reports/BENCH-2-receipt.json (records/sec, µs/record, decode-only vs full split, host/profile)

It records the output sha256, the decode-only vs full-pipeline split (so ingest+audit overhead is visible), and host (cpu/arch/profile, rayon:false, simd:false). The hostile fixtures the courts fail-close on live in kobold-data-shim's KOBOLD.CORPUS.2. No production, AWS, parallel, or customer-workload throughput is claimed.
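The "timing is admitted only after the hash matches" rule can be sketched as a gate around the timed run. This is an illustration, not the kobold-bench2 source: `Receipt`, `admit_timing`, and the `u32` record type are hypothetical, and a small FNV-1a hash stands in for the sha256 the real receipt records so that the example needs no external crates.

```rust
use std::time::{Duration, Instant};

/// FNV-1a 64-bit, used here only as a dependency-free stand-in for sha256.
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for &b in bytes {
        h ^= b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

/// Hypothetical receipt: only constructed when parity has been confirmed.
#[derive(Debug)]
struct Receipt {
    records: usize,
    output_hash: u64,
    elapsed: Duration,
}

/// Run `pipeline`, then admit the timing only if the output hash equals the
/// pinned baseline. On mismatch the elapsed time is discarded.
fn admit_timing<F>(records: &[u32], pinned_hash: u64, pipeline: F) -> Result<Receipt, String>
where
    F: Fn(&[u32]) -> Vec<u8>,
{
    let start = Instant::now();
    let output = pipeline(records);
    let elapsed = start.elapsed();

    let output_hash = fnv1a64(&output);
    if output_hash != pinned_hash {
        return Err(format!(
            "PARITY FAIL: got {output_hash:016x}, pinned {pinned_hash:016x}"
        ));
    }
    Ok(Receipt {
        records: records.len(),
        output_hash,
        elapsed,
    })
}
```

Because a `Receipt` can only be built on the success path, there is no code path that reports a records/sec figure for output that drifted from the baseline.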

Gated parallelism (`KOBOLD.PERF.1`)

kobold-bench2 --features rayon adds record-level Rayon — admitted only when its output hash is byte-identical to the scalar baseline over a fixed reference corpus (perf1 parity: rayon == scalar); a mismatch aborts with RAYON PARITY FAIL and no timing. The receipt reports both modes + the speedup + custody_us_per_record (POSTING.1/EXTRACT.PROFILE.1). Same evidence, faster — never weaker evidence, faster. No production/AWS/SIMD/parallel-throughput claim; parallelism changes no emitted artifact.
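A minimal sketch of the "parallel == scalar, or no timing" rule follows. To stay dependency-free it uses `std::thread::scope` with contiguous chunks in place of Rayon, so it illustrates the parity gate rather than the project's actual Rayon code; `convert_record`, `run_scalar`, `run_parallel`, and `admit_parallel` are hypothetical names, and the per-record work is a trivial stand-in.

```rust
use std::thread;

/// Stand-in for the record-local work (the real pipeline decodes a record).
fn convert_record(rec: u32) -> [u8; 4] {
    rec.to_be_bytes()
}

/// Scalar baseline: records are processed in input order.
fn run_scalar(records: &[u32]) -> Vec<u8> {
    records.iter().flat_map(|&r| convert_record(r)).collect()
}

/// Parallel variant: split into contiguous chunks, process each chunk on its
/// own thread, then concatenate results in chunk order so output order is
/// the same as the scalar run.
fn run_parallel(records: &[u32], workers: usize) -> Vec<u8> {
    let workers = workers.max(1);
    let chunk_len = ((records.len() + workers - 1) / workers).max(1);
    thread::scope(|s| {
        // Spawn everything first, then join in order.
        let handles: Vec<_> = records
            .chunks(chunk_len)
            .map(|chunk| s.spawn(move || run_scalar(chunk)))
            .collect();
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("worker panicked"))
            .collect()
    })
}

/// Admit the parallel output only if it is byte-identical to the scalar one.
fn admit_parallel(records: &[u32], workers: usize) -> Result<Vec<u8>, &'static str> {
    let scalar = run_scalar(records);
    let parallel = run_parallel(records, workers);
    if parallel != scalar {
        return Err("RAYON PARITY FAIL");
    }
    Ok(parallel)
}
```

The design point is that parallelism only changes which thread does the record-local work; the concatenation order is fixed by chunk index, which is why the two outputs can be compared byte for byte.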

Local scale measurement (`KOBOLD.SCALE.1`)

kobold-scale [100m|1g|5g|10g] generates a declared synthetic mixed fixed-record corpus to a temp file and streams it through the sealed reconcile pipeline in fixed reconcile-blocks, so memory stays bounded even at multi-GB sizes (a 1 GB corpus peaks at ≈ 57 MB RSS here). Scalar and Rayon use the same block unit, so their output hashes are byte-identical by construction; **Rayon timing is admitted only after that match and a match against a pinned per-size baseline** (else SCALE PARITY FAIL). The receipt records wall time, throughput, peak RSS, temp disk, and the POSTING.1 hash chain.

cargo run --release --bin kobold-scale --features rayon -- 1g
# -> reports/SCALE-1-receipt-1g.json (admitted) + SCALE-1-baseline-1g.json (pinned)
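The bounded-memory property comes from reusing one fixed-size block buffer regardless of corpus size. The sketch below shows that pattern over any `Read` source; the record length, block size, `Summary` type, and the FNV-1a running hash (a stand-in for the real hash chain) are all assumptions, not values taken from kobold-scale.

```rust
use std::io::{self, Read};

const RECORD_LEN: usize = 16; // assumed fixed record length
const RECORDS_PER_BLOCK: usize = 4096; // assumed block unit
const FNV_OFFSET: u64 = 0xcbf29ce484222325;
const FNV_PRIME: u64 = 0x100000001b3;

/// Fold more bytes into a running FNV-1a hash (stand-in for the hash chain).
fn fnv1a_update(mut h: u64, bytes: &[u8]) -> u64 {
    for &b in bytes {
        h ^= b as u64;
        h = h.wrapping_mul(FNV_PRIME);
    }
    h
}

struct Summary {
    records: u64,
    hash: u64,
}

/// Read until `buf` is full or the source is exhausted; returns bytes read.
fn fill<R: Read>(src: &mut R, buf: &mut [u8]) -> io::Result<usize> {
    let mut n = 0;
    while n < buf.len() {
        match src.read(&mut buf[n..]) {
            Ok(0) => break,
            Ok(k) => n += k,
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
    Ok(n)
}

/// Stream the corpus block by block. Peak memory is one block buffer,
/// independent of how large the input is.
fn stream_blocks<R: Read>(mut src: R) -> io::Result<Summary> {
    let mut block = vec![0u8; RECORD_LEN * RECORDS_PER_BLOCK];
    let mut summary = Summary { records: 0, hash: FNV_OFFSET };
    loop {
        let filled = fill(&mut src, &mut block)?;
        if filled == 0 {
            break;
        }
        if filled % RECORD_LEN != 0 {
            return Err(io::Error::new(io::ErrorKind::InvalidData, "truncated record"));
        }
        for rec in block[..filled].chunks_exact(RECORD_LEN) {
            summary.hash = fnv1a_update(summary.hash, rec); // record-local work goes here
            summary.records += 1;
        }
        if filled < block.len() {
            break; // short block means end of input
        }
    }
    Ok(summary)
}
```

In a real run the source would be a `std::fs::File` over the temp corpus; an in-memory `std::io::Cursor` works equally well for trying the function out.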

Caution

No production SLA · no AWS cost · no mainframe equivalence · no universal throughput · no customer-workload representativeness. A synthetic-corpus number on one host is exactly that.

Per-stage profiling (`KOBOLD.PERF.2`)

kobold-bench2 now reports a perf2_stage_profile — the reconcile pipeline's three stages (parse / per-record / aggregate) with the bottleneck named. On this host the per-record stage dominates (~75%, the part PERF.1's Rayon parallelizes byte-identically), aggregation ~25% (serial/ordered), parse ~0.5%. Profiling never changes the emitted bytes; parallelism only touches record-local work.
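A per-stage profile of this kind can be collected with a few lines of wall-clock bookkeeping that sit around the stages and never touch their output. The sketch below is one possible shape, with a hypothetical `StageProfile` type; it is not the structure behind `perf2_stage_profile`.

```rust
use std::time::{Duration, Instant};

/// Hypothetical stage profile: an ordered list of (stage name, elapsed time).
#[derive(Default)]
pub struct StageProfile {
    stages: Vec<(&'static str, Duration)>,
}

impl StageProfile {
    /// Time a stage and pass its result through unchanged.
    pub fn time<T>(&mut self, name: &'static str, work: impl FnOnce() -> T) -> T {
        let start = Instant::now();
        let out = work();
        self.record(name, start.elapsed());
        out
    }

    /// Record an already-measured duration for a stage.
    pub fn record(&mut self, name: &'static str, elapsed: Duration) {
        self.stages.push((name, elapsed));
    }

    /// Each stage's share of the total, in percent.
    pub fn shares(&self) -> Vec<(&'static str, f64)> {
        let total: f64 = self.stages.iter().map(|(_, d)| d.as_secs_f64()).sum();
        self.stages
            .iter()
            .map(|&(name, d)| {
                let pct = if total > 0.0 {
                    100.0 * d.as_secs_f64() / total
                } else {
                    0.0
                };
                (name, pct)
            })
            .collect()
    }

    /// Name of the stage with the largest elapsed time, if any was recorded.
    pub fn bottleneck(&self) -> Option<&'static str> {
        self.stages
            .iter()
            .max_by_key(|(_, d)| *d)
            .map(|&(name, _)| name)
    }
}
```

Usage would wrap each stage in turn, for example `let parsed = profile.time("parse", || parse(&input));`, where `parse` is whatever the pipeline already calls. Since `time` returns the closure's value untouched, the emitted bytes are the same with or without profiling.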
