feat(bench): add encode-latency harness and record M1 baseline results by dinesh-git17 · Pull Request #35 · dinesh-git17/bpetite

dinesh-git17 · 2026-04-15T10:45:53Z

Summary

Closes Task 4-4 by landing the encode-latency benchmark harness (scripts/bench_encode.py) and a fully filled-in results document (docs/benchmarks.md) with the v1 baseline measurements captured on the reference Apple M1 machine. Every acceptance criterion for Task 4-4 is now satisfied: the script runs against a saved tokenizer, the doc contains all required machine and timing data, vocab_size=512 training time is recorded (trainer span 4,620.24 ms), and vocab_size=32000 completion is documented (early-stopped at 21,272 merges of 31,744 planned, 185 s command wall clock).

Why

The PRD requires baseline benchmark evidence for the v1 release: one small-vocab training run on TinyShakespeare, one encode-latency measurement over 100 runs of a fixed 50-word sentence, and one large-vocab completion check. Task 4-4 is the Phase 4 deliverable that produces that evidence. This PR is the agreed split: I scaffolded the code and the document, ran the four quality gates, verified the harness end-to-end via uv run, and Dinesh ran the three benchmark commands on the reference machine. The numbers in the doc are real measurements from that run, not placeholders.

Changes

`scripts/bench_encode.py` — new 200-line stdlib-only benchmark harness

Loads a schema v1 tokenizer via `from bpetite import Tokenizer` (installed package path, not internal modules).
Encodes a hard-coded 50-word sentence 100 times by default (configurable via `--runs`).
Measures each call with `time.perf_counter()` and reports two summary statistics:
- p50 via `statistics.median` (average of the 50th and 51st sorted values for N=100)
- p99 via nearest-rank on the sorted sample (`sorted[ceil(0.99 * N) - 1]`, which is `sorted[98]` for N=100); this is the PRD-pinned variant, intentionally distinct from linear interpolation
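
The two statistics above can be sketched as follows. This is a minimal illustration under stated assumptions, not the harness itself: `time_calls_ms`, `p50`, and `p99_nearest_rank` are hypothetical names, and the timed callable stands in for the tokenizer's encode call.

```python
import math
import statistics
import time
from typing import Callable


def time_calls_ms(fn: Callable[[], object], runs: int) -> list[float]:
    """Call fn() `runs` times and return each call's duration in milliseconds."""
    samples: list[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def p50(samples: list[float]) -> float:
    """Median; for even N this averages the two middle sorted values."""
    return statistics.median(samples)


def p99_nearest_rank(samples: list[float]) -> float:
    """Nearest-rank p99: the sorted value at 1-based rank ceil(0.99 * N)."""
    ordered = sorted(samples)
    rank = math.ceil(0.99 * len(ordered))  # 1-based rank, 99 for N=100
    return ordered[rank - 1]               # 0-based index, 98 for N=100


# Synthetic latencies 1.0 .. 100.0 ms, so each value equals its 1-based rank.
synthetic_ms = [float(i) for i in range(1, 101)]
print(p50(synthetic_ms))               # 50.5 (average of 50th and 51st values)
print(p99_nearest_rank(synthetic_ms))  # 99.0 (sorted[98])
```

Nearest-rank always returns a value that was actually observed, whereas linear interpolation can return a value between two observations; that difference is why the variant has to be pinned.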
Output channels match the bpetite CLI contract: compact JSON summary on stdout, human-readable panel on stderr. No `rich` dependency: plain `sys.stderr.write` keeps the script portable.
Defensive 50-word guard at startup: if `_BENCHMARK_SENTENCE.split()` is no longer exactly 50 words, the script aborts with a clean exit 1 and a clear error message before any measurement runs.
`_positive_int` argparse type function rejects `--runs 0` or negative values at parse time with argparse's standard exit-2 usage error, preventing the downstream `StatisticsError` traceback that would otherwise fire on an empty sample.
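
A validator of this kind might look like the sketch below. It is an illustration, not the code in `scripts/bench_encode.py`: `positive_int` and `build_parser` are hypothetical names, and the wording of the non-integer error message is an assumption. Only the `--runs must be >= 1, got 0` message comes from the smoke test recorded in this PR.

```python
import argparse


def positive_int(raw: str) -> int:
    """argparse type function: accept only integers >= 1."""
    try:
        value = int(raw)
    except ValueError:
        # Assumed wording; argparse turns this into a usage error (exit 2).
        raise argparse.ArgumentTypeError(
            f"--runs must be an integer, got {raw!r}"
        ) from None
    if value < 1:
        raise argparse.ArgumentTypeError(f"--runs must be >= 1, got {value}")
    return value


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser exposing only the --runs option."""
    parser = argparse.ArgumentParser(description="encode-latency benchmark sketch")
    parser.add_argument("--runs", type=positive_int, default=100)
    return parser


args = build_parser().parse_args(["--runs", "25"])
print(args.runs)  # 25
```

Raising `argparse.ArgumentTypeError` from the type function lets argparse print its usage line and exit with status 2, so an invalid run count never reaches the statistics code.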
`OSError` load handler catches `IsADirectoryError`, `PermissionError`, and other unreadable-file failures from `Path.read_text()` inside `Tokenizer.load`, mirroring the `_load_model_or_exit` pattern in the CLI. Ordering is load-bearing: `FileNotFoundError` is an `OSError` subclass, so its handler must come first or the broader `except OSError` clause would shadow it.
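
The handler ordering can be illustrated with a small stand-in. This sketch reads a file directly with `Path.read_text()` instead of calling `Tokenizer.load`; `read_model_text_or_exit` is a hypothetical name and the "model not found" wording is an assumption. The "model unreadable" format follows the smoke-test output quoted in this PR.

```python
import sys
from pathlib import Path


def read_model_text_or_exit(path: str) -> str:
    """Read a file, mapping I/O failures to exit code 1 without a traceback."""
    try:
        return Path(path).read_text(encoding="utf-8")
    except FileNotFoundError:
        # Must precede `except OSError`: FileNotFoundError is an OSError
        # subclass, so the broader clause would otherwise catch it first.
        sys.stderr.write(f"Error: model not found: {path}\n")  # assumed wording
        raise SystemExit(1) from None
    except OSError as exc:
        # IsADirectoryError, PermissionError, and other unreadable-file cases.
        sys.stderr.write(f"Error: model unreadable: {path}: {exc}\n")
        raise SystemExit(1) from None
```

With the clauses swapped, a missing file would be reported through the generic "unreadable" branch and the dedicated message would be dead code.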

`docs/benchmarks.md` — new 7.4 KB results document

Jekyll-compatible frontmatter (title, description, slug: benchmarks, order: 2, category: Reference, published: true). Verified that uv run --group docs python docs/site/build.py picks the page up and renders benchmarks.html alongside the PRD, Phase 2, and Phase 3 docs.
Environment section: Apple M1, 8 GB RAM / macOS 26.3.1 (Darwin 25.3.0) / Python 3.12.12 / bpetite commit ce2dea1 (runtime under test; this PR adds the harness on top without touching runtime code).
Training at vocab_size=512: trainer elapsed 4,620.24 ms (full completion, 256/256 merges), corpus 1,115,394 bytes.
Encode latency: 50-word sentence → 171 tokens, p50 3.4399 ms, p99 3.6521 ms, mean 3.4512 ms, range 3.36–3.70 ms (width about 10 % of the median, no long tail).
Demo training at vocab_size=32000: early-stopped at 21,272 merges of 31,744 planned. TinyShakespeare runs out of distinct byte-pair co-occurrences well before the target; this is the documented early-stop path in `_trainer.py`, not a failure. Trainer span (`elapsed_ms`) 184,923.74 ms; outer `time(1)` 3:05.09 total (184.02 s user, 0.89 s sys, 99 % cpu). Actual mergeable vocab 21,528.
Every command on the page is documented for reproduction, including the exact `time uv run bpetite train ...` invocation for the 32000 run.
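
The merge and vocabulary counts reported above are mutually consistent, which can be checked with a few lines of arithmetic. The sketch assumes a base vocabulary of 256 single-byte tokens; that figure is inferred from the reported numbers (512 targets 256 merges, 32000 targets 31,744) rather than stated in this PR, and the helper names are hypothetical.

```python
BASE_VOCAB = 256  # assumption: one base token per byte value


def planned_merges(target_vocab: int) -> int:
    """Merges needed to grow the base vocabulary to the target size."""
    return target_vocab - BASE_VOCAB


def vocab_after(merges_done: int) -> int:
    """Vocabulary size once `merges_done` merges have been learned."""
    return BASE_VOCAB + merges_done


print(planned_merges(512))     # 256   (matches "256/256 merges")
print(planned_merges(32_000))  # 31744 (matches "31,744 planned")
print(vocab_after(21_272))     # 21528 (matches "Actual mergeable vocab 21,528")

# Width of the reported encode-latency range relative to the p50.
range_width_ms = 3.70 - 3.36
print(round(range_width_ms / 3.4399, 3))  # 0.099, i.e. about 10 %
```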

Validation

uv run pytest — 192 passed
uv run ruff check . — clean
uv run ruff format --check . — 31 files already formatted
uv run mypy --strict — no issues found in 30 source files

Manual checks:

End-to-end run verified in this session: corpus download → vocab_size=512 train → encode benchmark → vocab_size=32000 demo train. All four steps exit 0; the numbers in docs/benchmarks.md are real outputs from that run, not scaffolded placeholders.
Four-input smoke test on the harness error paths: happy path (exit 0), `--runs 0` (argparse exit 2 with `--runs must be >= 1, got 0`), `--runs -5` (same argparse exit 2), `--model data` (exit 1 with `Error: model unreadable: data: [Errno 21] Is a directory: 'data'`). No tracebacks on any failure path.
Two Codex review rounds before this PR. Round 1 caught two P2 bugs on the harness (--runs validation, OSError load handler); round 2 caught a P2 reproducibility bug in the doc (bare python vs uv run python) and a P3 accuracy bug in the timing labels (elapsed_ms is the trainer span only, not the full pipeline). Both rounds' findings are fixed in this PR.
Site build independently verified: uv run --group docs python docs/site/build.py emits 10 published pages including the new benchmarks.html, with all three key measurements (4,620.24, 184,923.74, Apple M1) present in the rendered HTML.

Risks / Follow-ups

The benchmark numbers are captured on a single machine (Apple M1, 8 GB RAM) and are not regression-gated. They are a baseline snapshot, not a threshold. If future commits slow the encode path by an order of magnitude, nothing in CI will catch it — the benchmark is demo-only per the task schema.
Task 4-5 (Final README) is the next Phase 4 task. The README will want to link to docs/benchmarks.md as the authoritative performance reference, and should use uv run python / uv run bpetite uniformly for every reproduction command (consistent with this PR's fix for docs/benchmarks.md:43).

github-actions · 2026-04-15T10:46:17Z

bpetite workflows

| Workflow | Status | Comment if failure and where |
| --- | --- | --- |
| tests | success | ok |
| lint | success | ok |
| syntax | success | ok |
| format | success | ok |
| types | success | ok |
| build | success | ok |
| cli-smoke | success | ok |
| determinism | success | ok |
| policy-guard | success | ok |
| ci-meta | pending | waiting |

PR #35: tracked workflows are still running.

feat(bench): add encode-latency harness and record M1 baseline results

9f5382b

github-actions Bot added area/docs type/feat labels Apr 15, 2026

dinesh-git17 merged commit f98d7a8 into main Apr 15, 2026
14 checks passed

dinesh-git17 deleted the feat/benchmark-harness branch April 15, 2026 10:47

dinesh-git17 merged 1 commit into main from feat/benchmark-harness
