feat(bench): add encode-latency harness and record M1 baseline results #35
Merged
Summary
Closes Task 4-4 by landing the encode-latency benchmark harness (`scripts/bench_encode.py`) and a fully filled-in results document (`docs/benchmarks.md`) with the v1 baseline measurements captured on the reference Apple M1 machine. Every acceptance criterion for Task 4-4 is now satisfied: the script runs against a saved tokenizer, the doc contains all required machine and timing data, `vocab_size=512` training time is recorded (trainer span 4,620.24 ms), and `vocab_size=32000` completion is documented (early-stopped at 21,272 merges of 31,744 planned, 185 s command wall clock).

Why
The PRD requires baseline benchmark evidence for the v1 release: one small-vocab training run on TinyShakespeare, one encode-latency measurement over 100 runs of a fixed 50-word sentence, and one large-vocab completion check. Task 4-4 is the Phase 4 deliverable that produces that evidence. This PR is the agreed split: I scaffolded the code and the document, ran the four quality gates, and verified the harness end-to-end via `uv run`; Dinesh ran the three benchmark commands on the reference machine. The numbers in the doc are real measurements from that run, not placeholders.

Changes
`scripts/bench_encode.py`: new 200-line stdlib-only benchmark harness
- `from bpetite import Tokenizer` (installed package path, not internal modules).
- […] `--runs`).
- […] `time.perf_counter()` and reports two summary statistics:
  - `p50` via `statistics.median` (average of the 50th and 51st sorted values for `N=100`)
  - `p99` via nearest-rank on the sorted sample (`sorted[ceil(0.99 * N) - 1]`, which is `sorted[98]` for `N=100`); this is the PRD-pinned variant, intentionally distinct from linear interpolation
- […] `rich` dependency: plain `sys.stderr.write` keeps the script portable.
- […] `_BENCHMARK_SENTENCE.split()` is no longer exactly 50 words, aborts with a clean exit 1 and a clear error message before any measurement runs.
- `_positive_int` argparse type function rejects `--runs 0` or negative values at parse time with argparse's standard exit-2 usage error, preventing the downstream `StatisticsError` traceback that would otherwise fire on an empty sample.
- `OSError` load handler catches `IsADirectoryError`, `PermissionError`, and other unreadable-file failures from `Path.read_text()` inside `Tokenizer.load`, mirroring the `_load_model_or_exit` pattern in the CLI. Ordering is load-bearing: `FileNotFoundError` is caught first because it is an `OSError` subclass.

`docs/benchmarks.md`: new 7.4 KB results document
- […] `title`, `description`, `slug: benchmarks`, `order: 2`, `category: Reference`, `published: true`). Verified that `uv run --group docs python docs/site/build.py` picks the page up and renders `benchmarks.html` alongside the PRD, Phase 2, and Phase 3 docs.
- […] `bpetite` commit `ce2dea1` (runtime under test; this PR adds the harness on top without touching runtime code).
- `vocab_size=512`: trainer elapsed 4,620.24 ms (full completion, 256/256 merges), corpus 1,115,394 bytes.
- `vocab_size=32000`: early-stopped at 21,272 merges of 31,744 planned (TinyShakespeare runs out of distinct byte-pair co-occurrences well before the target; this is the documented early-stop path in `_trainer.py`, not a failure). Trainer elapsed (`elapsed_ms`) 184,923.74 ms; outer `time(1)` 3:05.09 total (184.02 s user, 0.89 s sys, 99% cpu). Actual mergeable vocab 21,528.
- […] `time uv run bpetite train ...` invocation for the 32000 run.

Validation
- `uv run pytest`: 192 passed
- `uv run ruff check .`: clean
- `uv run ruff format --check .`: 31 files already formatted
- `uv run mypy --strict`: no issues found in 30 source files

Manual checks:
- […] `vocab_size=512` train → encode benchmark → `vocab_size=32000` demo train. All four steps exit 0; the numbers in `docs/benchmarks.md` are real outputs from that run, not scaffolded placeholders.
- `--runs 0` (argparse exit 2 with `--runs must be >= 1, got 0`), `--runs -5` (same argparse exit 2), `--model data` (exit 1 with `Error: model unreadable: data: [Errno 21] Is a directory: 'data'`). No tracebacks on any failure path.
- […] `--runs` validation, `OSError` load handler); round 2 caught a P2 reproducibility bug in the doc (bare `python` vs `uv run python`) and a P3 accuracy bug in the timing labels (`elapsed_ms` is the trainer span only, not the full pipeline). Both rounds' findings are fixed in this PR.
- `uv run --group docs python docs/site/build.py` emits 10 published pages including the new `benchmarks.html`, with all three key measurements (`4,620.24`, `184,923.74`, `Apple M1`) present in the rendered HTML.

Risks / Follow-ups
- […] `docs/benchmarks.md` as the authoritative performance reference, and should use `uv run python` / `uv run bpetite` uniformly for every reproduction command (consistent with this PR's fix for `docs/benchmarks.md:43`).
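
For reviewers who want to sanity-check the statistics and the `--runs` guard described under Changes, here is a minimal standalone sketch. It is not the code in `scripts/bench_encode.py`: the names `positive_int`, `p50`, `p99_nearest_rank`, and `build_parser` are hypothetical, the wording of the non-integer error message is an assumption, and the nearest-rank index is computed with integer arithmetic rather than `ceil(0.99 * N)` to avoid float rounding (the two agree for `N=100`).

```python
import argparse
import statistics


def positive_int(value: str) -> int:
    """argparse type function (hypothetical name): accept only integers >= 1."""
    try:
        parsed = int(value)
    except ValueError:
        # Message wording here is an assumption, not taken from the harness.
        raise argparse.ArgumentTypeError(f"--runs must be an integer, got {value!r}")
    if parsed < 1:
        raise argparse.ArgumentTypeError(f"--runs must be >= 1, got {parsed}")
    return parsed


def build_parser() -> argparse.ArgumentParser:
    """Parser exposing only the --runs flag, with the default of 100 runs."""
    parser = argparse.ArgumentParser(description="encode-latency sketch")
    parser.add_argument("--runs", type=positive_int, default=100)
    return parser


def p50(samples: list[float]) -> float:
    """Median; for an even sample size this averages the two middle values."""
    return statistics.median(samples)


def p99_nearest_rank(samples: list[float]) -> float:
    """Nearest-rank p99: the value at 1-based rank ceil(0.99 * N)."""
    if not samples:
        raise ValueError("p99 requires at least one sample")
    ordered = sorted(samples)
    n = len(ordered)
    rank = -(-99 * n // 100)  # integer ceil(99 * n / 100)
    return ordered[rank - 1]
```

With the sample `1.0, 2.0, ..., 100.0`, `p50` returns `50.5` and `p99_nearest_rank` returns `99.0` (that is, `sorted[98]`). Parsing `--runs 0` or `--runs -5` through `build_parser()` exits with argparse's status 2 before any measurement would start, which is why the empty-sample `StatisticsError` path is never reached.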