feat(bench): add encode-latency harness and record M1 baseline results #35

Merged
dinesh-git17 merged 1 commit into main from feat/benchmark-harness
Apr 15, 2026

@dinesh-git17

Owner

Summary

Closes Task 4-4 by landing the encode-latency benchmark harness (scripts/bench_encode.py) and a fully filled-in results document (docs/benchmarks.md) with the v1 baseline measurements captured on the reference Apple M1 machine. Every acceptance criterion for Task 4-4 is now satisfied: the script runs against a saved tokenizer, the doc contains all required machine and timing data, vocab_size=512 training time is recorded (trainer span 4,620.24 ms), and vocab_size=32000 completion is documented (early-stopped at 21,272 merges of 31,744 planned, 185 s command wall clock).

Why

The PRD requires baseline benchmark evidence for the v1 release: one small-vocab training run on TinyShakespeare, one encode-latency measurement over 100 runs of a fixed 50-word sentence, and one large-vocab completion check. Task 4-4 is the Phase 4 deliverable that produces that evidence. This PR is the agreed split: I scaffolded the code and the document, ran the four quality gates, verified the harness end-to-end via uv run, and Dinesh ran the three benchmark commands on the reference machine. The numbers in the doc are real measurements from that run, not placeholders.

Changes

scripts/bench_encode.py — new 200-line stdlib-only benchmark harness

  • Loads a schema v1 tokenizer via from bpetite import Tokenizer (installed package path, not internal modules).
  • Encodes a hard-coded 50-word sentence 100 times by default (configurable via --runs).
  • Measures each call with time.perf_counter() and reports two summary statistics:
    • p50 via statistics.median (average of the 50th and 51st sorted values for N=100)
    • p99 via nearest-rank on the sorted sample (sorted[ceil(0.99 * N) - 1], which is sorted[98] for N=100) — the PRD-pinned variant, intentionally distinct from linear interpolation
  • Output channels match the bpetite CLI contract: compact JSON summary on stdout, human-readable panel on stderr. No rich dependency — plain sys.stderr.write keeps the script portable.
  • Defensive 50-word guard at startup: if _BENCHMARK_SENTENCE.split() is no longer exactly 50 words, aborts with a clean exit-1 and a clear error message before any measurement runs.
  • _positive_int argparse type function rejects --runs 0 or negative values at parse time with argparse's standard exit-2 usage error, preventing the downstream StatisticsError traceback that would otherwise fire on an empty sample.
  • OSError load handler catches IsADirectoryError, PermissionError, and other unreadable-file failures from Path.read_text() inside Tokenizer.load, mirroring the _load_model_or_exit pattern in the CLI. Ordering is load-bearing: the FileNotFoundError handler must come before the generic OSError handler, because FileNotFoundError is an OSError subclass and would otherwise be caught by the generic handler.
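
The timing loop and the two summary statistics described above can be sketched as follows. This is a minimal illustration, not the code in scripts/bench_encode.py: the function names (time_calls, nearest_rank_p99, summarize) and the output keys are hypothetical, and the nearest-rank index is computed with integer arithmetic here to sidestep float rounding in ceil(0.99 * N).

```python
import statistics
import time
from typing import Callable, Dict, List, Sequence


def time_calls(fn: Callable[[], object], runs: int) -> List[float]:
    """Call fn `runs` times and return per-call latencies in milliseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def nearest_rank_p99(samples: Sequence[float]) -> float:
    """p99 by nearest rank: the value at sorted[ceil(0.99 * N) - 1]."""
    ordered = sorted(samples)
    n = len(ordered)
    rank = (99 * n + 99) // 100  # integer form of ceil(99 * n / 100)
    return ordered[rank - 1]


def summarize(samples: Sequence[float]) -> Dict[str, float]:
    """Hypothetical summary shape; the real JSON keys are not shown in this PR."""
    return {
        "runs": len(samples),
        # For even N, statistics.median averages the two middle values.
        "p50_ms": statistics.median(samples),
        "p99_ms": nearest_rank_p99(samples),
    }
```

For N=100 the rank is 99, so the sketch returns sorted[98], and the median is the mean of the 50th and 51st sorted values. Linear interpolation would instead blend sorted[98] and sorted[99], which is why the two p99 variants can disagree on the same sample.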

docs/benchmarks.md — new 7.4 KB results document

  • Jekyll-compatible frontmatter (title, description, slug: benchmarks, order: 2, category: Reference, published: true). Verified that uv run --group docs python docs/site/build.py picks the page up and renders benchmarks.html alongside the PRD, Phase 2, and Phase 3 docs.
  • Environment section: Apple M1, 8 GB RAM / macOS 26.3.1 (Darwin 25.3.0) / Python 3.12.12 / bpetite commit ce2dea1 (runtime under test; this PR adds the harness on top without touching runtime code).
  • Training at vocab_size=512: trainer elapsed 4,620.24 ms (full completion, 256/256 merges), corpus 1,115,394 bytes.
  • Encode latency: 50-word sentence → 171 tokens, p50 3.4399 ms, p99 3.6521 ms, mean 3.4512 ms, range 3.36–3.70 ms (total width about 10 % of the median, no long tail).
  • Demo training at vocab_size=32000: early-stopped at 21,272 merges of 31,744 planned (TinyShakespeare runs out of distinct byte-pair co-occurrences well before the target; this is the documented early-stop path in _trainer.py, not a failure). Trainer elapsed_ms 184,923.74 ms; outer time(1) 3:05.09 total (184.02 s user, 0.89 s sys, 99 % cpu). Actual mergeable vocab 21,528.
  • Every command on the page is documented for reproduction, including the exact time uv run bpetite train ... invocation for the 32000 run.
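
The merge counts and timing figures above are mutually consistent, which a few lines of arithmetic can confirm. The sketch assumes 256 base byte tokens, which is inferred from the 256/256 and 31,744 figures rather than stated in this PR; the variable names are illustrative only.

```python
# Assumption: a byte-level vocabulary starts with 256 base tokens,
# so planned merges = target vocab_size - 256.
BASE_BYTE_TOKENS = 256

small_planned_merges = 512 - BASE_BYTE_TOKENS     # vocab_size=512 run
large_planned_merges = 32_000 - BASE_BYTE_TOKENS  # vocab_size=32000 run

completed_merges = 21_272                         # early-stop point reported above
final_vocab = BASE_BYTE_TOKENS + completed_merges

# time(1) reported 3:05.09 total, 184.02 s user, 0.89 s sys.
wall_seconds = 3 * 60 + 5.09
cpu_fraction = (184.02 + 0.89) / wall_seconds

print(small_planned_merges, large_planned_merges, final_vocab)
print(f"wall={wall_seconds:.2f}s cpu={cpu_fraction:.3f}")
```

Under that assumption the planned merge counts come out to 256 and 31,744, the final vocabulary to 21,528, and the CPU fraction to roughly 0.999, in line with the reported 99 %.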

Validation

  • uv run pytest — 192 passed
  • uv run ruff check . — clean
  • uv run ruff format --check . — 31 files already formatted
  • uv run mypy --strict — no issues found in 30 source files

Manual checks:

  • End-to-end run verified in this session: corpus download → vocab_size=512 train → encode benchmark → vocab_size=32000 demo train. All four steps exit 0; the numbers in docs/benchmarks.md are real outputs from that run, not scaffolded placeholders.
  • Four-input smoke test on the harness error paths: happy path (exit 0), --runs 0 (argparse exit 2 with --runs must be >= 1, got 0), --runs -5 (same argparse exit 2), --model data (exit 1 with Error: model unreadable: data: [Errno 21] Is a directory: 'data'). No tracebacks on any failure path.
  • Two Codex review rounds before this PR. Round 1 caught two P2 bugs on the harness (--runs validation, OSError load handler); round 2 caught a P2 reproducibility bug in the doc (bare python vs uv run python) and a P3 accuracy bug in the timing labels (elapsed_ms is the trainer span only, not the full pipeline). Both rounds' findings are fixed in this PR.
  • Site build independently verified: uv run --group docs python docs/site/build.py emits 10 published pages including the new benchmarks.html, with all three key measurements (4,620.24, 184,923.74, Apple M1) present in the rendered HTML.
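
The two harness error paths exercised by the smoke test can be sketched like this. It is an approximation under stated assumptions, not the harness source: positive_int, build_parser, and read_model_or_exit are hypothetical names, the model file is read directly with Path.read_text() rather than through Tokenizer.load, and the "model not found" message is invented for illustration (only the "model unreadable" message and the --runs message appear in this PR).

```python
import argparse
import sys
from pathlib import Path


def positive_int(raw: str) -> int:
    """argparse type function: accept only integers >= 1."""
    value = int(raw)  # a ValueError here is also turned into a usage error by argparse
    if value < 1:
        # argparse reports ArgumentTypeError as a usage error and exits with status 2.
        raise argparse.ArgumentTypeError(f"--runs must be >= 1, got {value}")
    return value


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="bench_encode")
    parser.add_argument("--model", type=Path)
    parser.add_argument("--runs", type=positive_int, default=100)
    return parser


def read_model_or_exit(path: Path) -> str:
    """Read the model file, exiting with status 1 and no traceback on failure."""
    try:
        return path.read_text(encoding="utf-8")
    except FileNotFoundError:
        # Must precede the OSError clause: FileNotFoundError is an OSError subclass.
        sys.stderr.write(f"Error: model not found: {path}\n")  # hypothetical wording
        raise SystemExit(1) from None
    except OSError as exc:
        # Covers IsADirectoryError, PermissionError, and similar failures.
        sys.stderr.write(f"Error: model unreadable: {path}: {exc}\n")
        raise SystemExit(1) from None
```

Rejecting bad --runs values in the type function keeps the failure at parse time, so the measurement code never sees an empty sample. Swapping the two except clauses would still exit 1 on a missing file, but with the generic "unreadable" message instead of the more specific one.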

Risks / Follow-ups

  • The benchmark numbers are captured on a single machine (Apple M1, 8 GB RAM) and are not regression-gated. They are a baseline snapshot, not a threshold. If future commits slow the encode path by an order of magnitude, nothing in CI will catch it — the benchmark is demo-only per the task schema.
  • Task 4-5 (Final README) is the next Phase 4 task. The README will want to link to docs/benchmarks.md as the authoritative performance reference, and should use uv run python / uv run bpetite uniformly for every reproduction command (consistent with this PR's fix for docs/benchmarks.md:43).

@github-actions

bpetite workflows

| Workflow | Status | Comment if failure and where |
| --- | --- | --- |
| tests | success | ok |
| lint | success | ok |
| syntax | success | ok |
| format | success | ok |
| types | success | ok |
| build | success | ok |
| cli-smoke | success | ok |
| determinism | success | ok |
| policy-guard | success | ok |
| ci-meta | pending | waiting |

PR #35: tracked workflows are still running.

@dinesh-git17 dinesh-git17 merged commit f98d7a8 into main Apr 15, 2026
14 checks passed
@dinesh-git17 dinesh-git17 deleted the feat/benchmark-harness branch April 15, 2026 10:47