feat(cli): implement train/encode/decode with rich stderr UX#32
Merged
Conversation
bpetite workflows
PR #32: tracked workflows are still running. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Implements Task 4-1: the
bpetiteCLI withtrain,encode, anddecodesubcommands. Rich powers the stderr presentation (ASCII banner, config /
completion panels, styled lifecycle lines, error panels) while
sys.stdout.writeholds the stdout contract byte-for-byte (FR-33 / FR-34).The banner and interactive panels are TTY-gated via
is_fully_interactive()so machine consumers see clean stdout and an empty success-stderr.
Why
Phase 4 kickoff. With Phase 3 closed, the core library has a settled public
API; this PR exposes it behind a user-facing entry point. The "beautiful
Rich UX" question deferred during Phases 1-3 was resolved here:
richisadopted as a runtime dep alongside
regex;structlogwas explicitlyrejected as ceremony without payoff for a three-subcommand tool.
Changes
New code
src/bpetite/_cli.py— argparse withtrain,encode,decodesubcommands wired to the existing
TokenizerAPI. Strict stdout/stderrchannel discipline: machine-readable results (
trainJSON summary,encodecompact JSON array,decoderaw text) go viasys.stdout.writewith no Rich involvement; everything else routes through the shared stderr
Console. The internaltrain_bpefunction is imported directly so theCLI can pass a progress callback without touching the public
Tokenizer.trainsignature (FR-30).src/bpetite/_ui.py— shared presentation module: stderrConsolewithsemantic
Theme,is_fully_interactive()gate (requires both streamsto be TTYs),
render_banner(TTY + width gated),render_kv_box(wraps values in
Text(...)to bypass Rich markup parsing),render_error(escapes dynamic content viarich.markup.escape).src/bpetite/_banner.txt— 13-line ASCII brand banner, loaded lazilyby
_ui.pyviaPath(__file__).parent, ships inside the wheel.Contract-critical details
trainreads the corpus withPath(path).read_bytes().decode(\"utf-8\")instead of
read_text()so CRLF and bare-CR line endings are preserved— universal-newlines translation would silently change the byte sequence
a byte-level tokenizer trains on.
trainpre-flights the--outputpath before any training runs, so"already exists" and "missing parent directory" errors fire in < 100 ms
instead of after minutes of training.
OSErrorafter the specificFileNotFoundError/FileExistsErrorclauses, soIsADirectoryError,PermissionError, and friends produce clean stderr panels instead ofPython tracebacks.
console.printlifecycle lines (
Training started: planned=N,Training merges: K / N,Training complete: merges=M) rather than arich.progress.Progressbar. An earlier implementation using
Progressproduced subtle renderingerrors on zero-merge, early-stop, and invalid-input runs; plain styled
lines sidestep every edge case while still rendering beautifully
through the themed console.
is_fully_interactive()sosubprocess.run(..., capture_output=True)andids=\$(bpetite encode ...)see empty stderr on success, satisfying theresult.stderr == \"\"assertion in the cli-contract skill.Governance updates (in-scope for Task 4-1)
pyproject.toml:rich>=13.0added as a runtime dependency;uv.lockregenerated.
CLAUDE.md,docs/bpetite-prd-v2.md,docs/bpetite-task-list.md:runtime-dep rule updated from "regex only" to "regex and rich"
consistently across all governance sources, with the constraint that
richis scoped to the CLI presentation layer and must never beimported from the core algorithm path.
.claude/skills/rich-cli/SKILL.md: new skill installed, plus acarve-out in the Forbidden Patterns table allowing branded ASCII
banners when TTY-gated, width-gated, and loaded from a shipped asset.
.claude/skills/cli-contract/SKILL.md: Section 6 rewritten to describethe plain-lifecycle-lines architecture (with full rationale for not
using
rich.progress); Section 10 test assertions updated to\"Training started\"+\"Training complete\"; the pre-existing_run_training/(int, int)callback description was also correctedto match the actual
train_bpe/ProgressEventsignature from_trainer.py..claude/skills/bpetite-conventions/SKILL.mdand.claude/skills/task-executor/SKILL.md: runtime-dep rule updated tomatch.
Deleted:
BPETITE_ASCII.txtat the repo root — moved intosrc/bpetite/_banner.txtso the banner ships with the wheel and isloadable from the installed package.
Validation
uv run pytest— 180 passeduv run ruff check .— cleanuv run ruff format --check .— cleanuv run mypy --strict— cleanAcceptance criteria verified one-by-one against the real CLI:
trainwrites progress updates only to stderr — PASS (lifecycle lineson stderr, JSON summary on stdout)
trainwrites a machine-readable JSON summary only to stdout — PASS(exactly the five required keys,
sys.stdout.writepath)encodewrites a compact JSON array only to stdout — PASS(
[104,105]no spaces, stderr 0 bytes on captured runs)decodewrites raw decoded text only to stdout — PASS(
end=\"\"equivalent, stderr 0 bytes on captured runs)decoded bytes fail non-zero and write only to stderr — PASS
(10-case matrix including directory-path
OSErrors, markup-bearingpaths, CRLF inputs, and the
Invalid vocab sizepath)Review history
This PR went through five rounds of Codex review during implementation.
Each round's findings were verified and fixed in the same PR:
is_fully_interactive(). Save errors fired after training →pre-flight added.
Path.read_text()stripped CRLF → switched toread_bytes().decode(\"utf-8\").IsADirectoryError/PermissionErrorescaped as tracebacks →
except OSErroradded across three helpers.is_terminalgate only checked stderr → extended to both streams.Rich markup parser ate bracket-bearing paths →
Text(...)wrap in_kv_tableandrich.markup.escapeinrender_error. Early-stopbar stuck at partial progress →
total=merges_completedon completeevent.
completion → lazy task creation.
→ dropped
rich.progress.Progressentirely in favor of plainstyled lifecycle lines.
Interactive smoke-testing by hand confirmed the banner, panels, progress
lines, and error panels render as designed across
train,encode,and
decode.Risks / Follow-ups
tests/test_cli.pywill subprocess the installed entry point andassert the full stdout/stderr contract per the updated
cli-contractskill Section 10.
cli-smokeanddeterminism) have been no-op as
pre-cli-mainsince PR test(fixtures): add deterministic corpora and shared conftest #11. Worthverifying both flip to real checks when this PR lands on
main.feat(tokenizer): add public Tokenizer class wiring phase-3 layers #28-30), the 4-1 strikethrough will land as its own docs commit
after the main PR merges.