feat(cli): implement train/encode/decode with rich stderr UX by dinesh-git17 · Pull Request #32 · dinesh-git17/bpetite

dinesh-git17 · 2026-04-15T07:22:25Z

Summary

Implements Task 4-1: the bpetite CLI with train, encode, and decode
subcommands. Rich powers the stderr presentation (ASCII banner, config /
completion panels, styled lifecycle lines, error panels) while
sys.stdout.write holds the stdout contract byte-for-byte (FR-33 / FR-34).
The banner and interactive panels are TTY-gated via is_fully_interactive()
so machine consumers see clean stdout and an empty success-stderr.
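
A minimal sketch of the TTY gate described above, assuming it reduces to an isatty() check on both streams. The optional stream parameters and the FakeTTY helper are hypothetical additions for testability; the PR only names is_fully_interactive() with no arguments.

```python
import io
import sys
from typing import Optional, TextIO


def is_fully_interactive(
    stdout: Optional[TextIO] = None, stderr: Optional[TextIO] = None
) -> bool:
    """Return True only when BOTH streams are attached to a terminal.

    The parameters are an assumption made for testing; called with no
    arguments this inspects the live sys.stdout / sys.stderr.
    """
    out = sys.stdout if stdout is None else stdout
    err = sys.stderr if stderr is None else stderr
    return out.isatty() and err.isatty()


class FakeTTY(io.StringIO):
    """Hypothetical test double: an in-memory stream that claims to be a TTY."""

    def isatty(self) -> bool:
        return True
```

With a gate like this, piping or capturing either stream is enough to suppress the banner and panels, which matches the "requires both streams to be TTYs" wording in the Changes section.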

Why

Phase 4 kickoff. With Phase 3 closed, the core library has a settled public
API; this PR exposes it behind a user-facing entry point. The "beautiful
Rich UX" question deferred during Phases 1-3 was resolved here: rich is
adopted as a runtime dep alongside regex; structlog was explicitly
rejected as ceremony without payoff for a three-subcommand tool.

Changes

New code

src/bpetite/_cli.py — argparse with train, encode, decode
subcommands wired to the existing Tokenizer API. Strict stdout/stderr
channel discipline: machine-readable results (train JSON summary,
encode compact JSON array, decode raw text) go via sys.stdout.write
with no Rich involvement; everything else routes through the shared stderr
Console. The internal train_bpe function is imported directly so the
CLI can pass a progress callback without touching the public
Tokenizer.train signature (FR-30).
src/bpetite/_ui.py — shared presentation module: stderr Console with
semantic Theme, is_fully_interactive() gate (requires both streams
to be TTYs), render_banner (TTY + width gated), render_kv_box
(wraps values in Text(...) to bypass Rich markup parsing),
render_error (escapes dynamic content via rich.markup.escape).
src/bpetite/_banner.txt — 13-line ASCII brand banner, loaded lazily
by _ui.py via Path(__file__).parent, ships inside the wheel.
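
The stdout half of the channel discipline can be sketched with the standard library alone. The function names and the optional `out` parameter below are hypothetical, not the names used in _cli.py; the sketch also writes no trailing newline for encode, which the PR does not specify.

```python
import io
import json
import sys
from typing import Optional, Sequence, TextIO


def write_encode_result(ids: Sequence[int], out: Optional[TextIO] = None) -> None:
    """Write token IDs as a compact JSON array (no spaces) to stdout."""
    stream = sys.stdout if out is None else out
    # separators=(",", ":") removes the default ", " and ": " padding.
    stream.write(json.dumps(list(ids), separators=(",", ":")))


def write_decode_result(text: str, out: Optional[TextIO] = None) -> None:
    """Write decoded text verbatim, with no added newline or styling."""
    stream = sys.stdout if out is None else out
    stream.write(text)
```

Keeping these writes free of Rich means no wrapping, markup parsing, or ANSI styling can alter the bytes a downstream consumer reads.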

Contract-critical details

train reads the corpus with Path(path).read_bytes().decode("utf-8")
instead of read_text() so CRLF and bare-CR line endings are preserved
— universal-newlines translation would silently change the byte sequence
a byte-level tokenizer trains on.
train pre-flights the --output path before any training runs, so
"already exists" and "missing parent directory" errors fire in < 100 ms
instead of after minutes of training.
All filesystem error paths catch OSError after the specific
FileNotFoundError / FileExistsError clauses, so IsADirectoryError,
PermissionError, and friends produce clean stderr panels instead of
Python tracebacks.
Training progress is rendered as three plain styled console.print
lifecycle lines (Training started: planned=N, Training merges: K / N,
Training complete: merges=M) rather than a rich.progress.Progress
bar. An earlier implementation using Progress produced subtle rendering
errors on zero-merge, early-stop, and invalid-input runs; plain styled
lines avoid all three cases while still picking up the console theme.
Encode / decode panels are gated on is_fully_interactive() so
subprocess.run(..., capture_output=True) and
ids=$(bpetite encode ...) see empty stderr on success, satisfying the
result.stderr == "" assertion in the cli-contract skill.
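
The newline point in the first item above can be reproduced with the standard library alone. This is a small illustration, not code from the PR; read_corpus and compare_reads are hypothetical names.

```python
import tempfile
from pathlib import Path


def read_corpus(path: str) -> str:
    """Read a corpus as UTF-8 without universal-newlines translation."""
    return Path(path).read_bytes().decode("utf-8")


def compare_reads() -> tuple:
    """Return (translated, preserved) for a file with mixed line endings."""
    with tempfile.TemporaryDirectory() as tmp:
        corpus = Path(tmp) / "corpus.txt"
        corpus.write_bytes(b"a\r\nb\rc\n")
        # read_text() opens in text mode, so \r\n and bare \r become \n.
        translated = corpus.read_text(encoding="utf-8")
        # read_bytes().decode() keeps every byte of the original file.
        preserved = read_corpus(str(corpus))
    return translated, preserved
```

Here the text-mode read yields "a\nb\nc\n", while the byte-level read keeps "a\r\nb\rc\n", which is the sequence a byte-level tokenizer should train on.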

Governance updates (in-scope for Task 4-1)

pyproject.toml: rich>=13.0 added as a runtime dependency; uv.lock
regenerated.
CLAUDE.md, docs/bpetite-prd-v2.md, docs/bpetite-task-list.md:
runtime-dep rule updated from "regex only" to "regex and rich"
consistently across all governance sources, with the constraint that
rich is scoped to the CLI presentation layer and must never be
imported from the core algorithm path.
.claude/skills/rich-cli/SKILL.md: new skill installed, plus a
carve-out in the Forbidden Patterns table allowing branded ASCII
banners when TTY-gated, width-gated, and loaded from a shipped asset.
.claude/skills/cli-contract/SKILL.md: Section 6 rewritten to describe
the plain-lifecycle-lines architecture (with full rationale for not
using rich.progress); Section 10 test assertions updated to
"Training started" + "Training complete"; the pre-existing
_run_training / (int, int) callback description was also corrected
to match the actual train_bpe / ProgressEvent signature from
_trainer.py.
.claude/skills/bpetite-conventions/SKILL.md and
.claude/skills/task-executor/SKILL.md: runtime-dep rule updated to
match.

Deleted:

BPETITE_ASCII.txt at the repo root — moved into
src/bpetite/_banner.txt so the banner ships with the wheel and is
loadable from the installed package.

Validation

uv run pytest — 180 passed
uv run ruff check . — clean
uv run ruff format --check . — clean
uv run mypy --strict — clean

Acceptance criteria verified one-by-one against the real CLI:

train writes progress updates only to stderr — PASS (lifecycle lines
on stderr, JSON summary on stdout)
train writes a machine-readable JSON summary only to stdout — PASS
(exactly the five required keys, sys.stdout.write path)
encode writes a compact JSON array only to stdout — PASS
([104,105] no spaces, stderr 0 bytes on captured runs)
decode writes raw decoded text only to stdout — PASS
(end="" equivalent, stderr 0 bytes on captured runs)
Missing files, invalid UTF-8 inputs, unknown token IDs, and invalid
decoded bytes fail non-zero and write only to stderr — PASS
(10-case matrix including directory-path OSErrors, markup-bearing
paths, CRLF inputs, and the Invalid vocab size path)
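
The captured-run checks above could be scripted roughly as follows. This sketch uses a stand-in child process rather than the real bpetite entry point, so it only illustrates the shape of the assertions; CHILD, run_captured, and check_encode_contract are hypothetical names.

```python
import json
import subprocess
import sys

# Stand-in for the CLI: a child process that writes a compact JSON array
# to stdout and nothing to stderr. This is NOT the real bpetite command.
CHILD = 'import sys; sys.stdout.write("[104,105]")'


def run_captured(argv: list) -> subprocess.CompletedProcess:
    """Run a command with both streams captured as text."""
    return subprocess.run(argv, capture_output=True, text=True, check=False)


def check_encode_contract(result: subprocess.CompletedProcess) -> list:
    """Assert the success-path contract and return the parsed IDs."""
    assert result.returncode == 0
    assert result.stderr == ""          # empty stderr on success
    assert " " not in result.stdout     # compact array, no spaces
    return json.loads(result.stdout)
```

Because both pipes are captured, neither stream is a TTY in the child, which is the condition under which the decorative stderr output is expected to stay silent.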

Review history

This PR went through five rounds of Codex review during implementation.
Each round's findings were verified and fixed in the same PR:

1. Stderr decoration leaked on encode/decode success → gated on
   is_fully_interactive(). Save errors fired after training →
   pre-flight added.
2. Path.read_text() translated CRLF to LF → switched to
   read_bytes().decode("utf-8"). IsADirectoryError / PermissionError
   escaped as tracebacks → except OSError added across three helpers.
3. is_terminal gate only checked stderr → extended to both streams.
   Rich markup parser ate bracket-bearing paths → Text(...) wrap in
   _kv_table and rich.markup.escape in render_error. Early-stop
   bar stuck at partial progress → total=merges_completed on complete
   event.
4. Stale progress task on invalid vocab and 0/0 bar on zero-merge
   completion → lazy task creation.
5. Start and zero-merge completion events had no visible stderr output
   → dropped rich.progress.Progress entirely in favor of plain
   styled lifecycle lines.
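
The pre-flight and OSError fixes from rounds one and two follow a common pattern, sketched here under assumptions: the helper names, return shapes, and message strings are hypothetical and are not taken from _cli.py.

```python
import tempfile
from pathlib import Path
from typing import Optional


def preflight_output(path: str) -> Optional[str]:
    """Return an error message if the output path is unusable, else None.

    Intended to run before training so failures surface immediately.
    """
    target = Path(path)
    if target.exists():
        return f"Output already exists: {path}"
    if not target.parent.is_dir():
        return f"Parent directory does not exist: {target.parent}"
    return None


def read_input(path: str) -> tuple:
    """Return (text, None) on success or (None, message) on failure."""
    try:
        return Path(path).read_bytes().decode("utf-8"), None
    except FileNotFoundError:
        return None, f"File not found: {path}"
    except UnicodeDecodeError:
        return None, f"Input is not valid UTF-8: {path}"
    except OSError as exc:
        # Catch-all placed AFTER the specific clauses so that
        # IsADirectoryError, PermissionError, etc. never reach a traceback.
        return None, f"Cannot read {path}: {exc.strerror or exc}"
```

Clause order matters because FileNotFoundError is a subclass of OSError: a leading except OSError would swallow the more specific message.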

Interactive smoke-testing by hand confirmed the banner, panels, progress
lines, and error panels render as designed across train, encode,
and decode.

Risks / Follow-ups

Task 4-2 (CLI contract tests) is the next deliverable.
tests/test_cli.py will subprocess the installed entry point and
assert the full stdout/stderr contract per the updated cli-contract
skill Section 10.
CI gates: two phased-rollout required gates (cli-smoke and
determinism) have been no-op as pre-cli-main since PR #11
(test(fixtures): add deterministic corpora and shared conftest). Worth
verifying both flip to real checks when this PR lands on main.
Task list closure: following the Phase 3 pattern (docs(tasks): close
phase 3 task list with strikethroughs, #31, after feat(tokenizer): add
public Tokenizer class wiring phase-3 layers, #28-30), the 4-1
strikethrough will land as its own docs commit after the main PR merges.

github-actions · 2026-04-15T07:22:47Z

bpetite workflows

Workflow	Status	Comment if failure and where
tests	success	ok
lint	success	ok
syntax	success	ok
format	success	ok
types	success	ok
build	success	ok
cli-smoke	success	ok
determinism	success	ok
policy-guard	success	ok
ci-meta	pending	waiting

PR #32: tracked workflows are still running.

feat(cli): implement train/encode/decode with rich stderr UX

1c742ad

github-actions Bot added area/core area/cli area/docs type/feat automation/dependencies labels Apr 15, 2026

dinesh-git17 merged commit bc4b860 into main Apr 15, 2026
16 checks passed

dinesh-git17 deleted the feat/cli-train-encode-decode branch April 15, 2026 07:23
