Skip to content

feat(cli): implement train/encode/decode with rich stderr UX#32

Merged
dinesh-git17 merged 1 commit into
mainfrom
feat/cli-train-encode-decode
Apr 15, 2026
Merged

feat(cli): implement train/encode/decode with rich stderr UX#32
dinesh-git17 merged 1 commit into
mainfrom
feat/cli-train-encode-decode

Conversation

@dinesh-git17

Copy link
Copy Markdown
Owner

Summary

Implements Task 4-1: the bpetite CLI with train, encode, and decode
subcommands. Rich powers the stderr presentation (ASCII banner, config /
completion panels, styled lifecycle lines, error panels) while
sys.stdout.write holds the stdout contract byte-for-byte (FR-33 / FR-34).
The banner and interactive panels are TTY-gated via is_fully_interactive()
so machine consumers see clean stdout and an empty success-stderr.

Why

Phase 4 kickoff. With Phase 3 closed, the core library has a settled public
API; this PR exposes it behind a user-facing entry point. The "beautiful
Rich UX" question deferred during Phases 1-3 was resolved here: rich is
adopted as a runtime dep alongside regex; structlog was explicitly
rejected as ceremony without payoff for a three-subcommand tool.

Changes

New code

  • src/bpetite/_cli.py — argparse with train, encode, decode
    subcommands wired to the existing Tokenizer API. Strict stdout/stderr
    channel discipline: machine-readable results (train JSON summary,
    encode compact JSON array, decode raw text) go via sys.stdout.write
    with no Rich involvement; everything else routes through the shared stderr
    Console. The internal train_bpe function is imported directly so the
    CLI can pass a progress callback without touching the public
    Tokenizer.train signature (FR-30).
  • src/bpetite/_ui.py — shared presentation module: stderr Console with
    semantic Theme, is_fully_interactive() gate (requires both streams
    to be TTYs), render_banner (TTY + width gated), render_kv_box
    (wraps values in Text(...) to bypass Rich markup parsing),
    render_error (escapes dynamic content via rich.markup.escape).
  • src/bpetite/_banner.txt — 13-line ASCII brand banner, loaded lazily
    by _ui.py via Path(__file__).parent, ships inside the wheel.

Contract-critical details

  • train reads the corpus with Path(path).read_bytes().decode(\"utf-8\")
    instead of read_text() so CRLF and bare-CR line endings are preserved
    — universal-newlines translation would silently change the byte sequence
    a byte-level tokenizer trains on.
  • train pre-flights the --output path before any training runs, so
    "already exists" and "missing parent directory" errors fire in < 100 ms
    instead of after minutes of training.
  • All filesystem error paths catch OSError after the specific
    FileNotFoundError / FileExistsError clauses, so IsADirectoryError,
    PermissionError, and friends produce clean stderr panels instead of
    Python tracebacks.
  • Training progress is rendered as three plain styled console.print
    lifecycle lines (Training started: planned=N, Training merges: K / N,
    Training complete: merges=M) rather than a rich.progress.Progress
    bar. An earlier implementation using Progress produced subtle rendering
    errors on zero-merge, early-stop, and invalid-input runs; plain styled
    lines sidestep every edge case while still rendering beautifully
    through the themed console.
  • Encode / decode panels are gated on is_fully_interactive() so
    subprocess.run(..., capture_output=True) and
    ids=\$(bpetite encode ...) see empty stderr on success, satisfying the
    result.stderr == \"\" assertion in the cli-contract skill.

Governance updates (in-scope for Task 4-1)

  • pyproject.toml: rich>=13.0 added as a runtime dependency; uv.lock
    regenerated.
  • CLAUDE.md, docs/bpetite-prd-v2.md, docs/bpetite-task-list.md:
    runtime-dep rule updated from "regex only" to "regex and rich"
    consistently across all governance sources, with the constraint that
    rich is scoped to the CLI presentation layer and must never be
    imported from the core algorithm path.
  • .claude/skills/rich-cli/SKILL.md: new skill installed, plus a
    carve-out in the Forbidden Patterns table allowing branded ASCII
    banners when TTY-gated, width-gated, and loaded from a shipped asset.
  • .claude/skills/cli-contract/SKILL.md: Section 6 rewritten to describe
    the plain-lifecycle-lines architecture (with full rationale for not
    using rich.progress); Section 10 test assertions updated to
    \"Training started\" + \"Training complete\"; the pre-existing
    _run_training / (int, int) callback description was also corrected
    to match the actual train_bpe / ProgressEvent signature from
    _trainer.py.
  • .claude/skills/bpetite-conventions/SKILL.md and
    .claude/skills/task-executor/SKILL.md: runtime-dep rule updated to
    match.

Deleted:

  • BPETITE_ASCII.txt at the repo root — moved into
    src/bpetite/_banner.txt so the banner ships with the wheel and is
    loadable from the installed package.

Validation

  • uv run pytest — 180 passed
  • uv run ruff check . — clean
  • uv run ruff format --check . — clean
  • uv run mypy --strict — clean

Acceptance criteria verified one-by-one against the real CLI:

  1. train writes progress updates only to stderr — PASS (lifecycle lines
    on stderr, JSON summary on stdout)
  2. train writes a machine-readable JSON summary only to stdout — PASS
    (exactly the five required keys, sys.stdout.write path)
  3. encode writes a compact JSON array only to stdout — PASS
    ([104,105] no spaces, stderr 0 bytes on captured runs)
  4. decode writes raw decoded text only to stdout — PASS
    (end=\"\" equivalent, stderr 0 bytes on captured runs)
  5. Missing files, invalid UTF-8 inputs, unknown token IDs, and invalid
    decoded bytes fail non-zero and write only to stderr — PASS
    (10-case matrix including directory-path OSErrors, markup-bearing
    paths, CRLF inputs, and the Invalid vocab size path)

Review history

This PR went through five rounds of Codex review during implementation.
Each round's findings were verified and fixed in the same PR:

  1. Stderr decoration leaked on encode/decode success → gated on
    is_fully_interactive(). Save errors fired after training →
    pre-flight added.
  2. Path.read_text() stripped CRLF → switched to
    read_bytes().decode(\"utf-8\"). IsADirectoryError / PermissionError
    escaped as tracebacks → except OSError added across three helpers.
  3. is_terminal gate only checked stderr → extended to both streams.
    Rich markup parser ate bracket-bearing paths → Text(...) wrap in
    _kv_table and rich.markup.escape in render_error. Early-stop
    bar stuck at partial progress → total=merges_completed on complete
    event.
  4. Stale progress task on invalid vocab and 0/0 bar on zero-merge
    completion → lazy task creation.
  5. Start and zero-merge completion events had no visible stderr output
    → dropped rich.progress.Progress entirely in favor of plain
    styled lifecycle lines.

Interactive smoke-testing by hand confirmed the banner, panels, progress
lines, and error panels render as designed across train, encode,
and decode.

Risks / Follow-ups

@github-actions

github-actions Bot commented Apr 15, 2026

Copy link
Copy Markdown

bpetite workflows

Workflow Status Comment if failure and where
tests success ok
lint success ok
syntax success ok
format success ok
types success ok
build success ok
cli-smoke success ok
determinism success ok
policy-guard success ok
ci-meta pending waiting

PR #32: tracked workflows are still running.

@dinesh-git17 dinesh-git17 merged commit bc4b860 into main Apr 15, 2026
16 checks passed
@dinesh-git17 dinesh-git17 deleted the feat/cli-train-encode-decode branch April 15, 2026 07:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant