diff --git a/.claude/skills/bpetite-conventions/SKILL.md b/.claude/skills/bpetite-conventions/SKILL.md index 2272e53..82878a1 100644 --- a/.claude/skills/bpetite-conventions/SKILL.md +++ b/.claude/skills/bpetite-conventions/SKILL.md @@ -155,7 +155,7 @@ Each file is created (or primarily modified) by a specific task. Before editing ### Dependencies -- `regex` is the only runtime dependency beyond the standard library. +- `regex` and `rich` are the only runtime dependencies beyond the standard library. `regex` powers the pre-tokenizer; `rich` powers the CLI presentation layer (stderr-only) and does not touch the core algorithm or the public `Tokenizer` API. - `tiktoken` is declared as a dev-only dependency. It must never appear in the core library or CLI runtime import path. ### Typing and bytes diff --git a/.claude/skills/cli-contract/SKILL.md b/.claude/skills/cli-contract/SKILL.md index 23ce687..e7ca9a2 100644 --- a/.claude/skills/cli-contract/SKILL.md +++ b/.claude/skills/cli-contract/SKILL.md @@ -10,6 +10,14 @@ This skill encodes all non-negotiable rules for `src/bpetite/_cli.py` and and the task-list acceptance criteria for Tasks 4-1 and 4-2. Deviating from any of these rules will produce a test failure or a broken reviewer experience. +**Pair with `rich-cli`.** This skill governs the *contract* (which channel, +which format, which exit code). The `rich-cli` skill governs the *presentation* +(themes, progress bars, panels, error rendering). Load both when touching +`_cli.py`. When the two appear to conflict, the contract wins — any Rich +`Console` that writes to stdout must instead be constructed with `stderr=True`, +and every machine-readable result must still be written via raw +`sys.stdout.write` or `print` with no Rich markup. + --- ## 1. Channel Discipline — The One Rule That Breaks Everything @@ -122,23 +130,84 @@ integers. ## 6. Training Progress Output -Progress must be written to stderr only, never stdout. 
The callback-based -approach is described in Section 7; this section covers the output format. - -Required progress events and their stderr format: - -``` -Training started: vocab_size=512, corpus_bytes=1234567 -Merges completed: 50 -Merges completed: 100 -Merges completed: 150 -... -Training complete: actual_mergeable_vocab_size=512, elapsed_ms=4201.33 -``` - -Emit a "Merges completed" line at start (0), every 100 merges, and at -completion. The exact wording is flexible but must be human-readable and -grep-friendly. Do not use JSON for progress lines. +Progress is written to stderr only, never stdout. The `train` subcommand +emits three kinds of stderr output, in order: + +1. A **configuration panel** before training begins, titled `"Training"`, + containing a key/value grid with the input path, requested vocab size, + output path, and the `--force` flag state. Rendered via + `render_kv_box` on the shared stderr `Console` from `_ui.py`. + +2. **Plain styled lifecycle lines** driven by the + `_trainer.train_bpe` callback. The callback receives a + `ProgressEvent` with `kind` in `{"start", "merge", "complete"}`, + `merges_completed`, and `merges_planned`. Each event maps to exactly + one `console.print` line: + + - `kind="start"` — fires once before the merge loop begins. The line + is `"Training started: planned={merges_planned}"` with `info` style. + This is the load-bearing update for the `start` bullet of the + task-list progress rule. + - `kind="merge"` — fires every 100 completed merges. The line is + `"Training merges: {merges_completed} / {merges_planned}"` with + `info` style. This is the `every 100 merges completed` bullet. + - `kind="complete"` — fires once after the loop exits, whether the + run completed normally or early-stopped. The line is + `"Training complete: merges={merges_completed}"` with `success` + style. This is the `completion` bullet. + + Plain lines, not a Rich `Progress` bar, are used intentionally. 
An + earlier design used `rich.progress.Progress` with a lazy `TaskID` and + a `transient=False` bar. That design produced subtle rendering errors + on three edge cases: invalid `--vocab-size` left a half-initialized + task behind when `train_bpe` raised before emitting any event; + zero-merge runs (`--vocab-size 256` or an empty corpus) either + flashed a `0/N` bar or rendered nothing at all; early-stop runs + fought with `remove_task` and total-shrinking. Plain `console.print` + lines sidestep every edge case while still rendering beautifully + through the themed console. + +3. A **completion panel** after the lifecycle lines, titled + `"Training complete"`, containing the corpus bytes, requested vocab + size, actual mergeable vocab size, special-token count, elapsed ms, + and the saved path. Rendered via `render_kv_box` with a green + border. + +Rules for this surface: + +- Every stderr render must go through the shared `Console` from `_ui.py` + (the panels via `render_kv_box` and `render_error`, the lifecycle + lines via `console.print`). No Rich output may be constructed against + `stdout` or the default console. +- No bare `print()` calls anywhere. The panels, the lifecycle lines via + `console.print`, and the error surface are the only stderr rendering + paths. +- No JSON for any progress event. The stdout contract in Section 5 is + the only JSON surface. +- Do not reintroduce a live `Progress` bar. The three lifecycle lines + are the contract. If richer mid-training feedback is ever needed, + revisit this section first. + +Tests that assert on the progress surface should match stable text that +appears in the rendered output regardless of terminal width, ANSI escape +presence, or timing. The recommended substrings are: + +- `"Training started"` — appears on the start lifecycle line (and + nowhere else), so it is the cleanest signal that the start event was + emitted. 
+- `"Training complete"` — appears both on the complete lifecycle line + and in the completion-panel title. Presence confirms the run reached + completion. +- `"Training merges"` — appears on each per-100 merge line. Presence + confirms the mid-training event fired at least once; absence is + normal for runs that completed fewer than 100 merges. +- Human-readable field labels from the completion panel such as + `"Corpus bytes"`, `"Actual mergeable vocab size"`, `"Elapsed"`, and + `"Saved to"`. + +Do not assert on exact ANSI byte sequences, panel border characters, +semantic style names, or elapsed timings — those are unstable across +terminals, Rich versions, and run-to-run timing. --- @@ -152,55 +221,63 @@ The callback is threaded through internal functions only. ### Internal trainer function signature -In `src/bpetite/_trainer.py`, the internal training entry point accepts an -optional callback: +In `src/bpetite/_trainer.py`, the internal training entry point is +`train_bpe` and accepts an optional progress callback via the keyword-only +`progress` parameter: ```python -from typing import Callable +from bpetite._trainer import ProgressCallback, ProgressEvent, train_bpe -def _run_training( +def train_bpe( corpus: str, vocab_size: int, - progress_callback: Callable[[int, int], None] | None = None, -) -> tuple[dict[int, bytes], list[tuple[int, int]], dict[str, int]]: - ... + *, + progress: ProgressCallback | None = None, +) -> TrainerResult: ... ``` -The callback signature is `(merges_done: int, target_merges: int) -> None`. -The caller (CLI) provides the callback; the library default is `None`. 
+The callback signature is `Callable[[ProgressEvent], None]` where
+`ProgressEvent` is a frozen dataclass with three fields:
+
+- `kind: Literal["start", "merge", "complete"]`
+- `merges_completed: int`
+- `merges_planned: int`
+
+`"start"` fires once before the merge loop, `"merge"` fires every 100
+completed merges, and `"complete"` fires once after the loop exits (which
+may include an early stop, leaving `merges_completed < merges_planned`).
 
 ### Wiring in `_cli.py`
 
 ```python
-import sys, time
+import sys
 from bpetite import Tokenizer
-from bpetite._trainer import _run_training  # internal import — CLI only
+from bpetite._trainer import ProgressEvent, train_bpe  # internal, CLI only
 
-def _progress(done: int, total: int) -> None:
-    if done == 0:
-        print(f"Training started: target_merges={total}", file=sys.stderr)
-    elif done % 100 == 0:
-        print(f"Merges completed: {done}", file=sys.stderr)
+def _on_event(event: ProgressEvent) -> None:
+    # Render each event as one plain lifecycle line on the shared stderr
+    # console (see Section 6). Never write progress to stdout, and do not
+    # drive a live Rich Progress bar from this callback.
+    ...
 
-# CLI calls the internal function directly to inject the callback,
-# then constructs a Tokenizer from the returned state.
+result = train_bpe(corpus, vocab_size, progress=_on_event)
+tokenizer = Tokenizer(
+    vocab=dict(result.vocab),
+    merges=list(result.merges),
+    special_tokens=dict(result.special_tokens),
+)
 ```
 
-The public `Tokenizer.train` calls `_run_training` internally with
-`progress_callback=None`. The CLI bypasses the public method and calls
-`_run_training` directly, then wraps the result in a `Tokenizer` instance
-via a private constructor or `load`-equivalent path.
-
-If the `Tokenizer` class has a private `_from_state` classmethod or
-equivalent for constructing from raw vocab/merges/special_tokens, use that.
-If it does not exist yet, add it as an internal helper (underscore-prefixed,
-not part of the public API contract).
+The public `Tokenizer.train` calls `train_bpe` internally with +`progress=None`. The CLI bypasses `Tokenizer.train`, calls `train_bpe` +directly with a callback, and then constructs a `Tokenizer` via the +existing `__init__` which already accepts the raw vocab/merges/special-token +state — no separate `_from_state` classmethod is required. ### Why the callback cannot go on `Tokenizer.train` -FR-30 enumerates exactly five public methods. Adding `progress_callback` to -`Tokenizer.train` would change the public API contract. The CLI is the only -caller that needs progress output; the library should remain clean. +FR-30 enumerates exactly five public methods. Adding a `progress` parameter +to `Tokenizer.train` would change the public API contract. The CLI is the +only caller that needs progress output; the library should remain clean. --- @@ -333,8 +410,15 @@ assert len(result.stderr) > 0 # error message is on stderr - `decode` unknown token ID: returncode nonzero, stdout empty. - `decode` token sequence producing invalid UTF-8: returncode nonzero, stdout empty. -- `train` progress lines appear on stderr (check `"Merges"` or equivalent - substring in stderr when training completes at least one merge). +- `train` progress surface appears on stderr. Assert that stderr contains + both `"Training started"` (from the start lifecycle line, confirming + the start event fired) and `"Training complete"` (from the complete + lifecycle line and the completion panel title, confirming the run + reached the end) after a successful run. These two substrings are the + contract-level evidence that start and completion progress updates + appeared on stderr. Do not assert on ANSI escape sequences, panel + border glyphs, semantic style names, or elapsed timings — those are + unstable across terminals and Rich versions. - `encode` output uses compact separators (no spaces in JSON array). 
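The progress and compact-JSON checks in the list above can be sketched with the standard library alone. This is a minimal illustration, not the project's test code: the `ProgressEvent` class here is a stand-in that only mirrors the three fields described in Section 7, and `lifecycle_line`, `strip_ansi`, `assert_progress_surface`, and `compact_json` are hypothetical helper names. The line formats are the ones listed in Section 6; the colour codes in the simulated stderr are invented.

```python
import json
import re
from dataclasses import dataclass
from typing import Literal

# Matches ANSI CSI sequences such as "\x1b[1;32m" so tests can ignore styling.
_ANSI_RE = re.compile(r"\x1b\[[0-9;?]*[ -/]*[@-~]")


@dataclass(frozen=True)
class ProgressEvent:
    """Stand-in with the three fields described in Section 7 (not the real class)."""

    kind: Literal["start", "merge", "complete"]
    merges_completed: int
    merges_planned: int


def lifecycle_line(event: ProgressEvent) -> str:
    """Map one event to the plain-text line formats listed in Section 6."""
    if event.kind == "start":
        return f"Training started: planned={event.merges_planned}"
    if event.kind == "merge":
        return f"Training merges: {event.merges_completed} / {event.merges_planned}"
    return f"Training complete: merges={event.merges_completed}"


def strip_ansi(text: str) -> str:
    """Remove ANSI CSI escape sequences from captured terminal output."""
    return _ANSI_RE.sub("", text)


def assert_progress_surface(stderr: str) -> None:
    """Check the two contract-level substrings for a successful train run."""
    plain = strip_ansi(stderr)
    assert "Training started" in plain
    assert "Training complete" in plain


def compact_json(ids: list[int]) -> str:
    """Serialize token ids as a JSON array with no spaces."""
    return json.dumps(ids, separators=(",", ":"))


# Simulated stderr for a run that planned 244 merges (values are illustrative).
events = [
    ProgressEvent("start", 0, 244),
    ProgressEvent("merge", 100, 244),
    ProgressEvent("merge", 200, 244),
    ProgressEvent("complete", 244, 244),
]
fake_stderr = "".join(f"\x1b[36m{lifecycle_line(e)}\x1b[0m\n" for e in events)
assert_progress_surface(fake_stderr)
```

In a real test the `stderr` string would come from the captured output of the `train` subcommand rather than from a simulated event list. Stripping escapes before matching is one way to keep the assertions independent of terminal styling.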
--- @@ -352,5 +436,5 @@ Run through this mentally before calling a CLI implementation done: - [ ] `--force` maps to `overwrite=args.force` on `save`. - [ ] All known exception types are caught and result in `sys.exit(1)`. - [ ] argparse errors exit with code 2 (default, do not override). -- [ ] `progress_callback` is not on `Tokenizer.train`'s public signature. -- [ ] CLI imports `_run_training` from `bpetite._trainer` for callback wiring. +- [ ] `progress` is not on `Tokenizer.train`'s public signature. +- [ ] CLI imports `train_bpe` and `ProgressEvent` from `bpetite._trainer` for callback wiring. diff --git a/.claude/skills/rich-cli/SKILL.md b/.claude/skills/rich-cli/SKILL.md new file mode 100644 index 0000000..c09a14a --- /dev/null +++ b/.claude/skills/rich-cli/SKILL.md @@ -0,0 +1,362 @@ +--- +name: rich-cli +description: "Design beautiful, developer-friendly Python CLI output using Rich only. Activate this skill whenever you are writing Python code that involves terminal output, console presentation, CLI visual layout, progress reporting, status messages, warning or error display, tables, panels, banners, or any visual formatting in a command-line tool. Trigger immediately on any Rich import, console.print usage, requests to make the output nicer, add a progress bar, show a status message, display results as a table, add color to the CLI, format the terminal output, or any work touching how a Python tool communicates visually with the user. Do NOT consult this skill for Textual TUIs, prompt_toolkit interactive prompts, or curses-based screen management. Do consult it for everything else terminal-presentation-related in Python." +--- + +# Rich CLI Design Skill + +You are a senior Python CLI engineer with strong design taste. When this skill is active, your job is to produce terminal output that a thoughtful senior engineer would be genuinely pleased to use every day. 
That means output that is clean, structured, readable, and visually calm — not output that shows off. + +**Hard constraint**: Use **Rich only** for terminal presentation. No Textual, Blessed, prompt_toolkit (for visuals), curses, or mixed libraries. + +--- + +## Design Philosophy + +Great CLI output follows a clear hierarchy of values: + +1. **Readable before decorative** — if decoration reduces clarity, remove it +2. **Structured before dense** — group related information; never dump walls of text +3. **Consistent before novel** — visual conventions must hold across every command and state +4. **Calm before clever** — the terminal should feel like a professional tool, not a demo reel +5. **Signal before style** — color and borders exist to improve comprehension, not to fill space + +The bar is a serious internal developer tool that happens to have excellent taste. Think of tools like `uv`, `cargo build`, or `gh` — clear, fast, honest, occasionally beautiful. + +--- + +## Rich Component Guide + +### Console — the single source of truth + +Always create one shared `Console` instance for the application, ideally in a dedicated `ui.py` or `console.py` module. Never instantiate `Console` multiple times across the codebase. + +```python +from rich.console import Console +from rich.theme import Theme + +THEME = Theme({ + "info": "cyan", + "success": "bold green", + "warning": "bold yellow", + "error": "bold red", + "muted": "dim white", + "heading": "bold white", +}) + +console = Console(theme=THEME) +``` + +Always use **semantic theme names** in markup (`[success]`, `[warning]`), never raw color strings (`[bold green]`). This is the difference between a maintainable codebase and a color-spelunking nightmare. + +--- + +### Theme — semantic color mapping + +Map colors to meaning, not to decoration. 
The canonical mapping: + +| Semantic name | Meaning | Suggested style | +| ------------- | ----------------------------------- | --------------- | +| `success` | Operation completed successfully | `bold green` | +| `warning` | Non-fatal issue, user should notice | `bold yellow` | +| `error` | Operation failed | `bold red` | +| `info` | Neutral informational message | `cyan` | +| `muted` | Secondary, supporting detail | `dim white` | +| `heading` | Section or panel title | `bold white` | + +Never assign colors arbitrarily. If a color is on screen, it must carry meaning. + +--- + +### Panel — use sparingly and purposefully + +Panels are for important, bounded messages: errors, completion summaries, configuration previews. They are not for every output line. + +**Good panel usage:** + +- Final success or failure summary +- Error with context (what failed, why, what to do) +- Configuration or environment preview +- Validation report + +**Do not use panels for:** + +- Simple status messages +- Individual log lines +- Informational text that reads fine as prose +- Decorating output that does not need a border + +```python +# Correct: panel for a meaningful boundary +from rich.panel import Panel + +console.print(Panel( + "[success]Build completed[/success]\n[muted]3 files written to dist/[/muted]", + title="[heading]Done[/heading]", + border_style="green", + padding=(1, 2), +)) + +# Wrong: panel for a simple message +console.print(Panel("Loading config...")) # just use console.print +``` + +Prefer `ROUNDED` or `SIMPLE` border styles. Avoid `HEAVY` and `DOUBLE` — they look aggressive and add noise without adding information. + +--- + +### Table — structured data only + +Use `Table` when presenting two or more columns of data that benefit from alignment. Never use a table for a single list of values (a `Rule` + plain lines are cleaner). Never use a table to show fewer than three rows unless alignment is genuinely helpful. 
+ +```python +from rich.table import Table + +table = Table( + title="Validation Results", + border_style="dim", + show_lines=False, # reduce visual noise + header_style="heading", +) +table.add_column("File", style="muted", no_wrap=True) +table.add_column("Status", justify="center") +table.add_column("Issues", justify="right", style="muted") + +table.add_row("auth.py", "[success]Pass[/success]", "0") +table.add_row("config.py", "[warning]Warn[/warning]", "2") +table.add_row("api.py", "[error]Fail[/error]", "5") + +console.print(table) +``` + +Keep column count to what the engineer actually needs. Default to `show_lines=False` (row separators add visual noise in most cases). Use `justify="right"` for numeric columns. + +--- + +### Progress — for trackable work + +Use `Progress` when the total number of steps is known or estimable. Compose only the columns that are meaningful. + +```python +from rich.progress import ( + Progress, SpinnerColumn, BarColumn, + TextColumn, TimeRemainingColumn, TaskProgressColumn, +) + +with Progress( + SpinnerColumn(), + TextColumn("[info]{task.description}[/info]"), + BarColumn(bar_width=40), + TaskProgressColumn(), + TimeRemainingColumn(), + console=console, +) as progress: + task = progress.add_task("Compiling modules", total=len(modules)) + for module in modules: + compile_module(module) + progress.advance(task) +``` + +Rules for progress bars: + +- Always pass the application `console` instance so output does not split +- Always include a human-readable description on the task (not just "Processing...") +- Use `TimeRemainingColumn` when total work is known; omit it for indeterminate tasks +- Use `track()` as a shorthand only for simple single-task loops + +--- + +### Status — for indeterminate waits + +Use `console.status()` when you cannot estimate completion time. A spinner communicates "something is happening" without lying about duration. 
+ +```python +with console.status("[info]Fetching remote config...[/info]", spinner="dots"): + config = fetch_config() +``` + +Prefer the `dots` or `line` spinner styles. Avoid `bouncingBar`, `pong`, or similar playful styles — they look juvenile in serious tools. Do not keep the status active after the operation completes; exit the context manager promptly and follow with a completion message. + +--- + +### Rule — visual section separators + +`Rule` is a clean, lightweight way to separate sections without the weight of a Panel. + +```python +from rich.rule import Rule + +console.print(Rule("[heading]Build Phase[/heading]", style="dim")) +``` + +Use `Rule` to separate logical phases of output. Keep titles short and purposeful. Do not use Rules to decorate every few lines — they lose meaning when overused. + +--- + +### Columns — horizontal layout + +`Columns` is useful for presenting parallel panels (like a dashboard) or lists of items side by side. Use it rarely and only when horizontal grouping genuinely aids scanning. + +```python +from rich.columns import Columns +from rich.panel import Panel + +metrics = [ + Panel("[success]42[/success]", title="Tests passed"), + Panel("[warning]3[/warning]", title="Warnings"), + Panel("[error]0[/error]", title="Errors"), +] +console.print(Columns(metrics, equal=True, expand=True)) +``` + +--- + +### Error and Warning Presentation + +Errors and warnings deserve structure because the user needs to act on them. Always provide: + +1. What happened (the state) +2. Where it happened (file, line, context — if applicable) +3. Why it matters (consequence) +4. 
What to do (resolution path, if you know it) + +```python +# Error panel pattern +console.print(Panel( + "[error]Could not connect to database.[/error]\n\n" + "[muted]Host:[/muted] localhost:5432\n" + "[muted]Reason:[/muted] Connection refused\n\n" + "[info]Check that the database is running and DATABASE_URL is set correctly.[/info]", + title="[error]Connection Failed[/error]", + border_style="red", + padding=(1, 2), +)) + +# Warning inline pattern (no panel needed for non-fatal warnings) +console.print("[warning]Warning:[/warning] config.yaml is missing 'timeout' — defaulting to 30s") +``` + +Use a Panel for errors that stop execution. Use an inline message for warnings that allow execution to continue. + +--- + +### Logging Integration + +When the application uses Python's `logging` module, replace the default handler with `RichHandler`: + +```python +import logging +from rich.logging import RichHandler + +logging.basicConfig( + level=logging.INFO, + format="%(message)s", + handlers=[RichHandler(console=console, show_path=False, rich_tracebacks=True)], +) +log = logging.getLogger("myapp") +``` + +Always pass the shared `console` instance. Set `show_path=False` for cleaner log lines in production tools; enable it in debug modes. + +--- + +## Presentation Architecture + +Separate presentation from business logic. 
The pattern that scales:
+
+```
+myapp/
+  cli.py          # argument parsing (Click, Typer, argparse)
+  ui.py           # Console instance, Theme, shared render helpers
+  commands/
+    build.py      # business logic + calls to ui helpers
+    deploy.py
+```
+
+Write reusable render helpers for patterns that appear more than once:
+
+```python
+# ui.py
+def print_success(message: str, detail: str = "") -> None:
+    body = f"[success]{message}[/success]"
+    if detail:
+        body += f"\n[muted]{detail}[/muted]"
+    console.print(Panel(body, border_style="green", padding=(0, 2)))
+
+def print_error(message: str, hint: str = "") -> None:
+    body = f"[error]{message}[/error]"
+    if hint:
+        body += f"\n[info]{hint}[/info]"
+    console.print(Panel(body, title="[error]Error[/error]", border_style="red", padding=(1, 2)))
+```
+
+Business logic modules call helpers; they do not manipulate Rich objects directly.
+
+---
+
+## Forbidden Patterns
+
+Never produce the following:
+
+| Pattern                                            | Why it is wrong                          |
+| -------------------------------------------------- | ---------------------------------------- |
+| Multiple `Console()` instances                     | Output splits; styles conflict           |
+| Raw color strings in markup (`[bold green]`)       | Unmaintainable; breaks theming           |
+| Panel for every output line                        | Visual noise; panels lose meaning        |
+| Deeply nested panels                               | Illegible; fighting the terminal width   |
+| Rainbow color usage (6+ distinct colors on screen) | Destroys the signal-to-noise ratio       |
+| `console.print` inside business logic              | Breaks separation of concerns            |
+| Progress bar with fake/static total                | Misleads the user                        |
+| Emojis as primary status indicators                | Breaks on some terminals; looks juvenile |
+| Walls of unstructured colored text                 | No better than raw print statements      |
+| Gratuitous banners or ASCII art headers            | Wastes vertical space; looks amateurish  |
+| `HEAVY` or `DOUBLE` border styles in normal UI     | Visually aggressive; unnecessary         |
+| Random capitalization patterns across commands     | Inconsistent; unprofessional             |
+| Different visual styles per command                | Breaks cohesion across the tool          |
+
+**Exception: branded banners.** A single ASCII-art banner that represents
+the tool's own brand identity (not decoration, not clip art) is allowed when
+all of the following hold: it renders only on stderr, it is gated on
+`console.is_terminal` so piped/redirected output stays quiet, it is gated on
+a minimum terminal width so narrow terminals fall back gracefully, and the
+art is loaded from a shipped asset file rather than embedded as a multi-line
+string literal. The test is "does this banner tell the user which tool
+they're running?" If yes, it is identity and allowed; if it is a fancy
+border or a cute welcome message, it is decoration and still forbidden.
+
+---
+
+## Pre-Return Checklist
+
+Before returning any Rich CLI output code, silently verify:
+
+- [ ] Is there exactly one `Console` instance used throughout?
+- [ ] Are all styles defined via a `Theme` and referenced by semantic name?
+- [ ] Does every color on screen carry a specific meaning?
+- [ ] Is every `Panel` justified (does the bounded structure add value)?
+- [ ] Are progress bars and status spinners wired to the shared console?
+- [ ] Is presentation code separated from business logic?
+- [ ] Is the output readable in a real terminal at 80-120 character width?
+- [ ] Would a senior engineer consider this clean and pleasant to use daily?
+- [ ] Are there zero em dashes in any prose, comments, or strings produced?
+- [ ] Does the output scale as the CLI grows (no one-off style hacks)?
+
+If any answer is no, revise before returning.
+ +--- + +## Quick Decision Reference + +| Situation | Rich pattern to use | +| ----------------------------------- | ------------------------------------------------ | +| Single-line status message | `console.print("[info]...[/info]")` | +| Long-running task with known steps | `Progress` with `BarColumn` | +| Long-running task, unknown duration | `console.status()` with `dots` spinner | +| Fatal error with context | `Panel` with `border_style="red"` | +| Non-fatal warning | Inline `console.print("[warning]...[/warning]")` | +| Structured multi-column data | `Table` with semantic column styles | +| Section separator | `Rule` with a short title | +| Final summary (success/fail) | `Panel` with semantic border color | +| Parallel stats (e.g. dashboard) | `Columns` of small `Panel` items | +| Log output | `RichHandler` wired to shared console | +| Multiple phases of output | `Rule` between phases, consistent width | diff --git a/.claude/skills/task-executor/SKILL.md b/.claude/skills/task-executor/SKILL.md index 0d75452..8d3cb35 100644 --- a/.claude/skills/task-executor/SKILL.md +++ b/.claude/skills/task-executor/SKILL.md @@ -60,7 +60,7 @@ These rules come directly from the task list's Non-Negotiable Implementation Rul - Python 3.12 only. macOS and Linux only. - Core algorithm code must remain pure Python. No Rust, no C extensions, no external tokenizer libraries. -- `regex` is the only permitted runtime dependency beyond the standard library. +- `regex` and `rich` are the only permitted runtime dependencies beyond the standard library (`rich` is scoped to the CLI presentation layer and must never appear in the core algorithm import path). - No normalization, case folding, prefix-space insertion, or whitespace trimming anywhere in the pipeline. - `vocab_size` always refers to mergeable vocabulary size, excluding reserved special tokens. - The only reserved special token in v1 is the exact literal `<|endoftext|>`. 
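
One way to guard the dependency rule above (that `rich` and `tiktoken` stay out of the core algorithm import path) is a static scan of module source. The following is a sketch under stated assumptions: `find_forbidden_imports` is a hypothetical helper, not part of the repository, and it only sees `import` and `from ... import` statements present in the syntax tree, so dynamic imports through `importlib` would go unnoticed.

```python
import ast


def find_forbidden_imports(source: str, forbidden: set[str]) -> list[str]:
    """Return the forbidden top-level packages imported anywhere in `source`."""
    found: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # Absolute "from package.sub import name" statements only.
            names = [node.module]
        else:
            continue
        for name in names:
            root = name.split(".")[0]
            if root in forbidden:
                found.add(root)
    return sorted(found)


# Invented source snippets standing in for a core module and a CLI module.
core_like = "import regex\nfrom collections import Counter\n"
cli_like = "import sys\nfrom rich.console import Console\n"

assert find_forbidden_imports(core_like, {"rich", "tiktoken"}) == []
assert find_forbidden_imports(cli_like, {"rich", "tiktoken"}) == ["rich"]
```

In a real test the source strings would be read from the core modules on disk; which files count as core (for example `_trainer.py` and `_encoder.py`) is an assumption to confirm against the project layout.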
diff --git a/CLAUDE.md b/CLAUDE.md index 665d340..fbbeb6c 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -83,7 +83,7 @@ Do not skip any. Do not treat a subset as sufficient. - Internal modules use underscore prefix (`_trainer.py`, `_encoder.py`, `_persistence.py`, `_cli.py`). - The only public export is `Tokenizer`. Internal modules are not part of the public API. - `pyproject.toml` is the single source for package and tool configuration. -- Core library has one runtime dependency beyond stdlib: `regex`. No C extensions, no Rust, no external tokenizer libraries. +- The project has two runtime dependencies beyond stdlib: `regex` (pre-tokenizer) and `rich` (CLI presentation layer only, stderr-only). `rich` must never be imported from the core algorithm or public `Tokenizer` path. No C extensions, no Rust, no external tokenizer libraries. ## Testing @@ -112,6 +112,7 @@ Skills live in `.claude/skills/`. These are mandatory at their defined triggers | `bpe-algorithm` | Working on `_trainer.py`, `_encoder.py`, or related tests | BPE behavioral contract: pair counting, tie-breaking, merge application | | `artifact-schema` | Working on `_persistence.py` or `test_persistence.py` | JSON schema v1, atomic save, loader validation | | `cli-contract` | Working on `_cli.py` or `test_cli.py` | stdout/stderr discipline, exit codes, argparse patterns | +| `rich-cli` | Visual/UX work on `_cli.py`: Rich output, progress, panels, tracebacks | Rich-only presentation rules: themes, progress bars, panels, errors | | `pytest-conventions` | Writing or editing any file in `tests/` | Naming, parametrize, fixture patterns, import mode | | `task-executor` | Starting any task from the bpetite task list | Read task → confirm deps → implement → quality gate → verify acceptance criteria | | `commitall` | User runs `/commitall` or any intent to commit working changes | Audit working tree → draft Conventional Commits message → hand off → verify | diff --git a/docs/bpetite-prd-v2.md 
b/docs/bpetite-prd-v2.md index e4ce306..688f778 100644 --- a/docs/bpetite-prd-v2.md +++ b/docs/bpetite-prd-v2.md @@ -84,7 +84,7 @@ Most engineers use tokenizers as opaque dependencies and cannot reason about the ## Constraints - Core algorithm implementation must be pure Python; no Rust bindings, no C extensions, no external tokenizer libraries in the implementation path. -- The only runtime dependency beyond the standard library is `regex`. +- The only runtime dependencies beyond the standard library are `regex` (pre-tokenizer) and `rich` (CLI presentation layer only — must not appear in the core algorithm or `Tokenizer` import path). - `vocab_size` refers only to mergeable vocabulary size and excludes reserved special tokens. - The artifact format must be a single JSON file. - Supported platforms for v1 are macOS and Linux. Windows is not a supported execution target for the provided shell scripts. diff --git a/docs/bpetite-task-list.md b/docs/bpetite-task-list.md index 29e2055..f448a3d 100644 --- a/docs/bpetite-task-list.md +++ b/docs/bpetite-task-list.md @@ -16,7 +16,7 @@ If this document conflicts with the PRD, the PRD wins. - Python 3.12 is the only supported interpreter for v1. - macOS and Linux are the only supported execution targets for v1. - Core algorithm code must remain pure Python. -- `regex` is the only runtime dependency beyond the standard library. +- `regex` and `rich` are the only runtime dependencies beyond the standard library. `regex` powers the pre-tokenizer; `rich` powers the CLI presentation layer (stderr-only) and must never be imported from the core algorithm path. - Development dependencies must be declared as local development dependencies, not published extras. - No task may introduce normalization, case folding, prefix-space insertion, or whitespace trimming anywhere in the pipeline. - `vocab_size` always refers to mergeable vocabulary size and excludes reserved special tokens. 
diff --git a/pyproject.toml b/pyproject.toml index 97c366e..3af763b 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -7,7 +7,7 @@ name = "bpetite" version = "0.1.0" description = "Deterministic byte-level BPE tokenizer." requires-python = ">=3.12" -dependencies = ["regex"] +dependencies = ["regex", "rich>=13.0"] [project.scripts] bpetite = "bpetite._cli:main" diff --git a/src/bpetite/_banner.txt b/src/bpetite/_banner.txt new file mode 100644 index 0000000..291007c --- /dev/null +++ b/src/bpetite/_banner.txt @@ -0,0 +1,13 @@ + _____ _____ +( ___ ) ( ___ ) + | |~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~| | + | | ███████████ ███████████ ██████████ ███████████ █████ ███████████ ██████████ | | + | | ░░███░░░░░███░░███░░░░░███░░███░░░░░█░█░░░███░░░█░░███ ░█░░░███░░░█░░███░░░░░█ | | + | | ░███ ░███ ░███ ░███ ░███ █ ░ ░ ░███ ░ ░███ ░ ░███ ░ ░███ █ ░ | | + | | ░██████████ ░██████████ ░██████ ░███ ░███ ░███ ░██████ | | + | | ░███░░░░░███ ░███░░░░░░ ░███░░█ ░███ ░███ ░███ ░███░░█ | | + | | ░███ ░███ ░███ ░███ ░ █ ░███ ░███ ░███ ░███ ░ █ | | + | | ███████████ █████ ██████████ █████ █████ █████ ██████████ | | + | | ░░░░░░░░░░░ ░░░░░ ░░░░░░░░░░ ░░░░░ ░░░░░ ░░░░░ ░░░░░░░░░░ | | + |___|~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~|___| +(_____) (_____) diff --git a/src/bpetite/_cli.py b/src/bpetite/_cli.py index b1c74ba..4c2a4ce 100644 --- a/src/bpetite/_cli.py +++ b/src/bpetite/_cli.py @@ -1 +1,367 @@ -"""Command-line interface for the bpetite tokenizer.""" +"""Command-line interface for the bpetite tokenizer. + +Exposes three subcommands — ``train``, ``encode``, and ``decode`` — wired +into a single ``main`` entry point advertised as the ``bpetite`` console +script in ``pyproject.toml``. 
+ +Channel discipline is the one rule that matters everywhere in this module: + +* Every machine-readable result (``train`` JSON summary, ``encode`` compact + JSON array, ``decode`` raw text) is written with ``sys.stdout.write`` so + no Rich markup or theming can bleed into the stdout contract fixed by + FR-33 and FR-34. +* Every human-readable element (banner, configuration panels, progress + lines, completion summaries, error messages) is routed through the + shared stderr :data:`~bpetite._ui.console`. + +The progress callback for training is threaded through the internal +:func:`bpetite._trainer.train_bpe` entry point rather than the public +``Tokenizer.train`` method: FR-30 pins the public method signature to +``(corpus, vocab_size)`` exactly, so the callback wiring lives in the CLI. +""" + +import argparse +import json +import sys +import time +from pathlib import Path +from typing import NoReturn + +from bpetite import Tokenizer +from bpetite._trainer import ProgressEvent, TrainerResult, train_bpe +from bpetite._ui import ( + console, + is_fully_interactive, + render_banner, + render_error, + render_kv_box, +) + +_TEXT_PREVIEW_LIMIT = 80 + + +def main() -> None: + """Parse arguments and dispatch to the selected subcommand.""" + parser = _build_parser() + args = parser.parse_args() + + if args.command == "train": + _cmd_train(args) + elif args.command == "encode": + _cmd_encode(args) + elif args.command == "decode": + _cmd_decode(args) + + +def _build_parser() -> argparse.ArgumentParser: + parser = argparse.ArgumentParser( + prog="bpetite", + description="Deterministic byte-level BPE tokenizer.", + ) + sub = parser.add_subparsers(dest="command", required=True) + + p_train = sub.add_parser("train", help="Train a BPE tokenizer from a corpus.") + p_train.add_argument("--input", required=True, help="UTF-8 training corpus path.") + p_train.add_argument( + "--vocab-size", + type=int, + required=True, + help="Target mergeable vocabulary size (>= 256).", + ) +
p_train.add_argument("--output", required=True, help="Artifact destination path.") + p_train.add_argument( + "--force", + action="store_true", + help="Overwrite the output artifact if it already exists.", + ) + + p_enc = sub.add_parser("encode", help="Encode text into a token id sequence.") + p_enc.add_argument("--model", required=True, help="Schema v1 artifact path.") + p_enc.add_argument("--text", required=True, help="UTF-8 text to encode.") + + p_dec = sub.add_parser("decode", help="Decode token ids into text.") + p_dec.add_argument("--model", required=True, help="Schema v1 artifact path.") + p_dec.add_argument( + "--ids", + nargs="+", + type=int, + required=True, + help="Space-separated token ids.", + ) + + return parser + + +def _cmd_train(args: argparse.Namespace) -> None: + render_banner() + render_kv_box( + rows=[ + ("Input", str(args.input)), + ("Vocab size", str(args.vocab_size)), + ("Output", str(args.output)), + ("Force overwrite", "yes" if args.force else "no"), + ], + title="Training", + ) + + _check_output_path(args.output, force=bool(args.force)) + + corpus = _read_corpus_or_exit(args.input) + corpus_bytes = len(corpus.encode("utf-8")) + + t0 = time.perf_counter() + result = _train_with_progress(corpus, args.vocab_size) + elapsed_ms = round((time.perf_counter() - t0) * 1000, 2) + + tokenizer = Tokenizer( + vocab=dict(result.vocab), + merges=list(result.merges), + special_tokens=dict(result.special_tokens), + ) + _save_or_exit(tokenizer, args.output, overwrite=bool(args.force)) + + render_kv_box( + rows=[ + ("Corpus bytes", f"{corpus_bytes:,}"), + ("Requested vocab size", str(args.vocab_size)), + ("Actual mergeable vocab size", str(result.mergeable_vocab_size)), + ("Special tokens", str(len(result.special_tokens))), + ("Elapsed", f"{elapsed_ms:.2f} ms"), + ("Saved to", str(args.output)), + ], + title="Training complete", + border_style="green", + ) + + summary = { + "corpus_bytes": corpus_bytes, + "requested_vocab_size": int(args.vocab_size), + 
"actual_mergeable_vocab_size": result.mergeable_vocab_size, + "special_token_count": len(result.special_tokens), + "elapsed_ms": elapsed_ms, + } + sys.stdout.write(json.dumps(summary) + "\n") + sys.stdout.flush() + + +def _cmd_encode(args: argparse.Namespace) -> None: + interactive = is_fully_interactive() + if interactive: + render_banner() + render_kv_box( + rows=[ + ("Model", str(args.model)), + ("Text", _truncate(str(args.text), _TEXT_PREVIEW_LIMIT)), + ], + title="Encoding", + ) + + tokenizer = _load_model_or_exit(args.model) + + t0 = time.perf_counter() + ids = tokenizer.encode(args.text) + elapsed_ms = round((time.perf_counter() - t0) * 1000, 2) + + if interactive: + render_kv_box( + rows=[ + ("Tokens", str(len(ids))), + ("Elapsed", f"{elapsed_ms:.2f} ms"), + ], + title="Encoded", + border_style="green", + ) + + sys.stdout.write(json.dumps(ids, separators=(",", ":")) + "\n") + sys.stdout.flush() + + +def _cmd_decode(args: argparse.Namespace) -> None: + interactive = is_fully_interactive() + if interactive: + render_banner() + render_kv_box( + rows=[ + ("Model", str(args.model)), + ("Token count", str(len(args.ids))), + ], + title="Decoding", + ) + + tokenizer = _load_model_or_exit(args.model) + + t0 = time.perf_counter() + try: + text = tokenizer.decode(args.ids) + except KeyError as exc: + _fail( + title="Unknown token id", + message=f"Token id {exc.args[0]} is not in the model's vocabulary.", + hint="Every id passed to --ids must exist in the loaded model.", + ) + except UnicodeDecodeError as exc: + _fail( + title="Invalid UTF-8 in decode", + message=f"Decoded bytes are not valid UTF-8: {exc}", + hint="This token sequence is incomplete or does not form valid text.", + ) + elapsed_ms = round((time.perf_counter() - t0) * 1000, 2) + + if interactive: + render_kv_box( + rows=[ + ("Characters", str(len(text))), + ("Elapsed", f"{elapsed_ms:.2f} ms"), + ], + title="Decoded", + border_style="green", + ) + + sys.stdout.write(text) + sys.stdout.flush() + + +def 
_check_output_path(path: str, *, force: bool) -> None: + """Fail fast if the destination is already blocking the save. + + Both checks are cheaply verifiable before training starts. Running them + up-front prevents spending minutes on a training pass only to discover + that the ``--output`` path was already taken or its parent directory + was missing. The save call at the end of ``train`` still catches the + same conditions as a safety net for race conditions. + """ + output_path = Path(path) + if output_path.exists() and not force: + _fail( + title="Save blocked", + message=f"{path} already exists.", + hint="Re-run with --force to overwrite, or choose a different --output.", + ) + if not output_path.parent.exists(): + _fail( + title="Save failed", + message=f"Parent directory of {path} does not exist.", + hint="Create the parent directory before running train.", + ) + + +def _read_corpus_or_exit(path: str) -> str: + try: + return Path(path).read_bytes().decode("utf-8") + except FileNotFoundError: + _fail( + title="Input not found", + message=f"No such file: {path}", + hint="Check the --input path and try again.", + ) + except OSError as exc: + _fail( + title="Input unreadable", + message=f"Cannot read {path}: {exc}", + hint="Check that --input is a regular file and you have read permission.", + ) + except UnicodeDecodeError as exc: + _fail( + title="Invalid UTF-8 corpus", + message=f"{path} is not valid UTF-8: {exc}", + hint="bpetite reads training corpora with strict UTF-8 decoding.", + ) + + +def _train_with_progress(corpus: str, vocab_size: int) -> TrainerResult: + """Run ``train_bpe`` with plain styled lifecycle lines on stderr. + + The callback emits three kinds of lines on the shared stderr + ``Console``, matching the task-list requirement that ``train`` write + progress updates at start, every 100 completed merges, and + completion. 
Rich ``Progress`` is intentionally not used: its live + display produced subtle rendering errors on zero-merge runs, + early-stop runs, and invalid-vocab runs. Plain ``console.print`` + lines sidestep every edge case while still rendering beautifully + through the themed console. + """ + + def _on_event(event: ProgressEvent) -> None: + if event.kind == "start": + console.print( + f"[info]Training started: planned={event.merges_planned}[/info]" + ) + elif event.kind == "merge": + console.print( + f"[info]Training merges: {event.merges_completed}" + f" / {event.merges_planned}[/info]" + ) + else: # complete + console.print( + f"[success]Training complete: merges={event.merges_completed}[/success]" + ) + + try: + return train_bpe(corpus, vocab_size, progress=_on_event) + except ValueError as exc: + _fail( + title="Invalid vocab size", + message=str(exc), + hint="--vocab-size must be at least 256.", + ) + + +def _save_or_exit(tokenizer: Tokenizer, path: str, *, overwrite: bool) -> None: + try: + tokenizer.save(path, overwrite=overwrite) + except FileExistsError: + _fail( + title="Save blocked", + message=f"{path} already exists.", + hint="Re-run with --force to overwrite.", + ) + except FileNotFoundError: + _fail( + title="Save failed", + message=f"Parent directory of {path} does not exist.", + hint="Create the parent directory before running train.", + ) + except OSError as exc: + _fail( + title="Save failed", + message=f"Cannot write {path}: {exc}", + hint="Check filesystem permissions and that --output is not a directory.", + ) + + +def _load_model_or_exit(path: str) -> Tokenizer: + try: + return Tokenizer.load(path) + except FileNotFoundError: + _fail( + title="Model not found", + message=f"No such file: {path}", + hint="Pass --model pointing at a Schema v1 artifact.", + ) + except OSError as exc: + _fail( + title="Model unreadable", + message=f"Cannot read {path}: {exc}", + hint="Check that --model is a regular file and you have read permission.", + ) + except 
(KeyError, ValueError) as exc: + _fail( + title="Model load failed", + message=str(exc), + hint=f"Verify {path} is a valid Schema v1 bpetite artifact.", + ) + + +def _fail(*, title: str, message: str, hint: str | None = None) -> NoReturn: + render_error(title=title, message=message, hint=hint) + sys.exit(1) + + +def _truncate(value: str, limit: int) -> str: + if len(value) <= limit: + return value + return value[: limit - 3] + "..." + + +if __name__ == "__main__": + main() diff --git a/src/bpetite/_ui.py b/src/bpetite/_ui.py new file mode 100644 index 0000000..ba43f27 --- /dev/null +++ b/src/bpetite/_ui.py @@ -0,0 +1,159 @@ +"""Shared Rich presentation layer for the bpetite command-line interface. + +All human-readable output across ``train``, ``encode``, and ``decode`` is +routed through the single :data:`console` instance defined in this module. +The console is constructed with ``stderr=True`` so every Rich render (the +banner, configuration panels, progress lines, and error panels) lands on +stderr without polluting the machine-readable contract on stdout. + +Machine-readable results (``train`` JSON summary, ``encode`` compact JSON +array, ``decode`` raw text) are **not** rendered through this module. They +are written via ``sys.stdout.write`` directly in ``_cli.py`` so no Rich +markup, theme, or styling can bleed into the stdout contract enforced by +FR-33 and FR-34. + +The banner is a deliberate brand identity element. It renders only when +both stderr and stdout are interactive terminals and the terminal is wide +enough to hold the art cleanly; in any non-interactive context (pipes, +redirects, CI capture, subprocess tests) the banner is suppressed so the +stderr stream stays quiet and test assertions remain stable.
+""" + +import sys +from pathlib import Path +from typing import Final + +from rich.box import ROUNDED +from rich.console import Console, RenderableType +from rich.markup import escape as markup_escape +from rich.panel import Panel +from rich.table import Table +from rich.text import Text +from rich.theme import Theme + +_BANNER_PATH: Final[Path] = Path(__file__).parent / "_banner.txt" +_BANNER_MIN_COLUMNS: Final[int] = 95 + +_THEME: Final[Theme] = Theme( + { + "info": "cyan", + "success": "bold green", + "warning": "bold yellow", + "error": "bold red", + "muted": "dim white", + "heading": "bold white", + "banner": "bold magenta", + "accent": "bold cyan", + "label": "bold cyan", + "value": "white", + } +) + +console: Final[Console] = Console( + stderr=True, + theme=_THEME, + soft_wrap=False, + highlight=False, +) + + +def _load_banner() -> str: + return _BANNER_PATH.read_text(encoding="utf-8").rstrip("\n") + + +def is_fully_interactive() -> bool: + """Return ``True`` only when both stderr and stdout are TTYs. + + The CLI uses this gate to decide whether decorative output is + appropriate for the current run. When stdout is captured for any + reason — shell command substitution, ``subprocess.run(stdout=PIPE)``, + file redirection — the caller is in machine-consumption mode and the + decorative stderr surface should stay quiet so wrappers that treat any + stderr bytes as a warning signal do not regress. Both streams must be + TTYs to consider the run fully interactive. + """ + return console.is_terminal and sys.stdout.isatty() + + +def banner_enabled() -> bool: + """Return ``True`` when the banner can render cleanly on this stream. + + The check combines a full-interactivity probe with a minimum-width + floor: a non-interactive run (CI, pipes, subprocess capture, captured + stdout) yields ``False`` so redirected output stays free of ornament, + and a terminal narrower than the art yields ``False`` so the banner + never wraps. 
+ """ + return is_fully_interactive() and console.size.width >= _BANNER_MIN_COLUMNS + + +def render_banner() -> None: + """Print the centered ASCII banner to stderr if the terminal allows it.""" + if not banner_enabled(): + return + console.print(_load_banner(), style="banner", justify="center") + console.print() + + +def _kv_table(rows: list[tuple[str, str]]) -> Table: + table = Table.grid(padding=(0, 2)) + table.add_column(style="label", justify="left", no_wrap=True) + table.add_column(style="value", justify="left", overflow="fold") + for label, value in rows: + # Wrap value in Text so Rich does not parse `[...]` as markup. Values + # frequently carry user-supplied paths and free-form text that may + # contain literal bracket characters; markup parsing would either + # silently rewrite the content or raise MarkupError mid-render. + table.add_row(label, Text(value)) + return table + + +def render_box( + body: RenderableType, + title: str, + border_style: str = "cyan", +) -> None: + """Render ``body`` inside a full-width rounded Panel on stderr.""" + console.print( + Panel( + body, + title=f"[heading]{title}[/heading]", + border_style=border_style, + box=ROUNDED, + padding=(1, 2), + expand=True, + ) + ) + + +def render_kv_box( + rows: list[tuple[str, str]], + title: str, + border_style: str = "cyan", +) -> None: + """Render a label/value table inside a full-width Panel on stderr.""" + render_box(_kv_table(rows), title=title, border_style=border_style) + + +def render_error(title: str, message: str, hint: str | None = None) -> None: + """Render a fatal-error Panel on stderr with optional recovery hint. + + ``message`` and ``hint`` are escaped through ``rich.markup.escape`` before + being interpolated into the markup-bearing body so that user-supplied + content (paths, exception strings, model artifact text) cannot inject or + break Rich markup. ``title`` is left as-is because every call site passes + a hardcoded literal. 
+ """ + body = f"[error]{markup_escape(message)}[/error]" + if hint: + body += f"\n\n[info]{markup_escape(hint)}[/info]" + console.print( + Panel( + body, + title=f"[error]{title}[/error]", + border_style="red", + box=ROUNDED, + padding=(1, 2), + expand=True, + ) + ) diff --git a/uv.lock b/uv.lock index 96283a9..cfea5c8 100644 --- a/uv.lock +++ b/uv.lock @@ -8,6 +8,7 @@ version = "0.1.0" source = { editable = "." } dependencies = [ { name = "regex" }, + { name = "rich" }, ] [package.dev-dependencies] @@ -28,7 +29,10 @@ docs = [ ] [package.metadata] -requires-dist = [{ name = "regex" }] +requires-dist = [ + { name = "regex" }, + { name = "rich", specifier = ">=13.0" }, +] [package.metadata.requires-dev] dev = [ @@ -589,6 +593,19 @@ wheels = [ { url = "https://files.pythonhosted.org/packages/d7/8e/7540e8a2036f79a125c1d2ebadf69ed7901608859186c856fa0388ef4197/requests-2.33.1-py3-none-any.whl", hash = "sha256:4e6d1ef462f3626a1f0a0a9c42dd93c63bad33f9f1c1937509b8c5c8718ab56a", size = 64947, upload-time = "2026-03-30T16:09:13.83Z" }, ] +[[package]] +name = "rich" +version = "15.0.0" +source = { registry = "https://pypi.org/simple" } +dependencies = [ + { name = "markdown-it-py" }, + { name = "pygments" }, +] +sdist = { url = "https://files.pythonhosted.org/packages/c0/8f/0722ca900cc807c13a6a0c696dacf35430f72e0ec571c4275d2371fca3e9/rich-15.0.0.tar.gz", hash = "sha256:edd07a4824c6b40189fb7ac9bc4c52536e9780fbbfbddf6f1e2502c31b068c36", size = 230680, upload-time = "2026-04-12T08:24:00.75Z" } +wheels = [ + { url = "https://files.pythonhosted.org/packages/82/3b/64d4899d73f91ba49a8c18a8ff3f0ea8f1c1d75481760df8c68ef5235bf5/rich-15.0.0-py3-none-any.whl", hash = "sha256:33bd4ef74232fb73fe9279a257718407f169c09b78a87ad3d296f548e27de0bb", size = 310654, upload-time = "2026-04-12T08:24:02.83Z" }, +] + [[package]] name = "ruff" version = "0.15.10"
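Reviewer note: the channel discipline this change enforces in `_cli.py` (human-readable output on stderr, exactly one machine-readable result on stdout) can be illustrated with a minimal stdlib-only sketch. The helper name `emit_train_result` and the sample values are hypothetical and not part of this diff; the summary keys mirror those written by `_cmd_train`, and a plain `err.write` stands in for the Rich stderr console.

```python
import io
import json
import sys
from typing import TextIO


def emit_train_result(
    summary: dict[str, object],
    *,
    out: TextIO = sys.stdout,
    err: TextIO = sys.stderr,
) -> None:
    """Hypothetical helper: status text to ``err``, one JSON line to ``out``."""
    # Human-readable status goes to stderr only (the real CLI renders a
    # Rich panel here; a plain line stands in for it in this sketch).
    err.write(
        f"Training complete: vocab={summary['actual_mergeable_vocab_size']}\n"
    )
    # The machine-readable result is the only thing written to stdout.
    out.write(json.dumps(summary) + "\n")
    out.flush()


# Capture both channels separately, as a subprocess-based test would.
captured_out, captured_err = io.StringIO(), io.StringIO()
sample_summary = {  # illustrative values only
    "corpus_bytes": 1024,
    "requested_vocab_size": 300,
    "actual_mergeable_vocab_size": 300,
    "special_token_count": 1,
    "elapsed_ms": 12.5,
}
emit_train_result(sample_summary, out=captured_out, err=captured_err)

# A consumer can parse stdout directly because no decoration reaches it.
parsed = json.loads(captured_out.getvalue())
```

Passing the streams as parameters is only a convenience for testing the sketch; the code in this diff writes to `sys.stdout` directly and routes everything else through the shared `console`.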