Skip to content

feat(scripts): replace download_data.sh with download_corpus.py#34

Merged
dinesh-git17 merged 1 commit into
mainfrom
feat/download-corpus-py
Apr 15, 2026
Merged

feat(scripts): replace download_data.sh with download_corpus.py#34
dinesh-git17 merged 1 commit into
mainfrom
feat/download-corpus-py

Conversation

@dinesh-git17

Copy link
Copy Markdown
Owner

Summary

Closes Task 4-3 by landing the pure-Python TinyShakespeare downloader the PRD specifies, retiring the scripts/download_data.sh bash stopgap that was explicitly scaffolded as a placeholder until this task shipped, and updating the forbid-core-networking pre-commit hook docstring to match.

Why

The PRD pins scripts/download_corpus.py as the one authoritative path for fetching the TinyShakespeare demo corpus, and scripts/download_data.sh was a temporary bash shim whose own header read "PRD task 4-3 specifies this as scripts/download_corpus.py... Until that task lands, this bash shim fetches the same corpus." With Task 4-3 landing, the shim's exit condition is met — keeping both helpers around would leave two downloaders for the same destination and a stale hook docstring referencing a file that no longer exists.

Moving the fetch to urllib.request also kills the curl runtime dependency for a fresh reviewer clone. Downstream Phase 4 tasks (benchmark harness, README walkthrough) can now assume a stdlib-only corpus-bootstrap path on both supported OS targets without installing extra tooling.

Changes

  • scripts/download_corpus.py — new 80-line stdlib-only downloader:
    • Fetches https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt via urllib.request.urlopen with a 30-second timeout
    • Writes bytes (not text) to data/tinyshakespeare.txt at the repo root to preserve the corpus byte-for-byte
    • Creates data/ via mkdir(parents=True, exist_ok=True) so a fresh clone works without pre-setup
    • Catches urllib.error.URLError, TimeoutError, and OSError and translates each into an exit-1 with a specific stderr message
    • All human-readable output goes to stderr; stdout is intentionally left empty so the script matches the rest of the bpetite stdout/stderr convention
    • # noqa: S310 on the urlopen call — the URL is a hardcoded constant, not user input
    • Docstring documents the PRD-authoritative path, the retirement of download_data.sh, and the network-boundary scoping so a future reader understands why a network-bound helper lives in scripts/
  • scripts/download_data.sh — deleted. Its own header comment declared it a stopgap that would be removed when Task 4-3 landed.
  • scripts/hooks/forbid_core_networking.py — hook docstring updated from "The only networked helper is scripts/download_data.sh (to be ported to scripts/download_corpus.py in PRD task 4-3)" to "The only networked helper is scripts/download_corpus.py (TinyShakespeare demo corpus fetcher, landed in PRD task 4-3)". The hook's runtime behavior (AST scan for networking imports inside src/bpetite/) is unchanged.

Validation

  • uv run pytest — 192 passed
  • uv run ruff check . — clean
  • uv run ruff format --check . — 30 files already formatted
  • uv run mypy --strict — no issues found in 29 source files

Manual checks:

  • End-to-end download verified: python scripts/download_corpus.py → exit 0, data/tinyshakespeare.txt is 1,115,394 bytes (canonical size), opens with "First Citizen:", stderr carries two status lines, stdout empty.
  • git check-ignore -v data/tinyshakespeare.txt resolves to .gitignore:49:data/tinyshakespeare.txt — the corpus is gitignored and will never be committed.
  • rg 'download_corpus|download_data' src/ returns no matches — neither the core library nor the CLI imports the script.
  • rg 'scripts\.|from scripts|import scripts' src/ returns no matches — scripts/ is not on the bpetite import path.

Risks / Follow-ups

  • The downloader reaches the public internet on every run; it is intended for local demo/benchmark setup and is never called from the library or CLI runtime path. The forbid-core-networking pre-commit hook independently guards src/bpetite/ at commit time, so no regression can leak a network import into the core library without surfacing as a hook failure.
  • Task 4-4 (benchmark harness) is human-owned — it will consume data/tinyshakespeare.txt produced by this downloader and record actual timings on Dinesh's machine. No code change in this PR blocks or advances that task, but it is the natural next step in Phase 4.

@github-actions

github-actions Bot commented Apr 15, 2026

Copy link
Copy Markdown

bpetite workflows

Workflow Status Comment if failure and where
tests success ok
lint success ok
syntax success ok
format success ok
types success ok
build success ok
cli-smoke success ok
determinism success ok
policy-guard success ok
ci-meta success ok

PR #34: all tracked workflows are green. Merge readiness is still subject to branch protection rules.

@dinesh-git17 dinesh-git17 merged commit ce2dea1 into main Apr 15, 2026
15 checks passed
@dinesh-git17 dinesh-git17 deleted the feat/download-corpus-py branch April 15, 2026 09:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant