feat(scripts): replace download_data.sh with download_corpus.py#34
Merged
Conversation
bpetite workflows
PR #34: all tracked workflows are green. Merge readiness is still subject to branch protection rules. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Closes Task 4-3 by landing the pure-Python TinyShakespeare downloader the PRD specifies, retiring the
scripts/download_data.shbash stopgap that was explicitly scaffolded as a placeholder until this task shipped, and updating theforbid-core-networkingpre-commit hook docstring to match.Why
The PRD pins
scripts/download_corpus.pyas the one authoritative path for fetching the TinyShakespeare demo corpus, andscripts/download_data.shwas a temporary bash shim whose own header read "PRD task 4-3 specifies this as scripts/download_corpus.py... Until that task lands, this bash shim fetches the same corpus." With Task 4-3 landing, the shim's exit condition is met — keeping both helpers around would leave two downloaders for the same destination and a stale hook docstring referencing a file that no longer exists.Moving the fetch to
urllib.requestalso kills thecurlruntime dependency for a fresh reviewer clone. Downstream Phase 4 tasks (benchmark harness, README walkthrough) can now assume a stdlib-only corpus-bootstrap path on both supported OS targets without installing extra tooling.Changes
scripts/download_corpus.py— new 80-line stdlib-only downloader:https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txtviaurllib.request.urlopenwith a 30-second timeoutdata/tinyshakespeare.txtat the repo root to preserve the corpus byte-for-bytedata/viamkdir(parents=True, exist_ok=True)so a fresh clone works without pre-setupurllib.error.URLError,TimeoutError, andOSErrorand translates each into an exit-1 with a specific stderr messagestderr;stdoutis intentionally left empty so the script matches the rest of thebpetitestdout/stderr convention# noqa: S310on theurlopencall — the URL is a hardcoded constant, not user inputdownload_data.sh, and the network-boundary scoping so a future reader understands why a network-bound helper lives inscripts/scripts/download_data.sh— deleted. Its own header comment declared it a stopgap that would be removed when Task 4-3 landed.scripts/hooks/forbid_core_networking.py— hook docstring updated from "The only networked helper isscripts/download_data.sh(to be ported toscripts/download_corpus.pyin PRD task 4-3)" to "The only networked helper isscripts/download_corpus.py(TinyShakespeare demo corpus fetcher, landed in PRD task 4-3)". The hook's runtime behavior (AST scan for networking imports insidesrc/bpetite/) is unchanged.Validation
uv run pytest— 192 passeduv run ruff check .— cleanuv run ruff format --check .— 30 files already formatteduv run mypy --strict— no issues found in 29 source filesManual checks:
python scripts/download_corpus.py→ exit 0,data/tinyshakespeare.txtis 1,115,394 bytes (canonical size), opens with"First Citizen:", stderr carries two status lines, stdout empty.git check-ignore -v data/tinyshakespeare.txtresolves to.gitignore:49:data/tinyshakespeare.txt— the corpus is gitignored and will never be committed.rg 'download_corpus|download_data' src/returns no matches — neither the core library nor the CLI imports the script.rg 'scripts\.|from scripts|import scripts' src/returns no matches —scripts/is not on thebpetiteimport path.Risks / Follow-ups
forbid-core-networkingpre-commit hook independently guardssrc/bpetite/at commit time, so no regression can leak a network import into the core library without surfacing as a hook failure.data/tinyshakespeare.txtproduced by this downloader and record actual timings on Dinesh's machine. No code change in this PR blocks or advances that task, but it is the natural next step in Phase 4.