From 15fbe1d0b14229604e8aab2744ea968ce29e41fb Mon Sep 17 00:00:00 2001 From: "aaftaabv@gmail.com" Date: Fri, 5 Jun 2026 12:45:39 +0530 Subject: [PATCH 01/10] Add review-curator-pr Cursor skill Add a self-contained Cursor Agent Skill under .cursor/skills/review-curator-pr that helps contributors and maintainers review NVIDIA-NeMo/Curator pull requests consistently. The skill: - pulls fresh PR data via the GitHub CLI (REST endpoints + GraphQL review threads, which carry isResolved/isOutdated); - generates a structured review digest and a paste-ready open-comment queue; - applies review lenses built on the repo's own .cursor/rules contracts and CONTRIBUTING.md standards (stage contracts, setup lifecycle, dependency hygiene, secret-safe logging, tests/coverage, PR reviewability). Includes two helper scripts (scripts/pr_review_pull.sh, scripts/build_digest.py) that write to a scratch .curator-pr-review/ directory, which is gitignored. Signed-off-by: aaftaabv@gmail.com --- .cursor/skills/review-curator-pr/SKILL.md | 140 +++++++++ .../review-curator-pr/recurring-themes.md | 127 ++++++++ .cursor/skills/review-curator-pr/reference.md | 86 ++++++ .../review-curator-pr/scripts/README.md | 32 ++ .../review-curator-pr/scripts/build_digest.py | 279 ++++++++++++++++++ .../scripts/pr_review_pull.sh | 92 ++++++ .cursor/skills/review-curator-pr/templates.md | 134 +++++++++ .gitignore | 3 + 8 files changed, 893 insertions(+) create mode 100644 .cursor/skills/review-curator-pr/SKILL.md create mode 100644 .cursor/skills/review-curator-pr/recurring-themes.md create mode 100644 .cursor/skills/review-curator-pr/reference.md create mode 100644 .cursor/skills/review-curator-pr/scripts/README.md create mode 100755 .cursor/skills/review-curator-pr/scripts/build_digest.py create mode 100755 .cursor/skills/review-curator-pr/scripts/pr_review_pull.sh create mode 100644 .cursor/skills/review-curator-pr/templates.md diff --git a/.cursor/skills/review-curator-pr/SKILL.md b/.cursor/skills/review-curator-pr/SKILL.md new file mode 100644 index 0000000000..cf6a7eb657 --- /dev/null +++ b/.cursor/skills/review-curator-pr/SKILL.md @@ -0,0 +1,140 @@ +--- +name: review-curator-pr +description: >- + Review NVIDIA-NeMo/Curator pull requests against the project's stage + contracts and contribution standards. Pulls fresh PR data with the GitHub + CLI, diffs the changed code, applies Curator's documented review lenses, and + emits a structured review digest plus a paste-ready comment queue. Use when a + contributor or maintainer asks to review, re-review, or triage a Curator PR + (e.g. "review PR 1967", "what's still open on PR 1898", "triage this PR"), or + to draft replies to reviewer comments. +--- + +# Review NeMo Curator PRs + +Produce a defensible, reproducible review of an `NVIDIA-NeMo/Curator` PR from a +fresh GitHub pull plus the code diff. The audience is the PR author (who must +respond to reviewers and land the change) and maintainers triaging the queue. + +Requirements: the GitHub CLI (`gh`) authenticated against `github.com`, and a +local checkout of this repository (the one containing this skill). + +## Two deliverables (always) + +Each review run writes two Markdown files into a scratch directory (default +`.curator-pr-review/`, override with `--outdir`): + +1. `curator_pr_fresh_review_.md` - full digest: PR state, + commits, changed-file table, reviews, every inline comment grouped by file + with OPEN / OUTDATED / RESOLVED status, and issue comments. +2. `curator_pr_github_comment_queue_.md` - only the OPEN root + threads that need an author reply or code change, paste-ready, with stale + threads listed at the bottom for traceability. + +See [templates.md](templates.md) for the exact section layout. Keeping the +structure stable makes diffs across review passes mechanical, so a re-review +only surfaces what changed. + +## Workflow + +``` +- [ ] 1. Identify the PR number +- [ ] 2. Pull fresh GitHub data (gh) +- [ ] 3. Check out the PR head and diff against the base +- [ ] 4. Read the changed code through the review lenses +- [ ] 5. Generate the digest + comment queue +- [ ] 6. Classify findings P0-P3 and (optionally) draft replies +``` + +### Step 1 - Identify the PR + +Confirm the target: `gh pr view --repo NVIDIA-NeMo/Curator`. If a prior +review digest for this PR exists, read it first and record its head SHA as the +baseline so commits and comments can be marked NEW vs already-seen. + +### Step 2 - Pull fresh GitHub data + +Run the pull script (writes `pr_*_latest.json` plus timestamped snapshots so +prior pulls are preserved for delta analysis): + +```bash +.cursor/skills/review-curator-pr/scripts/pr_review_pull.sh +``` + +It pulls six REST endpoints (`pr view`, `reviews`, inline `comments`, issue +`comments`, `files`, `commits`) and the GraphQL review threads, which carry the +`isResolved` / `isOutdated` flags the REST inline endpoint omits. + +### Step 3 - Diff the code + +```bash +gh pr checkout # or: git fetch origin pull//head:pr- +git diff origin/main...HEAD # changed files on this PR +``` + +Always review the actual diff, never the comment text alone. Use `main` as the +baseline for "how the project already does this". + +### Step 4 - Read through the review lenses + +Apply the lenses in [recurring-themes.md](recurring-themes.md): stage +contracts, setup/teardown lifecycle, dependency and optional-extra hygiene, +secret-safe logging, memory/serialization of large payloads, streaming and +performance, tutorials/docs, tests/coverage, and PR size/reviewability. + +These build on the project's own always-on rules in `.cursor/rules/` +(`processing-stage-patterns`, `composite-stage-patterns`, `task-patterns`, +`executors`, `resources-configuration`, `pipeline-structure`, +`modality-structure`, `coding-standards`) - treat those as the canonical +contract and cite them when a change deviates. + +### Step 5 - Generate the two files + +```bash +.cursor/skills/review-curator-pr/scripts/build_digest.py --today +``` + +The builder joins the pulled JSON and classifies each comment by thread status: +**OPEN** (actionable), **OUTDATED** (pre-dates the current head), **RESOLVED**. +The comment queue lists only OPEN root threads. + +### Step 6 - Findings and replies + +In the digest's findings section, classify each issue by severity: + +- **P0** - merge blocker: data loss / crash on a valid config, a secret leaked + into logs, or an import that breaks a supported install. +- **P1** - fix before merge: undeclared runtime dependencies, missing required + tests/coverage, environment-specific code in a reusable path, a PR too large + to review safely. +- **P2** - should fix: duplicated logic, inefficient batch handling, incomplete + metrics/observability, documented-but-unwired config. +- **P3** - nice to have: temporary workarounds that need a test + comment, + debug-only knobs to document. + +When drafting replies, cite the exact `path:line-range` on the current head, +state whether the point is already addressed (with the commit SHA) or will be, +and keep each reply self-contained. See [templates.md](templates.md) section C. + +## Conventions to enforce (from CONTRIBUTING.md and rules) + +- **DCO sign-off** is required on every commit (`git commit -s`). +- **Tests** must accompany source changes; GPU-only tests use + `@pytest.mark.gpu`; the project enforces 80% coverage under `nemo_curator/`, + and `tests/` mirrors the `nemo_curator/` layout. +- **NVIDIA Apache-2.0 copyright header** on every non-empty Python file. +- **Ruff** lint/format, 119-char lines; **loguru** for logging. +- Supported Python: 3.10-3.12. + +## Knowledge sources + +Full index in [reference.md](reference.md). Quick map: + +| Need | Source | +|------|--------| +| Stage / pipeline contracts | `.cursor/rules/*.mdc`, `nemo_curator/stages/base.py` | +| Modality guides | `nemo_curator/stages//README.md`, `tutorials//README.md` | +| Backends / executors | `nemo_curator/backends/`, `nemo_curator/pipeline/pipeline.py` | +| Perf expectations | `benchmarking/` | +| Contribution rules | `CONTRIBUTING.md` | +| Live PR data | `gh` (see scripts) | diff --git a/.cursor/skills/review-curator-pr/recurring-themes.md b/.cursor/skills/review-curator-pr/recurring-themes.md new file mode 100644 index 0000000000..d0599d2276 --- /dev/null +++ b/.cursor/skills/review-curator-pr/recurring-themes.md @@ -0,0 +1,127 @@ +# Review lenses - recurring themes in Curator PRs + +Apply each lens to the diff and cite exact `path:line` evidence. These extend +the always-on rules in `.cursor/rules/`; when a change violates a rule, link the +rule by name. Most points are modality-agnostic; audio examples are +illustrative. + +## Stage contracts and reuse + +- A stage subclasses `ProcessingStage[InputTask, OutputTask]` with explicit + `inputs()` / `outputs()`, a meaningful `name`, and appropriate `resources` / + `batch_size` (set as class attributes / dataclass fields, never as the + read-only `_name` / `_resources` / `_batch_size` properties). +- GPU or vectorizable work belongs in `process_batch()` with a real + `batch_size`; `__init__` stays lightweight (it runs on the driver). +- Reuse existing readers, writers, and shared model abstractions instead of + bespoke orchestration. Use `CompositeStage` to present several internal stages + as one user-facing stage. +- If `inputs()` does not declare a required key, runtime validation will not + check it, and a direct `task.data[key]` access then raises an opaque + `KeyError` instead of a clear validation error. Declare required keys. +- Fan-out stages (one task -> many) need deterministic child IDs and propagated + `_metadata` / `_stage_perf`. First-stage discovery from an empty input should + constrain placement so work is not duplicated across every worker. + +## Setup / teardown lifecycle and dependencies + +- `setup_on_node()` = once-per-node prep (e.g. model downloads / shared cache). + `setup()` = per-worker init (model load, GPU allocation). `teardown()` frees + resources. Don't conflate them. +- Import heavy or optional dependencies lazily inside `setup()` (or behind a + guarded import), never unconditionally at module top level, when the dependency + ships only with an optional extra. A top-level `import ` breaks + `import nemo_curator...` on any install without that extra. +- Declare the full runtime dependency set in the appropriate `pyproject.toml` + optional extra; add an import smoke test under that extra. Don't rely on an + external Docker image to supply dependencies a reusable stage needs. +- Avoid hard version pins (`pkg==x.y.z`) and global overrides for libraries + shared across modalities (e.g. `transformers`, `huggingface-hub`, + `accelerate`); prefer compatible ranges so other extras don't break. + +## Secret-safe logging + +- Never log a full resolved config object; it can contain tokens / credentials + (e.g. an HF token, access keys). Redact secret-like keys before logging, or + log only a non-secret subset, and add a test asserting the secret value never + appears in logged output. + +## Memory and serialization of large payloads + +- Drop large in-memory payloads (waveform arrays, image/video tensors) from + `task.data` once a stage no longer needs them. Leaked arrays reach downstream + JSON writers and raise `TypeError: Object of type ndarray is not JSON + serializable` (which can fail an entire batch) or cause memory blowups. Strip + by key plus a duck-typed guard, and wrap `json.dumps` so a stray + non-serializable value reports the offending key instead of crashing. +- Guard external I/O edge cases (e.g. `tarfile.extractfile()` returns `None` for + non-regular members even after `isfile()`; check before `.read()`). + +## Streaming, performance, benchmarking + +- Streaming execution with per-stage resource configuration is the default model + for scalable pipelines (see `executors.mdc`). +- Emit stage metrics through the supported metrics path; keep records + JSON-shaped. For multi-node analysis, include worker/node/GPU identity and + latency distributions (p50/p95/p99), not just totals/averages. +- Writers that loop per task inside `process_batch()` reopen output files + repeatedly; group by output target and open once per batch. +- Standalone benchmark scripts belong in the `benchmarking/` flow with config + entries and comparable parameters - not freestanding. + +## Environment-specific code in reusable paths + +- Keep infrastructure assumptions (specific object-store URLs, auth env vars, + cluster-specific paths) out of generic reader/writer code. Make remote access + pluggable: accept local paths, ordinary cloud URIs, or a user-provided + opener/resolver, and isolate any environment-specific support behind a small, + documented, optional component. + +## Tutorials and docs + +- Tutorials must be runnable, minimal, and use public APIs only - no private + APIs, no machine-local absolute paths, no committed noisy notebook output, no + executor mismatch between a notebook and its CLI. Prefer `uv` for environment + setup if the tutorial standard does. +- Don't duplicate pipeline-construction logic between a tutorial entrypoint and + an `examples/` script; extract one shared builder both call. +- Every config key documented for users must actually reach a stage; validate + config-to-stage mapping and fail (or warn) on unused keys. +- Don't commit generated analysis/scratch docs into the source tree. + +## Tests and standards (CONTRIBUTING + coding-standards) + +- Source changes need accompanying tests; `tests/` mirrors `nemo_curator/`. The + project enforces 80% coverage under `nemo_curator/`. GPU-only tests use + `@pytest.mark.gpu`. +- Every non-empty Python file carries the NVIDIA Apache-2.0 copyright header. +- Ruff clean at 119-char lines; type annotations on functions; loguru for logs. +- Every commit is DCO signed-off. + +## PR size and reviewability + +- Large multi-purpose PRs are hard to review and to defend. Prefer splitting by + stage / module / tutorial / benchmark. If a single PR must remain, organize it + as if sliced: no duplicated builder logic, clear module ownership, and a + reviewable commit sequence. + +## Smaller recurring nits + +- Missing copyright headers on new files. +- Top-level imports that should be lazy/guarded. +- Long inline strings (prompts, configs) that should be module constants. +- Dataclass fields validated in `__post_init__` that could just be typed. +- Dead conditionals / redundant truthiness checks. +- Tests added for tutorials that don't warrant them. +- Debug-only knobs presented as features - document them as debug throttles. +- `trust_remote_code=True` (or similar) hardcoded - expose as a parameter so + security-conscious users can opt out. + +## Severity mapping + +| Severity | Meaning | +|----------|---------| +| P0 | Merge blocker: crash/data-loss on a valid config, secret leaked to logs, import breaks a supported install | +| P1 | Fix before merge: undeclared deps, missing required tests/coverage, env-specific code in reusable path, PR too large | +| P2 | Should fix: duplicated logic, inefficient batching, incomplete metrics, documented-but-unwired config | +| P3 | Nice to have: temporary workaround needing test+comment, debug knob docs | diff --git a/.cursor/skills/review-curator-pr/reference.md b/.cursor/skills/review-curator-pr/reference.md new file mode 100644 index 0000000000..40bdd0c0cc --- /dev/null +++ b/.cursor/skills/review-curator-pr/reference.md @@ -0,0 +1,86 @@ +# Reference - NeMo Curator PR review + +All paths are relative to the repository root. The skill is self-contained: it +relies only on this repository plus the GitHub CLI (`gh`). + +## Canonical contracts (`.cursor/rules/`) + +These always-on rules are the authoritative description of how stages and +pipelines must be written. Cite them when a PR deviates. + +| Rule | Covers | +|------|--------| +| `processing-stage-patterns.mdc` | `ProcessingStage` subclassing, `inputs()`/`outputs()`, `process()`, lifecycle hooks | +| `composite-stage-patterns.mdc` | `CompositeStage` decomposition | +| `task-patterns.mdc` | `Task` / typed task subclasses, metadata flow | +| `executors.mdc` | Xenna / Ray Data backends and when to use each | +| `resources-configuration.mdc` | `Resources`, `batch_size`, GPU allocation | +| `pipeline-structure.mdc` | Pipeline assembly | +| `modality-structure.mdc` | Where modality code/tests/tutorials live | +| `coding-standards.mdc` | Ruff, copyright header, loguru, pytest, Python versions | + +## Framework code (diff ground truth) + +- `nemo_curator/stages/base.py` - `ProcessingStage` base class. +- `nemo_curator/backends/` - `base.py`, `xenna/`, `ray_data/`, + `ray_actor_pool/`; adapters define how stages run on each backend. +- `nemo_curator/pipeline/pipeline.py` - pipeline construction and guards. +- `nemo_curator/tasks/` - task types. + +## Modality guides + +| Modality | Stage README | Tutorials | +|----------|--------------|-----------| +| audio | `nemo_curator/stages/audio/README.md` | `tutorials/audio/` | +| interleaved | `nemo_curator/stages/interleaved/README.md` | `tutorials/interleaved/` | +| text / image / video / math / synthetic | (stage code under `nemo_curator/stages/`) | `tutorials/{text,image,video,math,synthetic}/` | + +## Performance / benchmarking + +- `benchmarking/README.md` - how benchmarks are wired and run. +- `benchmarking/AUDIO_PROFILING.md`, `benchmarking/ALM_BENCHMARK.md` - modality + perf expectations. Standalone benchmark scripts should be integrated here, not + left freestanding. + +## Contribution rules + +- `CONTRIBUTING.md` - DCO sign-off (`git commit -s`), PR process, test + requirements, coverage threshold, copyright header. + +## GitHub data via `gh` + +`scripts/pr_review_pull.sh ` pulls these into the scratch outdir: + +| File | Endpoint | +|------|----------| +| `pr_gh_latest.json` | `gh pr view --json ...` | +| `pr_reviews_latest.json` | `repos/:owner/:repo/pulls//reviews` | +| `pr_review_comments_latest.json` | `repos/:owner/:repo/pulls//comments` (inline) | +| `pr_issue_comments_latest.json` | `repos/:owner/:repo/issues//comments` | +| `pr_files_latest.json` | `repos/:owner/:repo/pulls//files` | +| `pr_commits_latest.json` | `repos/:owner/:repo/pulls//commits` | +| `pr_review_threads_latest.json` | GraphQL `pullRequest.reviewThreads` (isResolved/isOutdated) | + +The REST inline-comments endpoint does not expose resolve/outdate state; the +GraphQL thread payload does. The builder joins them by comment `databaseId` +(== REST `id`), falling back to a `(path, body-prefix)` match when an older +thread dump lacks `databaseId`. + +GraphQL query used by the pull script: + +```graphql +query($owner:String!,$repo:String!,$pr:Int!){ + repository(owner:$owner,name:$repo){ + pullRequest(number:$pr){ + reviewThreads(first:100){ nodes{ + id isResolved isOutdated isCollapsed line originalLine path + comments(first:50){ nodes{ databaseId } } + }}}}} +``` + +## Scripts + +- `scripts/pr_review_pull.sh [--outdir DIR] [--repo OWNER/REPO]` - fetch. +- `scripts/build_digest.py [--outdir DIR] [--today YYYY-MM-DD] [--prev-head SHA] [--baseline-ts TS]` - render the digest + comment queue. + +Default outdir: `.curator-pr-review/` (scratch; safe to delete or gitignore). diff --git a/.cursor/skills/review-curator-pr/scripts/README.md b/.cursor/skills/review-curator-pr/scripts/README.md new file mode 100644 index 0000000000..17f8f23d42 --- /dev/null +++ b/.cursor/skills/review-curator-pr/scripts/README.md @@ -0,0 +1,32 @@ +# Scripts - review-curator-pr + +Both require the GitHub CLI (`gh`) authenticated against github.com. They write +to a scratch directory (default `.curator-pr-review/`) that is safe to delete; +do not commit its contents. + +## pr_review_pull.sh + +```bash +.cursor/skills/review-curator-pr/scripts/pr_review_pull.sh [--outdir DIR] [--repo OWNER/REPO] +``` + +Pulls six REST endpoints (`pr view`, `reviews`, inline `comments`, issue +`comments`, `files`, `commits`) plus the GraphQL review threads into +`pr_*_latest.json` (and timestamped snapshots for delta analysis). + +## build_digest.py + +```bash +.cursor/skills/review-curator-pr/scripts/build_digest.py [--outdir DIR] [--today YYYY-MM-DD] [--prev-head SHA] [--baseline-ts TS] +``` + +Joins the `pr_*_latest.json` files and writes +`curator_pr_fresh_review_.md` + `curator_pr_github_comment_queue_.md`. + +- `--prev-head` sets the SHA in the "commits since prior reviewed head" header. +- `--baseline-ts ` (a prior pull's timestamp suffix) marks + comments/reviews/commits NEW vs already-seen. + +The thread join uses comment `databaseId` when present, else a +(path, body-prefix) fallback, so both GraphQL thread-dump shapes classify +correctly. diff --git a/.cursor/skills/review-curator-pr/scripts/build_digest.py b/.cursor/skills/review-curator-pr/scripts/build_digest.py new file mode 100755 index 0000000000..42f654d4e9 --- /dev/null +++ b/.cursor/skills/review-curator-pr/scripts/build_digest.py @@ -0,0 +1,279 @@ +#!/usr/bin/env python3 +"""Build a Curator PR fresh-review digest + open-comment queue. + +Usage: build_digest.py [--outdir DIR] [--today YYYY-MM-DD] + [--prev-head SHA] [--baseline-ts TS] + +Reads the pr_*_latest.json files written by pr_review_pull.sh from --outdir +(default .curator-pr-review) and writes into the same directory: + curator_pr_fresh_review_.md full digest + curator_pr_github_comment_queue_.md open-thread paste queue + +The GraphQL thread payload is joined to REST inline comments by `databaseId` +when present, otherwise by a (path, body-prefix) match, so both thread-dump +shapes classify correctly. Pass --baseline-ts (a prior pull's timestamp +suffix) to mark comments / reviews / commits NEW vs already-seen. +""" +from __future__ import annotations + +import argparse +import datetime as dt +import json +from pathlib import Path + +BODY_KEY_LEN = 120 + + +def load(path: Path) -> object: + return json.loads(path.read_text()) + + +def load_baseline_ids(baseline, key): + if baseline is None or not baseline.exists(): + return set() + return {entry[key] for entry in load(baseline) if key in entry} + + +def build_thread_index(threads_payload): + """Return {by_dbid, by_pathbody, nthreads}, robust to both GraphQL shapes.""" + idx = {"by_dbid": {}, "by_pathbody": {}, "nthreads": 0} + try: + nodes = threads_payload["data"]["repository"]["pullRequest"]["reviewThreads"]["nodes"] + except (KeyError, TypeError): + return idx + idx["nthreads"] = len(nodes) + for thread in nodes: + cnodes = thread["comments"]["nodes"] + path = thread.get("path") + for pos, c in enumerate(cnodes): + meta = { + "thread_id": thread.get("id"), + "is_resolved": thread.get("isResolved", False), + "is_outdated": thread.get("isOutdated", False), + "thread_comment_count": len(cnodes), + "is_first_in_thread": pos == 0, + } + db = c.get("databaseId") + if db is not None: + idx["by_dbid"][int(db)] = meta + body = (c.get("body") or "").strip() + if body: + idx["by_pathbody"].setdefault((path, body[:BODY_KEY_LEN]), meta) + return idx + + +def thread_meta(idx, c): + m = idx["by_dbid"].get(c["id"]) + if m is not None: + return m + body = (c.get("body") or "").strip() + return idx["by_pathbody"].get((c.get("path"), body[:BODY_KEY_LEN])) + + +def shorten(body, n=600): + body = (body or "").strip() + if len(body) <= n: + return body + return body[:n].rstrip() + "\n[...truncated...]" + + +def status_of(meta): + if meta is None: + return "ORPHAN" + if meta["is_resolved"]: + return "RESOLVED" + if meta["is_outdated"]: + return "OUTDATED" + return "OPEN" + + +def p(parts, *lines): + parts.extend(lines) + + +def build(pr, outdir, today, baseline_ts, prev_head): + def latest(kind): + return outdir / f"pr{pr}_{kind}_latest.json" + + gh = load(latest("gh")) + reviews = load(latest("reviews")) + review_comments = load(latest("review_comments")) + issue_comments = load(latest("issue_comments")) + files = load(latest("files")) + commits = load(latest("commits")) + threads_path = latest("review_threads") + idx = build_thread_index(load(threads_path)) if threads_path.exists() else build_thread_index({}) + + def base(kind): + return outdir / f"pr{pr}_{kind}_{baseline_ts}.json" if baseline_ts else None + + base_comment_ids = load_baseline_ids(base("review_comments"), "id") + base_review_ids = load_baseline_ids(base("reviews"), "id") + base_issue_ids = load_baseline_ids(base("issue_comments"), "id") + base_commit_shas = set() + bc = base("commits") + if bc and bc.exists(): + base_commit_shas = {c["sha"] for c in load(bc) if "sha" in c} + + head_oid = gh.get("headRefOid", "") + base_oid = gh.get("baseRefOid", "") + prev8 = (prev_head or "")[:8] or "(none)" + date_us = today.replace("-", "_") + + # ----- DIGEST ----- + d = [] + p(d, f"# Curator PR {pr} Fresh Review - {today}", "") + p(d, f"Review target: https://github.com/NVIDIA-NeMo/Curator/pull/{pr}", "") + p(d, f"Current PR head reviewed: `{head_oid}`", "") + p(d, f"Base recorded by GitHub metadata: `{base_oid}`", "") + if prev_head: + p(d, f"Previous reviewed head: `{prev_head}`", "") + p(d, "") + + p(d, "## PR state at review time", "") + p(d, "| Field | Value |", "|---|---|") + p(d, f"| state | {gh.get('state')} |") + p(d, f"| isDraft | {gh.get('isDraft')} |") + p(d, f"| reviewDecision | {gh.get('reviewDecision') or '(none)'} |") + p(d, f"| mergeable / mergeStateStatus | {gh.get('mergeable')} / {gh.get('mergeStateStatus')} |") + p(d, f"| changedFiles | {gh.get('changedFiles')} |") + p(d, f"| additions / deletions | +{gh.get('additions')} / -{gh.get('deletions')} |") + p(d, f"| commits | {len(commits)} |") + p(d, f"| inline review comments total | {len(review_comments)} |") + p(d, f"| reviews submitted | {len(reviews)} |") + p(d, f"| top-level issue comments | {len(issue_comments)} |") + p(d, f"| updatedAt | {gh.get('updatedAt')} |", "") + + new_commits = [c for c in commits if c["sha"] not in base_commit_shas] if base_commit_shas else commits + p(d, f"## Commits since prior reviewed head `{prev8}` ({len(new_commits)} new)", "") + p(d, "```") + for c in new_commits: + sha = c["sha"][:8] + msg = (c["commit"]["message"] or "").splitlines()[0] + cdate = (c["commit"]["author"] or {}).get("date", "") + p(d, f"{sha} {msg} ({cdate[:10]})") + p(d, "```", "") + + p(d, "Files changed in current PR head (`git diff ..`):", "") + p(d, "| File | +/- | status |", "|---|---|---|") + for f in sorted(files, key=lambda x: x["filename"]): + p(d, f"| `{f['filename']}` | +{f['additions']} -{f['deletions']} | {f['status']} |") + p(d, "") + + p(d, "## Reviews", "") + review_by_id = {r["id"]: r for r in reviews} + for r in sorted(reviews, key=lambda x: x.get("submitted_at") or ""): + marker = " (NEW)" if base_review_ids and r["id"] not in base_review_ids else "" + commit = (r.get("commit_id") or "")[:8] or "n/a" + p(d, f"### #{r['id']} by @{r['user']['login']} state={r['state']} " + f"commit={commit} submitted={r.get('submitted_at')}{marker}") + body = (r.get("body") or "").strip() + if body: + p(d, "", shorten(body, 1500)) + p(d, "") + + p(d, "## Inline review comments", "") + p(d, f"Total: {len(review_comments)} comments across {idx['nthreads'] or '?'} threads.", "") + by_status = {"OPEN": [], "OUTDATED": [], "RESOLVED": [], "ORPHAN": []} + for c in review_comments: + by_status[status_of(thread_meta(idx, c))].append(c) + for s in ("OPEN", "OUTDATED", "RESOLVED", "ORPHAN"): + p(d, f"- {s}: {len(by_status[s])}") + p(d, "") + + by_path = {} + for c in review_comments: + by_path.setdefault(c["path"], []).append(c) + for path in sorted(by_path): + p(d, f"### `{path}`", "") + for c in sorted(by_path[path], key=lambda x: (x.get("line") or x.get("original_line") or 0, x["created_at"])): + meta = thread_meta(idx, c) + status = status_of(meta) + new = " NEW" if base_comment_ids and c["id"] not in base_comment_ids else "" + line = c.get("line") or c.get("original_line") + review = review_by_id.get(c.get("pull_request_review_id"), {}) + commit = (c.get("commit_id") or "")[:8] + in_reply = c.get("in_reply_to_id") + reply_marker = f" reply->{in_reply}" if in_reply else "" + tcount = meta["thread_comment_count"] if meta else 1 + tpos = " (root)" if meta and meta["is_first_in_thread"] else (" (reply)" if meta else "") + p(d, f"- **#{c['id']}** @{c['user']['login']} {c['created_at']} " + f"line={line} commit={commit} status=**{status}**{new} " + f"thread_comments={tcount}{tpos} review_state={review.get('state', '?')}{reply_marker}") + p(d, f" url: {c['html_url']}") + body = shorten(c.get("body") or "", 700).replace("\n", "\n > ") + p(d, f" > {body}", "") + p(d, "") + + p(d, "## Top-level issue comments", "") + for c in sorted(issue_comments, key=lambda x: x["created_at"]): + new = " NEW" if base_issue_ids and c["id"] not in base_issue_ids else "" + p(d, f"### #{c['id']} by @{c['user']['login']} {c['created_at']}{new}", "") + p(d, shorten(c.get("body") or "", 1500), "") + + digest_path = outdir / f"curator_pr{pr}_fresh_review_{date_us}.md" + digest_path.write_text("\n".join(d) + "\n", encoding="utf-8") + print(f"wrote {digest_path} ({digest_path.stat().st_size} bytes)") + + # ----- COMMENT QUEUE (open root threads only) ----- + q = [] + p(q, f"# Curator PR {pr} GitHub Comment Queue - {today}", "") + p(q, f"Review target: https://github.com/NVIDIA-NeMo/Curator/pull/{pr}", "") + p(q, f"Current PR head: `{head_oid}`", "") + p(q, "Open inline review threads from the most recent reviewer pass that need an " + "author response or a code change. Outdated/resolved threads are listed at " + "the bottom for traceability only.", "") + + def is_root(c): + return (thread_meta(idx, c) or {}).get("is_first_in_thread", True) + + open_root = [c for c in review_comments if status_of(thread_meta(idx, c)) == "OPEN" and is_root(c)] + open_root.sort(key=lambda c: (c["path"], c.get("line") or c.get("original_line") or 0, c["created_at"])) + for i, c in enumerate(open_root, 1): + meta = thread_meta(idx, c) or {} + line = c.get("line") or c.get("original_line") + new = " NEW" if base_comment_ids and c["id"] not in base_comment_ids else "" + p(q, f"## Comment {i} - {c['path']}:{line} by @{c['user']['login']}{new}", "") + p(q, f"Thread: {c['html_url']}", "") + if meta.get("thread_comment_count", 1) > 1: + p(q, f"Thread has {meta['thread_comment_count']} comments - see digest for replies.", "") + p(q, f"File: `{c['path']}`", "") + p(q, f"Line: `{line}`", "") + p(q, "Reviewer text:", "") + p(q, "```text", (c.get("body") or "").strip(), "```", "") + + p(q, "## Stale (outdated/resolved) comments", "") + p(q, "These threads pre-date the current head and are outdated/resolved. Listed " + "for traceability only - no action needed unless a reviewer reopens.", "") + stale = [c for c in review_comments + if status_of(thread_meta(idx, c)) in ("OUTDATED", "RESOLVED") and is_root(c)] + stale.sort(key=lambda c: (status_of(thread_meta(idx, c)), c["path"], c.get("original_line") or 0)) + for c in stale: + line = c.get("line") or c.get("original_line") + body = shorten((c.get("body") or "").strip(), 200).replace("\n", " ") + p(q, f"- [{status_of(thread_meta(idx, c))}] `{c['path']}:{line}` " + f"@{c['user']['login']} ({c['html_url']}): {body}") + p(q, "") + + queue_path = outdir / f"curator_pr{pr}_github_comment_queue_{date_us}.md" + queue_path.write_text("\n".join(q) + "\n", encoding="utf-8") + print(f"wrote {queue_path} ({queue_path.stat().st_size} bytes)") + + +def main(): + ap = argparse.ArgumentParser() + ap.add_argument("pr", type=int, help="PR number") + ap.add_argument("--outdir", default=".curator-pr-review", + help="directory holding pr_*_latest.json (default .curator-pr-review)") + ap.add_argument("--today", default=None, help="ISO date for filenames; default today UTC") + ap.add_argument("--baseline-ts", default=None, + help="timestamp suffix of a prior pull to mark NEW vs already-seen") + ap.add_argument("--prev-head", default=None, help="prior reviewed head SHA for the commit delta") + args = ap.parse_args() + today = args.today or dt.datetime.utcnow().date().isoformat() + build(args.pr, Path(args.outdir), today, args.baseline_ts, args.prev_head) + return 0 + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git a/.cursor/skills/review-curator-pr/scripts/pr_review_pull.sh b/.cursor/skills/review-curator-pr/scripts/pr_review_pull.sh new file mode 100755 index 0000000000..f14f649ed3 --- /dev/null +++ b/.cursor/skills/review-curator-pr/scripts/pr_review_pull.sh @@ -0,0 +1,92 @@ +#!/usr/bin/env bash +# Fetch GitHub data for an NVIDIA-NeMo/Curator PR review. +# +# Usage: pr_review_pull.sh [--outdir DIR] [--repo OWNER/REPO] +# +# Writes pr_*_latest.json (consumed by build_digest.py) plus timestamped +# snapshots so a prior pull is preserved for delta analysis. Pulls six REST +# endpoints and the GraphQL review threads (which carry isResolved/isOutdated). +# +# Requires the GitHub CLI (`gh`) authenticated against github.com. +set -euo pipefail + +PR="" +REPO="NVIDIA-NeMo/Curator" +OUTDIR=".curator-pr-review" + +while [[ $# -gt 0 ]]; do + case "$1" in + --outdir) OUTDIR="$2"; shift 2 ;; + --repo) REPO="$2"; shift 2 ;; + -h|--help) + echo "Usage: pr_review_pull.sh [--outdir DIR] [--repo OWNER/REPO]"; exit 0 ;; + *) PR="$1"; shift ;; + esac +done +[[ -n "${PR}" ]] || { echo "error: PR number required" >&2; exit 2; } + +command -v gh >/dev/null || { echo "error: gh (GitHub CLI) not found" >&2; exit 2; } + +mkdir -p "${OUTDIR}" +TS="$(date -u +%Y%m%dT%H%M%SZ)" +LOG="${OUTDIR}/pr${PR}_review_pull_${TS}.log" + +{ + echo "=== Pull start ${TS} PR=${PR} repo=${REPO} ===" + gh --version | head -1 +} | tee -a "${LOG}" + +pull_endpoint() { + local label="$1"; shift + local outfile="$1"; shift + echo "--- ${label} -> ${outfile} ---" | tee -a "${LOG}" + "$@" > "${outfile}" + printf 'bytes=%s\n\n' "$(stat -c%s "${outfile}" 2>/dev/null || wc -c < "${outfile}")" | tee -a "${LOG}" +} + +GH_FIELDS="number,title,state,isDraft,mergeable,mergeStateStatus,headRefName,headRefOid,baseRefName,baseRefOid,additions,deletions,changedFiles,commits,reviewDecision,reviewRequests,labels,milestone,createdAt,updatedAt,closedAt,mergedAt,author,body,url,statusCheckRollup" + +pull_endpoint "pr ${PR} metadata" "${OUTDIR}/pr${PR}_gh_${TS}.json" \ + gh pr view "${PR}" --repo "${REPO}" --json "${GH_FIELDS}" +pull_endpoint "pulls/${PR}/reviews" "${OUTDIR}/pr${PR}_reviews_${TS}.json" \ + gh api --paginate "repos/${REPO}/pulls/${PR}/reviews" +pull_endpoint "pulls/${PR}/comments (inline)" "${OUTDIR}/pr${PR}_review_comments_${TS}.json" \ + gh api --paginate "repos/${REPO}/pulls/${PR}/comments" +pull_endpoint "issues/${PR}/comments (top-level)" "${OUTDIR}/pr${PR}_issue_comments_${TS}.json" \ + gh api --paginate "repos/${REPO}/issues/${PR}/comments" +pull_endpoint "pulls/${PR}/files" "${OUTDIR}/pr${PR}_files_${TS}.json" \ + gh api --paginate "repos/${REPO}/pulls/${PR}/files" +pull_endpoint "pulls/${PR}/commits" "${OUTDIR}/pr${PR}_commits_${TS}.json" \ + gh api --paginate "repos/${REPO}/pulls/${PR}/commits" + +OWNER="${REPO%%/*}" +NAME="${REPO##*/}" +pull_endpoint "graphql reviewThreads" "${OUTDIR}/pr${PR}_review_threads_${TS}.json" \ + gh api graphql \ + -f query='query($owner:String!,$repo:String!,$pr:Int!){ repository(owner:$owner,name:$repo){ pullRequest(number:$pr){ reviewThreads(first:100){ nodes{ id isResolved isOutdated isCollapsed line originalLine path comments(first:50){ nodes{ databaseId } } } } } } }' \ + -f owner="${OWNER}" -f repo="${NAME}" -F pr="${PR}" + +for kind in gh reviews review_comments issue_comments files commits review_threads; do + cp -f "${OUTDIR}/pr${PR}_${kind}_${TS}.json" "${OUTDIR}/pr${PR}_${kind}_latest.json" +done + +{ + echo "--- counts ---" + for kind in reviews review_comments issue_comments files commits; do + f="${OUTDIR}/pr${PR}_${kind}_${TS}.json" + n=$(python3 -c "import json,sys; d=json.load(open(sys.argv[1])); print(len(d) if isinstance(d,list) else 1)" "${f}") + printf 'pr%s_%s: %s entries\n' "${PR}" "${kind}" "${n}" + done + echo "--- head SHA / activity ---" + python3 - "${OUTDIR}/pr${PR}_gh_latest.json" <<'PY' +import json, sys +gh = json.loads(open(sys.argv[1]).read()) +print(f"head_oid: {gh.get('headRefOid')}") +print(f"base_oid: {gh.get('baseRefOid')}") +print(f"state: {gh.get('state')} draft={gh.get('isDraft')} mergeStatus={gh.get('mergeStateStatus')} reviewDecision={gh.get('reviewDecision')}") +print(f"counts: files={gh.get('changedFiles')} +{gh.get('additions')}/-{gh.get('deletions')}") +print(f"updated_at: {gh.get('updatedAt')}") +PY +} | tee -a "${LOG}" + +echo "PR${PR}_REVIEW_PULL_DONE outdir=${OUTDIR} log=${LOG}" | tee -a "${LOG}" diff --git a/.cursor/skills/review-curator-pr/templates.md b/.cursor/skills/review-curator-pr/templates.md new file mode 100644 index 0000000000..0cc6a86db6 --- /dev/null +++ b/.cursor/skills/review-curator-pr/templates.md @@ -0,0 +1,134 @@ +# Output templates + +`scripts/build_digest.py` emits both files automatically; these templates show +the exact structure so hand edits and the added findings/verdict narrative stay +consistent. Replace `<...>` placeholders. Keep the structure stable across +review passes so re-reviews diff cleanly. + +## A. Fresh review digest + +File: `curator_pr_fresh_review_.md` + +```markdown +# Curator PR Fresh Review - + +Review target: https://github.com/NVIDIA-NeMo/Curator/pull/ + +Current PR head reviewed: `` + +Base recorded by GitHub metadata: `` + +Previous reviewed head: `` # if a prior review exists + +## PR state at review time + +| Field | Value | +|---|---| +| state | | +| isDraft | | +| reviewDecision | | +| mergeable / mergeStateStatus | <...> / <...> | +| changedFiles | | +| additions / deletions | + / - | +| commits | | +| inline review comments total | | +| reviews submitted | | +| top-level issue comments | | +| updatedAt | | + +## Commits since prior reviewed head `` ( new) + +``` + () +``` + +Files changed in current PR head (`git diff ..`): + +| File | +/- | status | +|---|---|---| +| `` | + - | | + +## Reviews + +### # by @ state= commit= submitted= + + +## Inline review comments + +Total: comments across threads. + +- OPEN: +- OUTDATED: +- RESOLVED: +- ORPHAN: + +### `` + +- **#** @ line= commit= status=**** thread_comments= () review_state= + url: + > "> + +## Top-level issue comments + +### # by @ + + +## Findings (reviewer analysis) + +### P0: +<evidence: path:line on current head> - <why blocker> - <recommended fix + test> + +### P1 / P2 / P3: <title> +<...> + +## Verdict +<merge-ready or not, and the blockers> +``` + +## B. GitHub comment queue + +File: `curator_pr<N>_github_comment_queue_<YYYY_MM_DD>.md`. Only OPEN root +threads needing an author response or code change; stale threads at the bottom. + +```markdown +# Curator PR <N> GitHub Comment Queue - <YYYY-MM-DD> + +Review target: https://github.com/NVIDIA-NeMo/Curator/pull/<N> + +Current PR head: `<headRefOid>` + +## Comment <i> - <path>:<line> by @<login> + +Thread: <html_url> + +File: `<path>` + +Line: `<line>` + +Reviewer text: + +```text +<verbatim reviewer body> +``` + +## Stale (outdated/resolved) comments + +- [<OUTDATED/RESOLVED>] `<path>:<line>` @<login> (<html_url>): <one-line body> +``` + +## C. Reply drafting style + +For each OPEN comment the user wants a paste-ready reply for: + +- Acknowledge briefly. +- State status: already addressed (cite the commit SHA) or will-fix. +- Cite the exact `path:line-range` on the current head as evidence. +- Describe the concrete mechanism, not just "fixed". +- Keep each reply self-contained; the reviewer reads it in isolation. + +Example: + +> Thanks for flagging. Addressed on the current head (`<sha8>`): the writer now +> drops the waveform key before serialisation and guards `json.dumps`, so a +> non-serialisable value reports the offending key instead of failing the batch. +> Citation: `nemo_curator/stages/audio/io/sharded_manifest_writer.py:96-111`. diff --git a/.gitignore b/.gitignore index 1ba2c180fd..f99780f76e 100644 --- a/.gitignore +++ b/.gitignore @@ -169,3 +169,6 @@ fern/product-docs/ # PDF parsing data (symlinked to external storage) pdf_parsing/ + +# NeMo Curator PR review skill scratch output +.curator-pr-review/ From 73e346daf72c4c24b7c5261978fec500dd5bb7b4 Mon Sep 17 00:00:00 2001 From: "aaftaabv@gmail.com" <aaftaabv@gmail.com> Date: Fri, 5 Jun 2026 12:52:34 +0530 Subject: [PATCH 02/10] Scope review-curator skill to the audio modality Rename the skill to review-curator-audio-pr and refocus its guidance on audio-modality PRs (nemo_curator/stages/audio, tutorials/audio, audio benchmarks, tests/stages/audio). - SKILL.md: audio scope, step 1 confirms the PR touches audio paths, audio review lenses, audio-oriented P0-P3 examples. - reference.md: audio stage map (io, preprocessing, inference asr/vad/ speaker_diarization, filtering, segmentation, tagging, postprocessing, metrics, datasets, alm, advanced_pipelines), audio tutorials, audio benchmarks, audio tests/fixtures; anchored on stages/audio/README.md. - recurring-themes.md: audio-specific lenses - waveform/tensor memory and manifest serialization, tarred/sharded I/O, optional audio deps, sample-rate and metadata propagation, throughput. - scripts/README.md: updated paths. Helper scripts are unchanged (they fetch any PR); the audio focus lives in the review lenses. Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com> --- .../skills/review-curator-audio-pr/SKILL.md | 155 ++++++++++++++++++ .../recurring-themes.md | 142 ++++++++++++++++ .../review-curator-audio-pr/reference.md | 105 ++++++++++++ .../scripts/README.md | 9 +- .../scripts/build_digest.py | 0 .../scripts/pr_review_pull.sh | 0 .../templates.md | 0 .cursor/skills/review-curator-pr/SKILL.md | 140 ---------------- .../review-curator-pr/recurring-themes.md | 127 -------------- .cursor/skills/review-curator-pr/reference.md | 86 ---------- 10 files changed, 407 insertions(+), 357 deletions(-) create mode 100644 .cursor/skills/review-curator-audio-pr/SKILL.md create mode 100644 .cursor/skills/review-curator-audio-pr/recurring-themes.md create mode 100644 .cursor/skills/review-curator-audio-pr/reference.md rename .cursor/skills/{review-curator-pr => review-curator-audio-pr}/scripts/README.md (69%) rename .cursor/skills/{review-curator-pr => review-curator-audio-pr}/scripts/build_digest.py (100%) rename .cursor/skills/{review-curator-pr => review-curator-audio-pr}/scripts/pr_review_pull.sh (100%) rename .cursor/skills/{review-curator-pr => review-curator-audio-pr}/templates.md (100%) delete mode 100644 .cursor/skills/review-curator-pr/SKILL.md delete mode 100644 .cursor/skills/review-curator-pr/recurring-themes.md delete mode 100644 .cursor/skills/review-curator-pr/reference.md diff --git a/.cursor/skills/review-curator-audio-pr/SKILL.md b/.cursor/skills/review-curator-audio-pr/SKILL.md new file mode 100644 index 0000000000..646902b0fb --- /dev/null +++ b/.cursor/skills/review-curator-audio-pr/SKILL.md @@ -0,0 +1,155 @@ +--- +name: review-curator-audio-pr +description: >- + Review NVIDIA-NeMo/Curator audio-modality pull requests - audio stages + (readers/writers, preprocessing, VAD, ASR, speaker diarization, filtering, + segmentation, tagging, ALM), audio tutorials, and audio benchmarks. Pulls + fresh PR data with the GitHub CLI, diffs the changed code, applies Curator's + audio stage contracts and contribution standards, and emits a structured + review digest plus a paste-ready comment queue. Use when a contributor or + maintainer asks to review, re-review, or triage an audio Curator PR (e.g. + "review audio PR 1967", "what's still open on the diarization PR"), or to + draft replies to reviewer comments on an audio PR. +--- + +# Review NeMo Curator audio PRs + +Produce a defensible, reproducible review of an audio-modality +`NVIDIA-NeMo/Curator` PR from a fresh GitHub pull plus the code diff. Audience: +the PR author (who must respond to reviewers and land the change) and +maintainers triaging the audio queue. + +This skill is scoped to the **audio** modality: code under +`nemo_curator/stages/audio/`, `tutorials/audio/`, audio tests under +`tests/stages/audio/`, and audio benchmarks. For non-audio PRs, the stage- +contract lenses still apply but the audio-specific guidance below will not. + +Requirements: the GitHub CLI (`gh`) authenticated against `github.com`, and a +local checkout of this repository (the one containing this skill). + +## Two deliverables (always) + +Each review run writes two Markdown files into a scratch directory (default +`.curator-pr-review/`, override with `--outdir`): + +1. `curator_pr<N>_fresh_review_<YYYY_MM_DD>.md` - full digest: PR state, + commits, changed-file table, reviews, every inline comment grouped by file + with OPEN / OUTDATED / RESOLVED status, and issue comments. +2. `curator_pr<N>_github_comment_queue_<YYYY_MM_DD>.md` - only the OPEN root + threads that need an author reply or code change, paste-ready, with stale + threads listed at the bottom for traceability. + +See [templates.md](templates.md) for the exact section layout. Keeping the +structure stable makes diffs across review passes mechanical, so a re-review +only surfaces what changed. + +## Workflow + +``` +- [ ] 1. Identify the PR and confirm it touches audio paths +- [ ] 2. Pull fresh GitHub data (gh) +- [ ] 3. Check out the PR head and diff against the base +- [ ] 4. Read the changed code through the audio review lenses +- [ ] 5. Generate the digest + comment queue +- [ ] 6. Classify findings P0-P3 and (optionally) draft replies +``` + +### Step 1 - Identify the PR + +`gh pr view <N> --repo NVIDIA-NeMo/Curator`. Confirm the diff touches audio +paths (`nemo_curator/stages/audio/`, `tutorials/audio/`, `tests/stages/audio/`, +`benchmarking/AUDIO_PROFILING.md` / `ALM_BENCHMARK.md`); if it does not, this +audio skill is the wrong lens. If a prior review digest for this PR exists, read +it first and record its head SHA as the baseline so commits and comments can be +marked NEW vs already-seen. + +### Step 2 - Pull fresh GitHub data + +Run the pull script (writes `pr<N>_*_latest.json` plus timestamped snapshots so +prior pulls are preserved for delta analysis): + +```bash +.cursor/skills/review-curator-audio-pr/scripts/pr_review_pull.sh <N> +``` + +It pulls six REST endpoints (`pr view`, `reviews`, inline `comments`, issue +`comments`, `files`, `commits`) and the GraphQL review threads, which carry the +`isResolved` / `isOutdated` flags the REST inline endpoint omits. + +### Step 3 - Diff the code + +```bash +gh pr checkout <N> # or: git fetch origin pull/<N>/head:pr-<N> +git diff origin/main...HEAD # changed files on this PR +``` + +Always review the actual diff, never the comment text alone. Use `main` as the +baseline for "how the audio stages already do this". + +### Step 4 - Read through the audio review lenses + +Apply the lenses in [recurring-themes.md](recurring-themes.md): stage contracts, +setup/teardown lifecycle, audio optional-dependency hygiene (vLLM, NeMo, +model utils), secret-safe logging, waveform/tensor memory and manifest +serialization, tarred/sharded audio I/O, sample-rate and metadata propagation, +streaming/throughput, tutorials/docs, tests/coverage, and PR reviewability. + +These build on the project's always-on rules in `.cursor/rules/` +(`processing-stage-patterns`, `composite-stage-patterns`, `task-patterns`, +`executors`, `resources-configuration`, `pipeline-structure`, +`modality-structure`, `coding-standards`) and the audio stage guide at +`nemo_curator/stages/audio/README.md`. Treat those as canonical and cite them +when a change deviates. + +### Step 5 - Generate the two files + +```bash +.cursor/skills/review-curator-audio-pr/scripts/build_digest.py <N> --today <YYYY-MM-DD> +``` + +The builder joins the pulled JSON and classifies each comment by thread status: +**OPEN** (actionable), **OUTDATED** (pre-dates the current head), **RESOLVED**. +The comment queue lists only OPEN root threads. + +### Step 6 - Findings and replies + +In the digest's findings section, classify each issue by severity: + +- **P0** - merge blocker: data loss / crash on a valid audio config (e.g. a + manifest writer that crashes on a kept waveform), a secret (HF token) leaked + into logs, or an import that breaks a supported install. +- **P1** - fix before merge: undeclared audio runtime dependencies, missing + required tests/coverage, environment-specific I/O in a reusable reader, a PR + too large to review safely. +- **P2** - should fix: duplicated pipeline construction, inefficient per-task + writer batching, incomplete throughput metrics, documented-but-unwired config. +- **P3** - nice to have: temporary model/compat workarounds needing a test + + comment, debug-only knobs (e.g. per-shard utterance caps) to document. + +When drafting replies, cite the exact `path:line-range` on the current head, +state whether the point is already addressed (with the commit SHA) or will be, +and keep each reply self-contained. See [templates.md](templates.md) section C. + +## Conventions to enforce (from CONTRIBUTING.md and rules) + +- **DCO sign-off** is required on every commit (`git commit -s`). +- **Tests** must accompany source changes; GPU-only audio tests use + `@pytest.mark.gpu`; the project enforces 80% coverage under `nemo_curator/`, + and `tests/stages/audio/` mirrors `nemo_curator/stages/audio/`. +- **NVIDIA Apache-2.0 copyright header** on every non-empty Python file. +- **Ruff** lint/format, 119-char lines; **loguru** for logging. +- Supported Python: 3.10-3.12. + +## Knowledge sources + +Full index in [reference.md](reference.md). Quick map: + +| Need | Source | +|------|--------| +| Audio stage guide | `nemo_curator/stages/audio/README.md` | +| Stage / pipeline contracts | `.cursor/rules/*.mdc`, `nemo_curator/stages/base.py` | +| Audio tutorials | `tutorials/audio/` | +| Backends / executors | `nemo_curator/backends/`, `nemo_curator/pipeline/pipeline.py` | +| Audio perf expectations | `benchmarking/AUDIO_PROFILING.md`, `benchmarking/ALM_BENCHMARK.md` | +| Contribution rules | `CONTRIBUTING.md` | +| Live PR data | `gh` (see scripts) | diff --git a/.cursor/skills/review-curator-audio-pr/recurring-themes.md b/.cursor/skills/review-curator-audio-pr/recurring-themes.md new file mode 100644 index 0000000000..38b5c826e9 --- /dev/null +++ b/.cursor/skills/review-curator-audio-pr/recurring-themes.md @@ -0,0 +1,142 @@ +# Review lenses - recurring themes in Curator audio PRs + +Apply each lens to the diff and cite exact `path:line` evidence. These extend +the always-on rules in `.cursor/rules/` and the audio guide at +`nemo_curator/stages/audio/README.md`; when a change violates a rule, link it by +name. + +## Stage contracts and reuse + +- An audio stage subclasses `ProcessingStage[InputTask, OutputTask]` with + explicit `inputs()` / `outputs()`, a meaningful `name`, and appropriate + `resources` / `batch_size` (set as class attributes / dataclass fields, never + as the read-only `_name` / `_resources` / `_batch_size` properties). +- GPU/model work (ASR, VAD, diarization, tagging inference) belongs in + `process_batch()` with a real `batch_size`; `__init__` stays lightweight (it + runs on the driver). +- Reuse existing audio readers, manifest writers, and shared model abstractions + instead of bespoke orchestration. Model a tarred/sharded reader as a + `CompositeStage` = a discovery stage plus a shard-reader fan-out. +- If `inputs()` does not declare a required key (e.g. `waveform_key`, + `sample_rate_key`), runtime validation will not check it, and a direct + `task.data[key]` access then raises an opaque `KeyError` instead of a clear + validation error. Declare required keys. +- Fan-out shard readers need deterministic child IDs and must propagate + `_metadata` / `_stage_perf`. First-stage shard discovery from an empty input + should constrain placement (e.g. one worker per node) so discovery is not + duplicated across every worker. + +## Setup / teardown lifecycle and audio dependencies + +- `setup_on_node()` = once-per-node prep (model download / shared cache). + `setup()` = per-worker init (model load, GPU allocation). `teardown()` frees + GPU memory. Don't conflate them. +- Import heavy/optional audio deps (vLLM, NeMo, model-specific utils, + soundfile/lhotse, fasttext) lazily inside `setup()` or behind a guarded + import, never unconditionally at module top level, when they ship only with an + optional extra. A top-level `import <optional_pkg>` breaks + `import nemo_curator.stages.audio...` on any install without that extra + (including CPU-only / Mac/ARM). +- Declare the full audio runtime dependency set in the appropriate + `pyproject.toml` optional extra (e.g. an audio CUDA extra); add an import + smoke test under that extra. Don't rely on an external Docker image to supply + dependencies a reusable audio stage needs. +- Avoid hard version pins (`pkg==x.y.z`) and global overrides for libraries + shared across modalities (`transformers`, `huggingface-hub`, `accelerate`); + prefer compatible ranges so other extras don't break. + +## Secret-safe logging + +- Never log a full resolved config; audio tutorials commonly accept an HF token + / cloud credentials. Redact secret-like keys before logging (token, password, + secret, credentials, `*_token`/`*_secret`), or log only a non-secret subset, + and add a test asserting the secret value never appears in logged output. + +## Waveform/tensor memory and manifest serialization + +- Drop waveform arrays / audio tensors from `task.data` once a stage no longer + needs them. A leaked numpy `ndarray` or torch tensor that reaches a JSON + manifest writer raises `TypeError: Object of type ndarray is not JSON + serializable` and can fail the entire shard/batch, or cause OOM at scale. + Strip by waveform key plus a duck-typed guard (`.shape`/`.dtype`), and wrap + `json.dumps` so a stray non-serializable value reports the offending key. +- Guard audio I/O edge cases: `tarfile.extractfile()` returns `None` for + non-regular members even after `isfile()` - check before `.read()`. Handle + resampling / channel-count / dtype mismatches explicitly. + +## Tarred / sharded audio I/O + +- Keep infrastructure assumptions (specific object-store URLs, auth env vars, + cluster-local paths) out of the generic reader/writer. Make remote access + pluggable: accept local paths, ordinary cloud URIs, or a user-provided + opener/resolver, and isolate any environment-specific support behind a small, + documented, optional component. For piped/streamed downloads use + failure-aware flags + retries so truncated streams fail loudly. +- Manifest/shard writers that loop per task inside `process_batch()` reopen the + output file repeatedly; group by shard and open each output once per batch. + Preserve `.done`/checkpoint markers so runs are resumable. + +## Streaming, throughput, benchmarking + +- Streaming execution with per-stage resource configuration is the default + model for scalable audio pipelines (see `executors.mdc`). +- Emit stage metrics through the supported metrics path (`metrics/`), JSON- + shaped. For multi-node audio runs include worker/node/GPU identity and latency + distributions (p50/p95/p99) plus audio-seconds-per-wall-second, not just + totals/averages. +- Standalone audio benchmark scripts belong in the `benchmarking/` flow + (`AUDIO_PROFILING.md`, `ALM_BENCHMARK.md`) with config entries and comparable + parameters - not freestanding. + +## Tutorials and docs + +- Audio tutorials (`tutorials/audio/`) must be runnable, minimal, and use public + APIs only - no private APIs, no machine-local absolute paths, no committed + noisy notebook output, no executor mismatch between a notebook and its CLI. + Prefer `uv` for environment setup if the tutorial standard does. +- Don't duplicate pipeline-construction logic between a tutorial entrypoint and + an `examples/` script; extract one shared builder both call. +- Every config key documented for users must actually reach a stage; validate + config-to-stage mapping and fail (or warn) on unused keys. +- Don't commit generated analysis/scratch docs into the source tree. + +## Tests and standards (CONTRIBUTING + coding-standards) + +- Source changes need accompanying tests; `tests/stages/audio/` mirrors + `nemo_curator/stages/audio/`. The project enforces 80% coverage under + `nemo_curator/`. GPU-only audio tests use `@pytest.mark.gpu`; reuse fixtures + under `tests/fixtures/audio/` and sample data under `tests/data/audio/`. +- Every non-empty Python file carries the NVIDIA Apache-2.0 copyright header. +- Ruff clean at 119-char lines; type annotations on functions; loguru for logs. +- Every commit is DCO signed-off. + +## PR size and reviewability + +- Large multi-stage audio PRs are hard to review and defend. Prefer splitting by + stage / module / tutorial / benchmark (e.g. reader+writer plumbing, then + inference stages, then postprocessing, then tutorial/config). If a single PR + must remain, organize it as if sliced: no duplicated builder logic, clear + module ownership, a reviewable commit sequence. + +## Smaller recurring nits + +- Missing copyright headers on new files. +- Top-level imports that should be lazy/guarded (very common on audio model + modules). +- Long inline prompt/config strings that should be module constants. +- Dataclass fields validated in `__post_init__` that could just be typed. +- Dead conditionals / redundant truthiness checks. +- Pytests added for tutorials that don't warrant them. +- Debug-only knobs (e.g. `max_utterances_per_shard`) presented as features - + document them as debug throttles. +- `trust_remote_code=True` hardcoded on a model load - expose as a parameter so + security-conscious users can opt out. + +## Severity mapping + +| Severity | Meaning | +|----------|---------| +| P0 | Merge blocker: crash/data-loss on a valid audio config, secret leaked to logs, import breaks a supported install | +| P1 | Fix before merge: undeclared audio deps, missing required tests/coverage, env-specific I/O in reusable reader, PR too large | +| P2 | Should fix: duplicated pipeline construction, inefficient writer batching, incomplete throughput metrics, documented-but-unwired config | +| P3 | Nice to have: temporary model/compat workaround needing test+comment, debug knob docs | diff --git a/.cursor/skills/review-curator-audio-pr/reference.md b/.cursor/skills/review-curator-audio-pr/reference.md new file mode 100644 index 0000000000..b6ccf9ee7a --- /dev/null +++ b/.cursor/skills/review-curator-audio-pr/reference.md @@ -0,0 +1,105 @@ +# Reference - NeMo Curator audio PR review + +All paths are relative to the repository root. The skill is self-contained: it +relies only on this repository plus the GitHub CLI (`gh`). Scope: the **audio** +modality. + +## Primary audio guide + +- `nemo_curator/stages/audio/README.md` - audio stage contracts, `process` vs + `process_batch`, lazy/optional imports, metrics. Read this first for any + audio review. + +## Canonical contracts (`.cursor/rules/`) + +Always-on rules describing how stages and pipelines must be written. Cite them +when a PR deviates. + +| Rule | Covers | +|------|--------| +| `processing-stage-patterns.mdc` | `ProcessingStage` subclassing, `inputs()`/`outputs()`, `process()`, lifecycle hooks | +| `composite-stage-patterns.mdc` | `CompositeStage` decomposition (e.g. a reader = discovery + shard fan-out) | +| `task-patterns.mdc` | `Task` / typed task subclasses, metadata flow | +| `executors.mdc` | Xenna / Ray Data backends and when to use each | +| `resources-configuration.mdc` | `Resources`, `batch_size`, GPU allocation | +| `pipeline-structure.mdc` | Pipeline assembly | +| `modality-structure.mdc` | Where modality code/tests/tutorials live | +| `coding-standards.mdc` | Ruff, copyright header, loguru, pytest, Python versions | + +## Audio stage map (diff ground truth) + +`nemo_curator/stages/audio/` is organized as: + +| Area | Path | +|------|------| +| Base contract | `nemo_curator/stages/base.py` | +| I/O (tarred/sharded readers, manifest writers) | `nemo_curator/stages/audio/io/` | +| Preprocessing | `nemo_curator/stages/audio/preprocessing/` | +| Inference: ASR / VAD / speaker diarization | `nemo_curator/stages/audio/inference/{asr,vad,speaker_diarization}/` | +| Filtering (band, SigMOS, etc.) | `nemo_curator/stages/audio/filtering/` | +| Segmentation (speaker separation) | `nemo_curator/stages/audio/segmentation/` | +| Tagging (inference + text) | `nemo_curator/stages/audio/tagging/` | +| Postprocessing | `nemo_curator/stages/audio/postprocessing/` | +| Metrics / throughput | `nemo_curator/stages/audio/metrics/` | +| Datasets (fleurs, readspeech) | `nemo_curator/stages/audio/datasets/` | +| ALM (audio-language pretrain) | `nemo_curator/stages/audio/alm/` | +| Composite pipelines | `nemo_curator/stages/audio/advanced_pipelines/` | + +Backends (how stages run): `nemo_curator/backends/` (`base.py`, `xenna/`, +`ray_data/`, `ray_actor_pool/`); pipeline assembly: +`nemo_curator/pipeline/pipeline.py`; task types: `nemo_curator/tasks/`. + +## Audio tutorials + +`tutorials/audio/` - `alm/`, `audio_pretrain/`, `callhome_diar/`, `fleurs/`, +`readspeech/`, `single_speaker_filter/`, `tagging/`. Tutorials must be runnable, +use public APIs only, and not duplicate pipeline-building logic that belongs in +a shared builder. + +## Audio performance / benchmarking + +- `benchmarking/README.md` - how benchmarks are wired and run. +- `benchmarking/AUDIO_PROFILING.md` - audio benchmark / perf expectations. +- `benchmarking/ALM_BENCHMARK.md` - ALM pipeline benchmark. + +Standalone audio benchmark scripts should be integrated into this flow, not left +freestanding. + +## Audio tests + +`tests/stages/audio/` mirrors `nemo_curator/stages/audio/` (io, inference, +filtering, segmentation, tagging, postprocessing, preprocessing, metrics, +datasets, alm, advanced_pipelines). Audio fixtures live under +`tests/fixtures/audio/` and `tests/data/audio/`. GPU-only tests are marked +`@pytest.mark.gpu`. + +## Contribution rules + +- `CONTRIBUTING.md` - DCO sign-off (`git commit -s`), PR process, test + requirements, coverage threshold, copyright header. + +## GitHub data via `gh` + +`scripts/pr_review_pull.sh <N>` pulls these into the scratch outdir: + +| File | Endpoint | +|------|----------| +| `pr<N>_gh_latest.json` | `gh pr view <N> --json ...` | +| `pr<N>_reviews_latest.json` | `repos/:owner/:repo/pulls/<N>/reviews` | +| `pr<N>_review_comments_latest.json` | `repos/:owner/:repo/pulls/<N>/comments` (inline) | +| `pr<N>_issue_comments_latest.json` | `repos/:owner/:repo/issues/<N>/comments` | +| `pr<N>_files_latest.json` | `repos/:owner/:repo/pulls/<N>/files` | +| `pr<N>_commits_latest.json` | `repos/:owner/:repo/pulls/<N>/commits` | +| `pr<N>_review_threads_latest.json` | GraphQL `pullRequest.reviewThreads` (isResolved/isOutdated) | + +The REST inline-comments endpoint does not expose resolve/outdate state; the +GraphQL thread payload does. The builder joins them by comment `databaseId` +(== REST `id`), falling back to a `(path, body-prefix)` match when an older +thread dump lacks `databaseId`. + +## Scripts + +- `scripts/pr_review_pull.sh <N> [--outdir DIR] [--repo OWNER/REPO]` - fetch. +- `scripts/build_digest.py <N> [--outdir DIR] [--today YYYY-MM-DD] [--prev-head SHA] [--baseline-ts TS]` - render the digest + comment queue. + +Default outdir: `.curator-pr-review/` (scratch; safe to delete or gitignore). diff --git a/.cursor/skills/review-curator-pr/scripts/README.md b/.cursor/skills/review-curator-audio-pr/scripts/README.md similarity index 69% rename from .cursor/skills/review-curator-pr/scripts/README.md rename to .cursor/skills/review-curator-audio-pr/scripts/README.md index 17f8f23d42..b729dee15c 100644 --- a/.cursor/skills/review-curator-pr/scripts/README.md +++ b/.cursor/skills/review-curator-audio-pr/scripts/README.md @@ -1,4 +1,4 @@ -# Scripts - review-curator-pr +# Scripts - review-curator-audio-pr Both require the GitHub CLI (`gh`) authenticated against github.com. They write to a scratch directory (default `.curator-pr-review/`) that is safe to delete; @@ -7,7 +7,7 @@ do not commit its contents. ## pr_review_pull.sh ```bash -.cursor/skills/review-curator-pr/scripts/pr_review_pull.sh <PR_NUMBER> [--outdir DIR] [--repo OWNER/REPO] +.cursor/skills/review-curator-audio-pr/scripts/pr_review_pull.sh <PR_NUMBER> [--outdir DIR] [--repo OWNER/REPO] ``` Pulls six REST endpoints (`pr view`, `reviews`, inline `comments`, issue @@ -17,7 +17,7 @@ Pulls six REST endpoints (`pr view`, `reviews`, inline `comments`, issue ## build_digest.py ```bash -.cursor/skills/review-curator-pr/scripts/build_digest.py <PR_NUMBER> [--outdir DIR] [--today YYYY-MM-DD] [--prev-head SHA] [--baseline-ts TS] +.cursor/skills/review-curator-audio-pr/scripts/build_digest.py <PR_NUMBER> [--outdir DIR] [--today YYYY-MM-DD] [--prev-head SHA] [--baseline-ts TS] ``` Joins the `pr<N>_*_latest.json` files and writes @@ -29,4 +29,5 @@ Joins the `pr<N>_*_latest.json` files and writes The thread join uses comment `databaseId` when present, else a (path, body-prefix) fallback, so both GraphQL thread-dump shapes classify -correctly. +correctly. The scripts fetch any PR; the audio focus is applied in the review +lenses (see ../recurring-themes.md). diff --git a/.cursor/skills/review-curator-pr/scripts/build_digest.py b/.cursor/skills/review-curator-audio-pr/scripts/build_digest.py similarity index 100% rename from .cursor/skills/review-curator-pr/scripts/build_digest.py rename to .cursor/skills/review-curator-audio-pr/scripts/build_digest.py diff --git a/.cursor/skills/review-curator-pr/scripts/pr_review_pull.sh b/.cursor/skills/review-curator-audio-pr/scripts/pr_review_pull.sh similarity index 100% rename from .cursor/skills/review-curator-pr/scripts/pr_review_pull.sh rename to .cursor/skills/review-curator-audio-pr/scripts/pr_review_pull.sh diff --git a/.cursor/skills/review-curator-pr/templates.md b/.cursor/skills/review-curator-audio-pr/templates.md similarity index 100% rename from .cursor/skills/review-curator-pr/templates.md rename to .cursor/skills/review-curator-audio-pr/templates.md diff --git a/.cursor/skills/review-curator-pr/SKILL.md b/.cursor/skills/review-curator-pr/SKILL.md deleted file mode 100644 index cf6a7eb657..0000000000 --- a/.cursor/skills/review-curator-pr/SKILL.md +++ /dev/null @@ -1,140 +0,0 @@ ---- -name: review-curator-pr -description: >- - Review NVIDIA-NeMo/Curator pull requests against the project's stage - contracts and contribution standards. Pulls fresh PR data with the GitHub - CLI, diffs the changed code, applies Curator's documented review lenses, and - emits a structured review digest plus a paste-ready comment queue. Use when a - contributor or maintainer asks to review, re-review, or triage a Curator PR - (e.g. "review PR 1967", "what's still open on PR 1898", "triage this PR"), or - to draft replies to reviewer comments. ---- - -# Review NeMo Curator PRs - -Produce a defensible, reproducible review of an `NVIDIA-NeMo/Curator` PR from a -fresh GitHub pull plus the code diff. The audience is the PR author (who must -respond to reviewers and land the change) and maintainers triaging the queue. - -Requirements: the GitHub CLI (`gh`) authenticated against `github.com`, and a -local checkout of this repository (the one containing this skill). - -## Two deliverables (always) - -Each review run writes two Markdown files into a scratch directory (default -`.curator-pr-review/`, override with `--outdir`): - -1. `curator_pr<N>_fresh_review_<YYYY_MM_DD>.md` - full digest: PR state, - commits, changed-file table, reviews, every inline comment grouped by file - with OPEN / OUTDATED / RESOLVED status, and issue comments. -2. `curator_pr<N>_github_comment_queue_<YYYY_MM_DD>.md` - only the OPEN root - threads that need an author reply or code change, paste-ready, with stale - threads listed at the bottom for traceability. - -See [templates.md](templates.md) for the exact section layout. Keeping the -structure stable makes diffs across review passes mechanical, so a re-review -only surfaces what changed. - -## Workflow - -``` -- [ ] 1. Identify the PR number -- [ ] 2. Pull fresh GitHub data (gh) -- [ ] 3. Check out the PR head and diff against the base -- [ ] 4. Read the changed code through the review lenses -- [ ] 5. Generate the digest + comment queue -- [ ] 6. Classify findings P0-P3 and (optionally) draft replies -``` - -### Step 1 - Identify the PR - -Confirm the target: `gh pr view <N> --repo NVIDIA-NeMo/Curator`. If a prior -review digest for this PR exists, read it first and record its head SHA as the -baseline so commits and comments can be marked NEW vs already-seen. - -### Step 2 - Pull fresh GitHub data - -Run the pull script (writes `pr<N>_*_latest.json` plus timestamped snapshots so -prior pulls are preserved for delta analysis): - -```bash -.cursor/skills/review-curator-pr/scripts/pr_review_pull.sh <N> -``` - -It pulls six REST endpoints (`pr view`, `reviews`, inline `comments`, issue -`comments`, `files`, `commits`) and the GraphQL review threads, which carry the -`isResolved` / `isOutdated` flags the REST inline endpoint omits. - -### Step 3 - Diff the code - -```bash -gh pr checkout <N> # or: git fetch origin pull/<N>/head:pr-<N> -git diff origin/main...HEAD # changed files on this PR -``` - -Always review the actual diff, never the comment text alone. Use `main` as the -baseline for "how the project already does this". - -### Step 4 - Read through the review lenses - -Apply the lenses in [recurring-themes.md](recurring-themes.md): stage -contracts, setup/teardown lifecycle, dependency and optional-extra hygiene, -secret-safe logging, memory/serialization of large payloads, streaming and -performance, tutorials/docs, tests/coverage, and PR size/reviewability. - -These build on the project's own always-on rules in `.cursor/rules/` -(`processing-stage-patterns`, `composite-stage-patterns`, `task-patterns`, -`executors`, `resources-configuration`, `pipeline-structure`, -`modality-structure`, `coding-standards`) - treat those as the canonical -contract and cite them when a change deviates. - -### Step 5 - Generate the two files - -```bash -.cursor/skills/review-curator-pr/scripts/build_digest.py <N> --today <YYYY-MM-DD> -``` - -The builder joins the pulled JSON and classifies each comment by thread status: -**OPEN** (actionable), **OUTDATED** (pre-dates the current head), **RESOLVED**. -The comment queue lists only OPEN root threads. - -### Step 6 - Findings and replies - -In the digest's findings section, classify each issue by severity: - -- **P0** - merge blocker: data loss / crash on a valid config, a secret leaked - into logs, or an import that breaks a supported install. -- **P1** - fix before merge: undeclared runtime dependencies, missing required - tests/coverage, environment-specific code in a reusable path, a PR too large - to review safely. -- **P2** - should fix: duplicated logic, inefficient batch handling, incomplete - metrics/observability, documented-but-unwired config. -- **P3** - nice to have: temporary workarounds that need a test + comment, - debug-only knobs to document. - -When drafting replies, cite the exact `path:line-range` on the current head, -state whether the point is already addressed (with the commit SHA) or will be, -and keep each reply self-contained. See [templates.md](templates.md) section C. - -## Conventions to enforce (from CONTRIBUTING.md and rules) - -- **DCO sign-off** is required on every commit (`git commit -s`). -- **Tests** must accompany source changes; GPU-only tests use - `@pytest.mark.gpu`; the project enforces 80% coverage under `nemo_curator/`, - and `tests/` mirrors the `nemo_curator/` layout. -- **NVIDIA Apache-2.0 copyright header** on every non-empty Python file. -- **Ruff** lint/format, 119-char lines; **loguru** for logging. -- Supported Python: 3.10-3.12. - -## Knowledge sources - -Full index in [reference.md](reference.md). Quick map: - -| Need | Source | -|------|--------| -| Stage / pipeline contracts | `.cursor/rules/*.mdc`, `nemo_curator/stages/base.py` | -| Modality guides | `nemo_curator/stages/<modality>/README.md`, `tutorials/<modality>/README.md` | -| Backends / executors | `nemo_curator/backends/`, `nemo_curator/pipeline/pipeline.py` | -| Perf expectations | `benchmarking/` | -| Contribution rules | `CONTRIBUTING.md` | -| Live PR data | `gh` (see scripts) | diff --git a/.cursor/skills/review-curator-pr/recurring-themes.md b/.cursor/skills/review-curator-pr/recurring-themes.md deleted file mode 100644 index d0599d2276..0000000000 --- a/.cursor/skills/review-curator-pr/recurring-themes.md +++ /dev/null @@ -1,127 +0,0 @@ -# Review lenses - recurring themes in Curator PRs - -Apply each lens to the diff and cite exact `path:line` evidence. These extend -the always-on rules in `.cursor/rules/`; when a change violates a rule, link the -rule by name. Most points are modality-agnostic; audio examples are -illustrative. - -## Stage contracts and reuse - -- A stage subclasses `ProcessingStage[InputTask, OutputTask]` with explicit - `inputs()` / `outputs()`, a meaningful `name`, and appropriate `resources` / - `batch_size` (set as class attributes / dataclass fields, never as the - read-only `_name` / `_resources` / `_batch_size` properties). -- GPU or vectorizable work belongs in `process_batch()` with a real - `batch_size`; `__init__` stays lightweight (it runs on the driver). -- Reuse existing readers, writers, and shared model abstractions instead of - bespoke orchestration. Use `CompositeStage` to present several internal stages - as one user-facing stage. -- If `inputs()` does not declare a required key, runtime validation will not - check it, and a direct `task.data[key]` access then raises an opaque - `KeyError` instead of a clear validation error. Declare required keys. -- Fan-out stages (one task -> many) need deterministic child IDs and propagated - `_metadata` / `_stage_perf`. First-stage discovery from an empty input should - constrain placement so work is not duplicated across every worker. - -## Setup / teardown lifecycle and dependencies - -- `setup_on_node()` = once-per-node prep (e.g. model downloads / shared cache). - `setup()` = per-worker init (model load, GPU allocation). `teardown()` frees - resources. Don't conflate them. -- Import heavy or optional dependencies lazily inside `setup()` (or behind a - guarded import), never unconditionally at module top level, when the dependency - ships only with an optional extra. A top-level `import <optional_pkg>` breaks - `import nemo_curator...` on any install without that extra. -- Declare the full runtime dependency set in the appropriate `pyproject.toml` - optional extra; add an import smoke test under that extra. Don't rely on an - external Docker image to supply dependencies a reusable stage needs. -- Avoid hard version pins (`pkg==x.y.z`) and global overrides for libraries - shared across modalities (e.g. `transformers`, `huggingface-hub`, - `accelerate`); prefer compatible ranges so other extras don't break. - -## Secret-safe logging - -- Never log a full resolved config object; it can contain tokens / credentials - (e.g. an HF token, access keys). Redact secret-like keys before logging, or - log only a non-secret subset, and add a test asserting the secret value never - appears in logged output. - -## Memory and serialization of large payloads - -- Drop large in-memory payloads (waveform arrays, image/video tensors) from - `task.data` once a stage no longer needs them. Leaked arrays reach downstream - JSON writers and raise `TypeError: Object of type ndarray is not JSON - serializable` (which can fail an entire batch) or cause memory blowups. Strip - by key plus a duck-typed guard, and wrap `json.dumps` so a stray - non-serializable value reports the offending key instead of crashing. -- Guard external I/O edge cases (e.g. `tarfile.extractfile()` returns `None` for - non-regular members even after `isfile()`; check before `.read()`). - -## Streaming, performance, benchmarking - -- Streaming execution with per-stage resource configuration is the default model - for scalable pipelines (see `executors.mdc`). -- Emit stage metrics through the supported metrics path; keep records - JSON-shaped. For multi-node analysis, include worker/node/GPU identity and - latency distributions (p50/p95/p99), not just totals/averages. -- Writers that loop per task inside `process_batch()` reopen output files - repeatedly; group by output target and open once per batch. -- Standalone benchmark scripts belong in the `benchmarking/` flow with config - entries and comparable parameters - not freestanding. - -## Environment-specific code in reusable paths - -- Keep infrastructure assumptions (specific object-store URLs, auth env vars, - cluster-specific paths) out of generic reader/writer code. Make remote access - pluggable: accept local paths, ordinary cloud URIs, or a user-provided - opener/resolver, and isolate any environment-specific support behind a small, - documented, optional component. - -## Tutorials and docs - -- Tutorials must be runnable, minimal, and use public APIs only - no private - APIs, no machine-local absolute paths, no committed noisy notebook output, no - executor mismatch between a notebook and its CLI. Prefer `uv` for environment - setup if the tutorial standard does. -- Don't duplicate pipeline-construction logic between a tutorial entrypoint and - an `examples/` script; extract one shared builder both call. -- Every config key documented for users must actually reach a stage; validate - config-to-stage mapping and fail (or warn) on unused keys. -- Don't commit generated analysis/scratch docs into the source tree. - -## Tests and standards (CONTRIBUTING + coding-standards) - -- Source changes need accompanying tests; `tests/` mirrors `nemo_curator/`. The - project enforces 80% coverage under `nemo_curator/`. GPU-only tests use - `@pytest.mark.gpu`. -- Every non-empty Python file carries the NVIDIA Apache-2.0 copyright header. -- Ruff clean at 119-char lines; type annotations on functions; loguru for logs. -- Every commit is DCO signed-off. - -## PR size and reviewability - -- Large multi-purpose PRs are hard to review and to defend. Prefer splitting by - stage / module / tutorial / benchmark. If a single PR must remain, organize it - as if sliced: no duplicated builder logic, clear module ownership, and a - reviewable commit sequence. - -## Smaller recurring nits - -- Missing copyright headers on new files. -- Top-level imports that should be lazy/guarded. -- Long inline strings (prompts, configs) that should be module constants. -- Dataclass fields validated in `__post_init__` that could just be typed. -- Dead conditionals / redundant truthiness checks. -- Tests added for tutorials that don't warrant them. -- Debug-only knobs presented as features - document them as debug throttles. -- `trust_remote_code=True` (or similar) hardcoded - expose as a parameter so - security-conscious users can opt out. - -## Severity mapping - -| Severity | Meaning | -|----------|---------| -| P0 | Merge blocker: crash/data-loss on a valid config, secret leaked to logs, import breaks a supported install | -| P1 | Fix before merge: undeclared deps, missing required tests/coverage, env-specific code in reusable path, PR too large | -| P2 | Should fix: duplicated logic, inefficient batching, incomplete metrics, documented-but-unwired config | -| P3 | Nice to have: temporary workaround needing test+comment, debug knob docs | diff --git a/.cursor/skills/review-curator-pr/reference.md b/.cursor/skills/review-curator-pr/reference.md deleted file mode 100644 index 40bdd0c0cc..0000000000 --- a/.cursor/skills/review-curator-pr/reference.md +++ /dev/null @@ -1,86 +0,0 @@ -# Reference - NeMo Curator PR review - -All paths are relative to the repository root. The skill is self-contained: it -relies only on this repository plus the GitHub CLI (`gh`). - -## Canonical contracts (`.cursor/rules/`) - -These always-on rules are the authoritative description of how stages and -pipelines must be written. Cite them when a PR deviates. - -| Rule | Covers | -|------|--------| -| `processing-stage-patterns.mdc` | `ProcessingStage` subclassing, `inputs()`/`outputs()`, `process()`, lifecycle hooks | -| `composite-stage-patterns.mdc` | `CompositeStage` decomposition | -| `task-patterns.mdc` | `Task` / typed task subclasses, metadata flow | -| `executors.mdc` | Xenna / Ray Data backends and when to use each | -| `resources-configuration.mdc` | `Resources`, `batch_size`, GPU allocation | -| `pipeline-structure.mdc` | Pipeline assembly | -| `modality-structure.mdc` | Where modality code/tests/tutorials live | -| `coding-standards.mdc` | Ruff, copyright header, loguru, pytest, Python versions | - -## Framework code (diff ground truth) - -- `nemo_curator/stages/base.py` - `ProcessingStage` base class. -- `nemo_curator/backends/` - `base.py`, `xenna/`, `ray_data/`, - `ray_actor_pool/`; adapters define how stages run on each backend. -- `nemo_curator/pipeline/pipeline.py` - pipeline construction and guards. -- `nemo_curator/tasks/` - task types. - -## Modality guides - -| Modality | Stage README | Tutorials | -|----------|--------------|-----------| -| audio | `nemo_curator/stages/audio/README.md` | `tutorials/audio/` | -| interleaved | `nemo_curator/stages/interleaved/README.md` | `tutorials/interleaved/` | -| text / image / video / math / synthetic | (stage code under `nemo_curator/stages/`) | `tutorials/{text,image,video,math,synthetic}/` | - -## Performance / benchmarking - -- `benchmarking/README.md` - how benchmarks are wired and run. -- `benchmarking/AUDIO_PROFILING.md`, `benchmarking/ALM_BENCHMARK.md` - modality - perf expectations. Standalone benchmark scripts should be integrated here, not - left freestanding. - -## Contribution rules - -- `CONTRIBUTING.md` - DCO sign-off (`git commit -s`), PR process, test - requirements, coverage threshold, copyright header. - -## GitHub data via `gh` - -`scripts/pr_review_pull.sh <N>` pulls these into the scratch outdir: - -| File | Endpoint | -|------|----------| -| `pr<N>_gh_latest.json` | `gh pr view <N> --json ...` | -| `pr<N>_reviews_latest.json` | `repos/:owner/:repo/pulls/<N>/reviews` | -| `pr<N>_review_comments_latest.json` | `repos/:owner/:repo/pulls/<N>/comments` (inline) | -| `pr<N>_issue_comments_latest.json` | `repos/:owner/:repo/issues/<N>/comments` | -| `pr<N>_files_latest.json` | `repos/:owner/:repo/pulls/<N>/files` | -| `pr<N>_commits_latest.json` | `repos/:owner/:repo/pulls/<N>/commits` | -| `pr<N>_review_threads_latest.json` | GraphQL `pullRequest.reviewThreads` (isResolved/isOutdated) | - -The REST inline-comments endpoint does not expose resolve/outdate state; the -GraphQL thread payload does. The builder joins them by comment `databaseId` -(== REST `id`), falling back to a `(path, body-prefix)` match when an older -thread dump lacks `databaseId`. - -GraphQL query used by the pull script: - -```graphql -query($owner:String!,$repo:String!,$pr:Int!){ - repository(owner:$owner,name:$repo){ - pullRequest(number:$pr){ - reviewThreads(first:100){ nodes{ - id isResolved isOutdated isCollapsed line originalLine path - comments(first:50){ nodes{ databaseId } } - }}}}} -``` - -## Scripts - -- `scripts/pr_review_pull.sh <N> [--outdir DIR] [--repo OWNER/REPO]` - fetch. -- `scripts/build_digest.py <N> [--outdir DIR] [--today YYYY-MM-DD] [--prev-head SHA] [--baseline-ts TS]` - render the digest + comment queue. - -Default outdir: `.curator-pr-review/` (scratch; safe to delete or gitignore). From 4de281f41863aa8bb39443a0bd7e70e7f7119083 Mon Sep 17 00:00:00 2001 From: "aaftaabv@gmail.com" <aaftaabv@gmail.com> Date: Fri, 5 Jun 2026 14:42:17 +0530 Subject: [PATCH 03/10] Clarify review-curator-audio-pr is a reviewer tool for reviewing others PRs Reframe SKILL.md, templates.md, scripts and generated digest/threads output so it is unambiguous the skill is used by a reviewer to review someone elses audio Curator PR, not by an author responding to reviewers. Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com> --- .../skills/review-curator-audio-pr/SKILL.md | 132 +++++++++--------- .../review-curator-audio-pr/reference.md | 2 +- .../review-curator-audio-pr/scripts/README.md | 11 +- .../scripts/build_digest.py | 26 ++-- .../review-curator-audio-pr/templates.md | 104 +++++++------- 5 files changed, 145 insertions(+), 130 deletions(-) diff --git a/.cursor/skills/review-curator-audio-pr/SKILL.md b/.cursor/skills/review-curator-audio-pr/SKILL.md index 646902b0fb..345bc07af7 100644 --- a/.cursor/skills/review-curator-audio-pr/SKILL.md +++ b/.cursor/skills/review-curator-audio-pr/SKILL.md @@ -1,57 +1,60 @@ --- name: review-curator-audio-pr description: >- - Review NVIDIA-NeMo/Curator audio-modality pull requests - audio stages - (readers/writers, preprocessing, VAD, ASR, speaker diarization, filtering, - segmentation, tagging, ALM), audio tutorials, and audio benchmarks. Pulls - fresh PR data with the GitHub CLI, diffs the changed code, applies Curator's - audio stage contracts and contribution standards, and emits a structured - review digest plus a paste-ready comment queue. Use when a contributor or - maintainer asks to review, re-review, or triage an audio Curator PR (e.g. - "review audio PR 1967", "what's still open on the diarization PR"), or to - draft replies to reviewer comments on an audio PR. + Review someone else's NVIDIA-NeMo/Curator audio-modality pull request. This is + a reviewer's tool: given a PR number, it pulls the PR, diffs the changed audio + code, applies Curator's audio stage contracts and contribution standards, + shows what other reviewers have already raised (so you don't duplicate), and + helps you produce a structured set of review findings (P0-P3) to post as + review comments. Use when you are assigned or pick up an audio Curator PR and + need to review it (e.g. "review audio PR 1967", "do a review of this Curator + audio PR", "what should I flag on PR 1898"). It is NOT for a PR author + responding to a review. --- # Review NeMo Curator audio PRs -Produce a defensible, reproducible review of an audio-modality -`NVIDIA-NeMo/Curator` PR from a fresh GitHub pull plus the code diff. Audience: -the PR author (who must respond to reviewers and land the change) and -maintainers triaging the audio queue. +**Use this when you are the reviewer and need to review someone else's audio +Curator PR.** Given a PR number, it helps you understand the change, see what +other reviewers already said, and write your own review. It does not help a PR +author reply to reviewers - it helps you *produce* the review. -This skill is scoped to the **audio** modality: code under -`nemo_curator/stages/audio/`, `tutorials/audio/`, audio tests under -`tests/stages/audio/`, and audio benchmarks. For non-audio PRs, the stage- -contract lenses still apply but the audio-specific guidance below will not. +Scope: the **audio** modality - code under `nemo_curator/stages/audio/`, +`tutorials/audio/`, audio tests under `tests/stages/audio/`, and audio +benchmarks. For non-audio PRs the stage-contract lenses still apply, but the +audio-specific guidance below will not. Requirements: the GitHub CLI (`gh`) authenticated against `github.com`, and a local checkout of this repository (the one containing this skill). -## Two deliverables (always) +## What you produce -Each review run writes two Markdown files into a scratch directory (default -`.curator-pr-review/`, override with `--outdir`): +Your review = a set of **findings** (P0-P3), each tied to a `path:line` on the +PR's current head, that you post as PR review comments. The skill writes two +helper files into a scratch directory (default `.curator-pr-review/`, override +with `--outdir`) to support that: -1. `curator_pr<N>_fresh_review_<YYYY_MM_DD>.md` - full digest: PR state, - commits, changed-file table, reviews, every inline comment grouped by file - with OPEN / OUTDATED / RESOLVED status, and issue comments. -2. `curator_pr<N>_github_comment_queue_<YYYY_MM_DD>.md` - only the OPEN root - threads that need an author reply or code change, paste-ready, with stale - threads listed at the bottom for traceability. +1. `curator_pr<N>_fresh_review_<YYYY_MM_DD>.md` - **your working digest**: PR + state, commits, changed-file table, and every existing review/comment grouped + by file with OPEN / OUTDATED / RESOLVED status. Read it to understand the PR + and to see what other reviewers already raised so you don't repeat them; add + your own findings at the bottom. +2. `curator_pr<N>_github_comment_queue_<YYYY_MM_DD>.md` - **prior open threads**: + a condensed list of the threads other reviewers left that are still + unresolved on the current head. Scan it before writing your own comments. -See [templates.md](templates.md) for the exact section layout. Keeping the -structure stable makes diffs across review passes mechanical, so a re-review -only surfaces what changed. +See [templates.md](templates.md) for the exact layout, including how to write up +your findings. ## Workflow ``` - [ ] 1. Identify the PR and confirm it touches audio paths - [ ] 2. Pull fresh GitHub data (gh) -- [ ] 3. Check out the PR head and diff against the base -- [ ] 4. Read the changed code through the audio review lenses -- [ ] 5. Generate the digest + comment queue -- [ ] 6. Classify findings P0-P3 and (optionally) draft replies +- [ ] 3. Check out the PR head and read the diff +- [ ] 4. Review the changed code through the audio review lenses +- [ ] 5. Generate the digest + prior-threads file for context +- [ ] 6. Write up your findings (P0-P3) and post them as review comments ``` ### Step 1 - Identify the PR @@ -59,14 +62,11 @@ only surfaces what changed. `gh pr view <N> --repo NVIDIA-NeMo/Curator`. Confirm the diff touches audio paths (`nemo_curator/stages/audio/`, `tutorials/audio/`, `tests/stages/audio/`, `benchmarking/AUDIO_PROFILING.md` / `ALM_BENCHMARK.md`); if it does not, this -audio skill is the wrong lens. If a prior review digest for this PR exists, read -it first and record its head SHA as the baseline so commits and comments can be -marked NEW vs already-seen. +audio skill is the wrong lens. ### Step 2 - Pull fresh GitHub data -Run the pull script (writes `pr<N>_*_latest.json` plus timestamped snapshots so -prior pulls are preserved for delta analysis): +Run the pull script (writes `pr<N>_*_latest.json` plus timestamped snapshots): ```bash .cursor/skills/review-curator-audio-pr/scripts/pr_review_pull.sh <N> @@ -74,66 +74,70 @@ prior pulls are preserved for delta analysis): It pulls six REST endpoints (`pr view`, `reviews`, inline `comments`, issue `comments`, `files`, `commits`) and the GraphQL review threads, which carry the -`isResolved` / `isOutdated` flags the REST inline endpoint omits. +`isResolved` / `isOutdated` flags the REST inline endpoint omits. You pull this +so you can see prior review activity, not because you own the PR. -### Step 3 - Diff the code +### Step 3 - Read the diff ```bash gh pr checkout <N> # or: git fetch origin pull/<N>/head:pr-<N> -git diff origin/main...HEAD # changed files on this PR +git diff origin/main...HEAD # the change you are reviewing ``` Always review the actual diff, never the comment text alone. Use `main` as the baseline for "how the audio stages already do this". -### Step 4 - Read through the audio review lenses +### Step 4 - Review through the audio lenses Apply the lenses in [recurring-themes.md](recurring-themes.md): stage contracts, -setup/teardown lifecycle, audio optional-dependency hygiene (vLLM, NeMo, -model utils), secret-safe logging, waveform/tensor memory and manifest -serialization, tarred/sharded audio I/O, sample-rate and metadata propagation, +setup/teardown lifecycle, audio optional-dependency hygiene (vLLM, NeMo, model +utils), secret-safe logging, waveform/tensor memory and manifest serialization, +tarred/sharded audio I/O, sample-rate and metadata propagation, streaming/throughput, tutorials/docs, tests/coverage, and PR reviewability. These build on the project's always-on rules in `.cursor/rules/` (`processing-stage-patterns`, `composite-stage-patterns`, `task-patterns`, `executors`, `resources-configuration`, `pipeline-structure`, `modality-structure`, `coding-standards`) and the audio stage guide at -`nemo_curator/stages/audio/README.md`. Treat those as canonical and cite them -when a change deviates. +`nemo_curator/stages/audio/README.md`. Cite these when the PR deviates. -### Step 5 - Generate the two files +### Step 5 - Generate the context files ```bash .cursor/skills/review-curator-audio-pr/scripts/build_digest.py <N> --today <YYYY-MM-DD> ``` -The builder joins the pulled JSON and classifies each comment by thread status: -**OPEN** (actionable), **OUTDATED** (pre-dates the current head), **RESOLVED**. -The comment queue lists only OPEN root threads. +The builder joins the pulled JSON and classifies each existing comment by thread +status: **OPEN** (still unresolved), **OUTDATED** (pre-dates the current head), +**RESOLVED**. Use the OPEN list to avoid re-raising points another reviewer +already made. -### Step 6 - Findings and replies +### Step 6 - Write up and post your findings -In the digest's findings section, classify each issue by severity: +Record each issue you found as a finding, classified by severity, and post them +as a PR review (inline comments for specific lines, a top-level summary for +overall verdict). Each finding should cite the exact `path:line-range` on the +current head and propose a concrete fix. - **P0** - merge blocker: data loss / crash on a valid audio config (e.g. a manifest writer that crashes on a kept waveform), a secret (HF token) leaked into logs, or an import that breaks a supported install. -- **P1** - fix before merge: undeclared audio runtime dependencies, missing - required tests/coverage, environment-specific I/O in a reusable reader, a PR - too large to review safely. -- **P2** - should fix: duplicated pipeline construction, inefficient per-task +- **P1** - should fix before merge: undeclared audio runtime dependencies, + missing required tests/coverage, environment-specific I/O in a reusable + reader, a PR too large to review safely. +- **P2** - worth fixing: duplicated pipeline construction, inefficient per-task writer batching, incomplete throughput metrics, documented-but-unwired config. - **P3** - nice to have: temporary model/compat workarounds needing a test + comment, debug-only knobs (e.g. per-shard utterance caps) to document. -When drafting replies, cite the exact `path:line-range` on the current head, -state whether the point is already addressed (with the commit SHA) or will be, -and keep each reply self-contained. See [templates.md](templates.md) section C. +See [templates.md](templates.md) section C for how to phrase review comments. -## Conventions to enforce (from CONTRIBUTING.md and rules) +## What to check the PR against (CONTRIBUTING.md and rules) -- **DCO sign-off** is required on every commit (`git commit -s`). -- **Tests** must accompany source changes; GPU-only audio tests use +A compliant audio PR must satisfy these; flag any that are missing: + +- **DCO sign-off** on every commit (`git commit -s`). +- **Tests** accompany source changes; GPU-only audio tests use `@pytest.mark.gpu`; the project enforces 80% coverage under `nemo_curator/`, and `tests/stages/audio/` mirrors `nemo_curator/stages/audio/`. - **NVIDIA Apache-2.0 copyright header** on every non-empty Python file. @@ -151,5 +155,5 @@ Full index in [reference.md](reference.md). Quick map: | Audio tutorials | `tutorials/audio/` | | Backends / executors | `nemo_curator/backends/`, `nemo_curator/pipeline/pipeline.py` | | Audio perf expectations | `benchmarking/AUDIO_PROFILING.md`, `benchmarking/ALM_BENCHMARK.md` | -| Contribution rules | `CONTRIBUTING.md` | +| Contribution rules to check against | `CONTRIBUTING.md` | | Live PR data | `gh` (see scripts) | diff --git a/.cursor/skills/review-curator-audio-pr/reference.md b/.cursor/skills/review-curator-audio-pr/reference.md index b6ccf9ee7a..42aac6f78f 100644 --- a/.cursor/skills/review-curator-audio-pr/reference.md +++ b/.cursor/skills/review-curator-audio-pr/reference.md @@ -100,6 +100,6 @@ thread dump lacks `databaseId`. ## Scripts - `scripts/pr_review_pull.sh <N> [--outdir DIR] [--repo OWNER/REPO]` - fetch. -- `scripts/build_digest.py <N> [--outdir DIR] [--today YYYY-MM-DD] [--prev-head SHA] [--baseline-ts TS]` - render the digest + comment queue. +- `scripts/build_digest.py <N> [--outdir DIR] [--today YYYY-MM-DD] [--prev-head SHA] [--baseline-ts TS]` - render the working digest + the prior-open-threads context file (you add your own findings). Default outdir: `.curator-pr-review/` (scratch; safe to delete or gitignore). diff --git a/.cursor/skills/review-curator-audio-pr/scripts/README.md b/.cursor/skills/review-curator-audio-pr/scripts/README.md index b729dee15c..d270082407 100644 --- a/.cursor/skills/review-curator-audio-pr/scripts/README.md +++ b/.cursor/skills/review-curator-audio-pr/scripts/README.md @@ -1,5 +1,9 @@ # Scripts - review-curator-audio-pr +Helpers for a **reviewer** reviewing someone else's audio Curator PR. They +pull the PR and summarise existing review activity as context; you write the +actual review findings. + Both require the GitHub CLI (`gh`) authenticated against github.com. They write to a scratch directory (default `.curator-pr-review/`) that is safe to delete; do not commit its contents. @@ -20,8 +24,11 @@ Pulls six REST endpoints (`pr view`, `reviews`, inline `comments`, issue .cursor/skills/review-curator-audio-pr/scripts/build_digest.py <PR_NUMBER> [--outdir DIR] [--today YYYY-MM-DD] [--prev-head SHA] [--baseline-ts TS] ``` -Joins the `pr<N>_*_latest.json` files and writes -`curator_pr<N>_fresh_review_<date>.md` + `curator_pr<N>_github_comment_queue_<date>.md`. +Joins the `pr<N>_*_latest.json` files and writes two context files: +`curator_pr<N>_fresh_review_<date>.md` (working digest: PR state, diff, and +existing reviews/comments with OPEN/OUTDATED/RESOLVED status, plus a placeholder +for your findings) and `curator_pr<N>_github_comment_queue_<date>.md` (the +still-open threads other reviewers left, so you don't duplicate them). - `--prev-head` sets the SHA in the "commits since prior reviewed head" header. - `--baseline-ts <TS>` (a prior pull's timestamp suffix) marks diff --git a/.cursor/skills/review-curator-audio-pr/scripts/build_digest.py b/.cursor/skills/review-curator-audio-pr/scripts/build_digest.py index 42f654d4e9..8c129103e1 100755 --- a/.cursor/skills/review-curator-audio-pr/scripts/build_digest.py +++ b/.cursor/skills/review-curator-audio-pr/scripts/build_digest.py @@ -160,7 +160,7 @@ def base(kind): p(d, f"| `{f['filename']}` | +{f['additions']} -{f['deletions']} | {f['status']} |") p(d, "") - p(d, "## Reviews", "") + p(d, "## Existing reviews (by other reviewers)", "") review_by_id = {r["id"]: r for r in reviews} for r in sorted(reviews, key=lambda x: x.get("submitted_at") or ""): marker = " (NEW)" if base_review_ids and r["id"] not in base_review_ids else "" @@ -172,7 +172,7 @@ def base(kind): p(d, "", shorten(body, 1500)) p(d, "") - p(d, "## Inline review comments", "") + p(d, "## Existing inline comments (by other reviewers)", "") p(d, f"Total: {len(review_comments)} comments across {idx['nthreads'] or '?'} threads.", "") by_status = {"OPEN": [], "OUTDATED": [], "RESOLVED": [], "ORPHAN": []} for c in review_comments: @@ -205,24 +205,32 @@ def base(kind): p(d, f" > {body}", "") p(d, "") - p(d, "## Top-level issue comments", "") + p(d, "## Existing issue comments (by other reviewers)", "") for c in sorted(issue_comments, key=lambda x: x["created_at"]): new = " NEW" if base_issue_ids and c["id"] not in base_issue_ids else "" p(d, f"### #{c['id']} by @{c['user']['login']} {c['created_at']}{new}", "") p(d, shorten(c.get("body") or "", 1500), "") + p(d, "## My findings (your review)", "") + p(d, "_Add your findings here as you review. Classify each P0-P3, cite " + "path:line on the current head, and propose a concrete fix. See " + "templates.md section C._", "") + p(d, "## Verdict", "") + p(d, "_APPROVE / COMMENT / REQUEST CHANGES + blockers._", "") + digest_path = outdir / f"curator_pr{pr}_fresh_review_{date_us}.md" digest_path.write_text("\n".join(d) + "\n", encoding="utf-8") print(f"wrote {digest_path} ({digest_path.stat().st_size} bytes)") # ----- COMMENT QUEUE (open root threads only) ----- q = [] - p(q, f"# Curator PR {pr} GitHub Comment Queue - {today}", "") + p(q, f"# Curator PR {pr} Open Review Threads - {today}", "") p(q, f"Review target: https://github.com/NVIDIA-NeMo/Curator/pull/{pr}", "") p(q, f"Current PR head: `{head_oid}`", "") - p(q, "Open inline review threads from the most recent reviewer pass that need an " - "author response or a code change. Outdated/resolved threads are listed at " - "the bottom for traceability only.", "") + p(q, "Threads other reviewers left that are still unresolved on the current " + "head. Scan these before adding your own comments so you do not duplicate, " + "and check whether the author addressed them. Stale (outdated/resolved) " + "threads are listed at the bottom for context only.", "") def is_root(c): return (thread_meta(idx, c) or {}).get("is_first_in_thread", True) @@ -233,10 +241,10 @@ def is_root(c): meta = thread_meta(idx, c) or {} line = c.get("line") or c.get("original_line") new = " NEW" if base_comment_ids and c["id"] not in base_comment_ids else "" - p(q, f"## Comment {i} - {c['path']}:{line} by @{c['user']['login']}{new}", "") + p(q, f"## Thread {i} - {c['path']}:{line} by @{c['user']['login']}{new}", "") p(q, f"Thread: {c['html_url']}", "") if meta.get("thread_comment_count", 1) > 1: - p(q, f"Thread has {meta['thread_comment_count']} comments - see digest for replies.", "") + p(q, f"Thread has {meta['thread_comment_count']} comments - see the digest for the full thread.", "") p(q, f"File: `{c['path']}`", "") p(q, f"Line: `{line}`", "") p(q, "Reviewer text:", "") diff --git a/.cursor/skills/review-curator-audio-pr/templates.md b/.cursor/skills/review-curator-audio-pr/templates.md index 0cc6a86db6..f789a09aad 100644 --- a/.cursor/skills/review-curator-audio-pr/templates.md +++ b/.cursor/skills/review-curator-audio-pr/templates.md @@ -1,13 +1,15 @@ # Output templates -`scripts/build_digest.py` emits both files automatically; these templates show -the exact structure so hand edits and the added findings/verdict narrative stay -consistent. Replace `<...>` placeholders. Keep the structure stable across -review passes so re-reviews diff cleanly. +You are the reviewer. `scripts/build_digest.py` generates files A and B as +context; you author the findings (section C) and post them as your review. +Replace `<...>` placeholders. Keep the structure stable across review passes so +a re-review diffs cleanly and only surfaces what changed. -## A. Fresh review digest +## A. Working digest (generated, then you append findings) -File: `curator_pr<N>_fresh_review_<YYYY_MM_DD>.md` +File: `curator_pr<N>_fresh_review_<YYYY_MM_DD>.md`. The generated part gives you +PR state, the diff, and all existing review activity. Add your own **Findings** +section at the end - that is your review. ```markdown # Curator PR <N> Fresh Review - <YYYY-MM-DD> @@ -18,117 +20,111 @@ Current PR head reviewed: `<headRefOid>` Base recorded by GitHub metadata: `<baseRefOid>` -Previous reviewed head: `<prior_head_sha>` # if a prior review exists - ## PR state at review time | Field | Value | |---|---| | state | <OPEN/CLOSED/MERGED> | -| isDraft | <bool> | | reviewDecision | <APPROVED/CHANGES_REQUESTED/none> | -| mergeable / mergeStateStatus | <...> / <...> | | changedFiles | <n> | | additions / deletions | +<a> / -<d> | | commits | <n> | | inline review comments total | <n> | | reviews submitted | <n> | -| top-level issue comments | <n> | | updatedAt | <iso> | -## Commits since prior reviewed head `<prior8>` (<k> new) +## Commits (<k>) ``` <sha8> <subject> (<date>) ``` -Files changed in current PR head (`git diff <base>..<head>`): +Files changed (`git diff <base>..<head>`): | File | +/- | status | |---|---|---| | `<path>` | +<a> -<d> | <added/modified> | -## Reviews +## Existing reviews (by other reviewers) ### #<id> by @<login> state=<STATE> commit=<sha8> submitted=<iso> <body or empty> -## Inline review comments +## Existing inline comments (by other reviewers) Total: <n> comments across <m> threads. -- OPEN: <n> +- OPEN: <n> # still unresolved - do NOT duplicate these - OUTDATED: <n> - RESOLVED: <n> -- ORPHAN: <n> ### `<path>` -- **#<id>** @<login> <iso> line=<L> commit=<sha8> status=**<OPEN/OUTDATED/RESOLVED>** thread_comments=<k> (<root/reply>) review_state=<STATE> +- **#<id>** @<login> <iso> line=<L> status=**<OPEN/OUTDATED/RESOLVED>** ... url: <html_url> - > <body, truncated, each line prefixed with "> "> - -## Top-level issue comments - -### #<id> by @<login> <iso> -<body, truncated> + > <body, truncated> -## Findings (reviewer analysis) +## My findings (your review) ### P0: <title> -<evidence: path:line on current head> - <why blocker> - <recommended fix + test> +<evidence: path:line on current head> - <why it's a blocker> - <concrete fix> ### P1 / P2 / P3: <title> <...> ## Verdict -<merge-ready or not, and the blockers> +<APPROVE / COMMENT / REQUEST CHANGES, and the blockers if any> ``` -## B. GitHub comment queue +## B. Prior open threads (generated context) -File: `curator_pr<N>_github_comment_queue_<YYYY_MM_DD>.md`. Only OPEN root -threads needing an author response or code change; stale threads at the bottom. +File: `curator_pr<N>_github_comment_queue_<YYYY_MM_DD>.md`. The still-unresolved +threads other reviewers already opened, so you can scan prior feedback before +adding your own. This is **context, not your output** - you are not replying to +these as the author; you read them to avoid duplicating and to check whether the +author addressed them. ```markdown -# Curator PR <N> GitHub Comment Queue - <YYYY-MM-DD> - -Review target: https://github.com/NVIDIA-NeMo/Curator/pull/<N> +# Curator PR <N> Open Review Threads - <YYYY-MM-DD> Current PR head: `<headRefOid>` -## Comment <i> - <path>:<line> by @<login> +Threads other reviewers left that are still unresolved on the current head. -Thread: <html_url> +## Thread <i> - <path>:<line> by @<login> -File: `<path>` - -Line: `<line>` - -Reviewer text: +Link: <html_url> ```text <verbatim reviewer body> ``` -## Stale (outdated/resolved) comments - +## Stale (outdated/resolved) threads - [<OUTDATED/RESOLVED>] `<path>:<line>` @<login> (<html_url>): <one-line body> ``` -## C. Reply drafting style +## C. Writing your review comments + +Turn each finding into a review comment you post on the PR. For each: -For each OPEN comment the user wants a paste-ready reply for: +- Anchor it to the exact `path:line-range` on the current head. +- Lead with severity (P0-P3) and a one-line statement of the problem. +- Explain *why* it matters (cite a `.cursor/rules/*.mdc` contract or + `stages/audio/README.md` when the change deviates). +- Propose a concrete fix, ideally with a `suggestion` block. +- Before posting, check the "OPEN" list in file A - if another reviewer already + raised it, don't repeat it (optionally +1 their thread instead). -- Acknowledge briefly. -- State status: already addressed (cite the commit SHA) or will-fix. -- Cite the exact `path:line-range` on the current head as evidence. -- Describe the concrete mechanism, not just "fixed". -- Keep each reply self-contained; the reviewer reads it in isolation. +Example review comment: -Example: +> **P0 (blocker).** `sharded_manifest_writer.py:101` calls `json.dumps(record)` +> while `record` can still hold the waveform `ndarray` when `keep_waveform=True`, +> which raises `TypeError: Object of type ndarray is not JSON serializable` and +> fails the whole shard. Drop the waveform key before serialising and guard +> `json.dumps` so a stray non-serialisable value names the offending key. +> Please also add a test under `tests/stages/audio/io/` covering +> `keep_waveform=True`. -> Thanks for flagging. Addressed on the current head (`<sha8>`): the writer now -> drops the waveform key before serialisation and guards `json.dumps`, so a -> non-serialisable value reports the offending key instead of failing the batch. -> Citation: `nemo_curator/stages/audio/io/sharded_manifest_writer.py:96-111`. +Then post the set as one PR review: inline comments for line-specific findings, +a top-level summary carrying the overall verdict (APPROVE / COMMENT / REQUEST +CHANGES). From eceb1c992d73bb7aeb965e7babbbcffb66246d2a Mon Sep 17 00:00:00 2001 From: "aaftaabv@gmail.com" <aaftaabv@gmail.com> Date: Fri, 5 Jun 2026 15:08:54 +0530 Subject: [PATCH 04/10] Add repo auto-detect, reviewer prompt, audio knowledge sources, and post-1608 PR corpus ensure_repo.sh reuses an existing checkout or shallow-clones (no full history). SKILL.md gains an explicit human-reviewer prompt and a repo-locate step. recurring-themes.md and reference.md are merged into knowledge-sources.md, with each review lens linked to the audio code, README section, .cursor/rules contract, and the developer-guide slides. Adds pull_audio_pr_corpus.sh + build_corpus.py to consolidate reviewer feedback from audio PRs after 1608 as pre-review context. Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com> --- .../skills/review-curator-audio-pr/SKILL.md | 127 ++++--- .../knowledge-sources.md | 309 ++++++++++++++++++ .../recurring-themes.md | 142 -------- .../review-curator-audio-pr/reference.md | 105 ------ .../review-curator-audio-pr/scripts/README.md | 43 ++- .../scripts/build_corpus.py | 159 +++++++++ .../scripts/ensure_repo.sh | 57 ++++ .../scripts/pull_audio_pr_corpus.sh | 84 +++++ 8 files changed, 718 insertions(+), 308 deletions(-) create mode 100644 .cursor/skills/review-curator-audio-pr/knowledge-sources.md delete mode 100644 .cursor/skills/review-curator-audio-pr/recurring-themes.md delete mode 100644 .cursor/skills/review-curator-audio-pr/reference.md create mode 100755 .cursor/skills/review-curator-audio-pr/scripts/build_corpus.py create mode 100755 .cursor/skills/review-curator-audio-pr/scripts/ensure_repo.sh create mode 100755 .cursor/skills/review-curator-audio-pr/scripts/pull_audio_pr_corpus.sh diff --git a/.cursor/skills/review-curator-audio-pr/SKILL.md b/.cursor/skills/review-curator-audio-pr/SKILL.md index 345bc07af7..30078000b0 100644 --- a/.cursor/skills/review-curator-audio-pr/SKILL.md +++ b/.cursor/skills/review-curator-audio-pr/SKILL.md @@ -2,14 +2,14 @@ name: review-curator-audio-pr description: >- Review someone else's NVIDIA-NeMo/Curator audio-modality pull request. This is - a reviewer's tool: given a PR number, it pulls the PR, diffs the changed audio - code, applies Curator's audio stage contracts and contribution standards, - shows what other reviewers have already raised (so you don't duplicate), and - helps you produce a structured set of review findings (P0-P3) to post as - review comments. Use when you are assigned or pick up an audio Curator PR and - need to review it (e.g. "review audio PR 1967", "do a review of this Curator - audio PR", "what should I flag on PR 1898"). It is NOT for a PR author - responding to a review. + a reviewer's tool: given a PR number, it locates (or shallow-clones) the repo, + pulls the PR, diffs the changed audio code, applies Curator's audio stage + contracts and contribution standards, shows what other reviewers have already + raised (so you don't duplicate), and helps you produce a structured set of + review findings (P0-P3) to post as review comments. Use when you are assigned + or pick up an audio Curator PR and need to review it (e.g. "review audio PR + 1967", "do a review of this Curator audio PR", "what should I flag on PR + 1898"). It is NOT for a PR author responding to a review. --- # Review NeMo Curator audio PRs @@ -24,8 +24,26 @@ Scope: the **audio** modality - code under `nemo_curator/stages/audio/`, benchmarks. For non-audio PRs the stage-contract lenses still apply, but the audio-specific guidance below will not. -Requirements: the GitHub CLI (`gh`) authenticated against `github.com`, and a -local checkout of this repository (the one containing this skill). +Requirements: the GitHub CLI (`gh`) authenticated against `github.com`, and +`git`. You do **not** need to pre-clone the repo - step 0 reuses any checkout +you already have and only shallow-clones +[NVIDIA-NeMo/Curator](https://github.com/NVIDIA-NeMo/Curator) if none is found. + +## The reviewer prompt + +A human reviewer invokes this skill with a prompt like: + +> **"Review audio Curator PR <N> with the review-curator-audio-pr skill. Find or +> shallow-clone the repo, pull the PR and prior review activity, apply the audio +> review lenses, and give me a review digest plus my own findings (P0-P3) with +> exact `path:line` evidence and concrete fixes, ready to post as PR comments. +> Skip anything other reviewers already raised."** + +Variants: *"re-review PR <N> and show only what changed since the last review"*, +*"what should I flag on the diarization PR <N>?"*, *"triage open audio PRs and +review the smallest one first"*. First-time in a repo, you can prime context +with *"build the post-#1608 audio review corpus first, then review PR <N>"* +(see [knowledge-sources.md](knowledge-sources.md) section 4). ## What you produce @@ -36,70 +54,80 @@ with `--outdir`) to support that: 1. `curator_pr<N>_fresh_review_<YYYY_MM_DD>.md` - **your working digest**: PR state, commits, changed-file table, and every existing review/comment grouped - by file with OPEN / OUTDATED / RESOLVED status. Read it to understand the PR - and to see what other reviewers already raised so you don't repeat them; add - your own findings at the bottom. + by file with OPEN / OUTDATED / RESOLVED status, plus a placeholder for your + findings. Read it to understand the PR and to see what other reviewers already + raised so you don't repeat them; record your findings at the bottom. 2. `curator_pr<N>_github_comment_queue_<YYYY_MM_DD>.md` - **prior open threads**: a condensed list of the threads other reviewers left that are still unresolved on the current head. Scan it before writing your own comments. -See [templates.md](templates.md) for the exact layout, including how to write up -your findings. +See [templates.md](templates.md) for the exact layout and how to phrase your +review comments. The review lenses and every doc/code reference live in +[knowledge-sources.md](knowledge-sources.md). ## Workflow ``` +- [ ] 0. Locate or shallow-clone the Curator repo - [ ] 1. Identify the PR and confirm it touches audio paths - [ ] 2. Pull fresh GitHub data (gh) -- [ ] 3. Check out the PR head and read the diff +- [ ] 3. Read the diff - [ ] 4. Review the changed code through the audio review lenses - [ ] 5. Generate the digest + prior-threads file for context - [ ] 6. Write up your findings (P0-P3) and post them as review comments ``` +### Step 0 - Locate or shallow-clone the repo + +Don't clone unnecessarily. This helper checks whether you are already inside a +Curator checkout, searches the current directory tree for one, and only then +shallow-clones (`--depth 1`, no full history): + +```bash +eval "$(scripts/ensure_repo.sh | tail -1)" # sets CURATOR_REPO=<path> +cd "$CURATOR_REPO" +``` + +If you are already in the checkout that contains this skill, you can skip this. + ### Step 1 - Identify the PR `gh pr view <N> --repo NVIDIA-NeMo/Curator`. Confirm the diff touches audio -paths (`nemo_curator/stages/audio/`, `tutorials/audio/`, `tests/stages/audio/`, -`benchmarking/AUDIO_PROFILING.md` / `ALM_BENCHMARK.md`); if it does not, this -audio skill is the wrong lens. +paths (`nemo_curator/stages/audio/`, `nemo_curator/tasks/audio_task.py`, +`tutorials/audio/`, `tests/stages/audio/`, audio benchmarks); if it does not, +this audio skill is the wrong lens. ### Step 2 - Pull fresh GitHub data -Run the pull script (writes `pr<N>_*_latest.json` plus timestamped snapshots): - ```bash .cursor/skills/review-curator-audio-pr/scripts/pr_review_pull.sh <N> ``` -It pulls six REST endpoints (`pr view`, `reviews`, inline `comments`, issue +Pulls six REST endpoints (`pr view`, `reviews`, inline `comments`, issue `comments`, `files`, `commits`) and the GraphQL review threads, which carry the `isResolved` / `isOutdated` flags the REST inline endpoint omits. You pull this -so you can see prior review activity, not because you own the PR. +to see prior review activity, not because you own the PR. ### Step 3 - Read the diff ```bash -gh pr checkout <N> # or: git fetch origin pull/<N>/head:pr-<N> -git diff origin/main...HEAD # the change you are reviewing +gh pr diff <N> --repo NVIDIA-NeMo/Curator # works on a shallow clone +# optional, to run/inspect locally: gh pr checkout <N> ``` -Always review the actual diff, never the comment text alone. Use `main` as the -baseline for "how the audio stages already do this". +`gh pr diff` fetches the patch straight from GitHub, so it works even on the +shallow clone from step 0. Always review the actual diff, never the comment text +alone. Use `main` as the baseline for "how the audio stages already do this". ### Step 4 - Review through the audio lenses -Apply the lenses in [recurring-themes.md](recurring-themes.md): stage contracts, -setup/teardown lifecycle, audio optional-dependency hygiene (vLLM, NeMo, model -utils), secret-safe logging, waveform/tensor memory and manifest serialization, -tarred/sharded audio I/O, sample-rate and metadata propagation, -streaming/throughput, tutorials/docs, tests/coverage, and PR reviewability. - -These build on the project's always-on rules in `.cursor/rules/` -(`processing-stage-patterns`, `composite-stage-patterns`, `task-patterns`, -`executors`, `resources-configuration`, `pipeline-structure`, -`modality-structure`, `coding-standards`) and the audio stage guide at -`nemo_curator/stages/audio/README.md`. Cite these when the PR deviates. +Apply the lenses in [knowledge-sources.md](knowledge-sources.md) section 2 - +each lens links the audio code, README section, and `.cursor/rules` contract it +governs: stage contracts, setup/teardown lifecycle, audio optional-dependency +hygiene (vLLM, NeMo, model utils), secret-safe logging, waveform/tensor memory +and manifest serialization, tarred/sharded audio I/O, sample-rate and metadata +propagation, streaming/throughput, tutorials/docs, tests/coverage, and PR +reviewability. Cite the linked source by name whenever the PR deviates. ### Step 5 - Generate the context files @@ -114,10 +142,10 @@ already made. ### Step 6 - Write up and post your findings -Record each issue you found as a finding, classified by severity, and post them -as a PR review (inline comments for specific lines, a top-level summary for -overall verdict). Each finding should cite the exact `path:line-range` on the -current head and propose a concrete fix. +Record each issue as a finding, classified by severity, and post them as a PR +review (inline comments for specific lines, a top-level summary for the overall +verdict). Each finding cites the exact `path:line-range` on the current head and +proposes a concrete fix. - **P0** - merge blocker: data loss / crash on a valid audio config (e.g. a manifest writer that crashes on a kept waveform), a secret (HF token) leaked @@ -146,14 +174,7 @@ A compliant audio PR must satisfy these; flag any that are missing: ## Knowledge sources -Full index in [reference.md](reference.md). Quick map: - -| Need | Source | -|------|--------| -| Audio stage guide | `nemo_curator/stages/audio/README.md` | -| Stage / pipeline contracts | `.cursor/rules/*.mdc`, `nemo_curator/stages/base.py` | -| Audio tutorials | `tutorials/audio/` | -| Backends / executors | `nemo_curator/backends/`, `nemo_curator/pipeline/pipeline.py` | -| Audio perf expectations | `benchmarking/AUDIO_PROFILING.md`, `benchmarking/ALM_BENCHMARK.md` | -| Contribution rules to check against | `CONTRIBUTING.md` | -| Live PR data | `gh` (see scripts) | +[knowledge-sources.md](knowledge-sources.md) is the single index: canonical docs +(audio README, developer-guide slides, `.cursor/rules`, tutorials, benchmarks), +the audio code map, the review lenses with per-concept code references, the +post-#1608 PR corpus workflow, and the GitHub/`gh` data reference. diff --git a/.cursor/skills/review-curator-audio-pr/knowledge-sources.md b/.cursor/skills/review-curator-audio-pr/knowledge-sources.md new file mode 100644 index 0000000000..cb76080537 --- /dev/null +++ b/.cursor/skills/review-curator-audio-pr/knowledge-sources.md @@ -0,0 +1,309 @@ +# Audio PR review knowledge sources + +Read this before reviewing. It is the single index for an audio Curator review +and has five parts: + +0. **Canonical docs** - the authoritative references to ground every finding. +1. **Audio code map** - where each audio stage lives. +2. **Review lenses** - what to check, each linked to the audio code, README + section, and `.cursor/rules` contract it governs. +3. **Severity mapping** - P0-P3. +4. **Pre-review corpus** - pull every audio PR after #1608 (open + closed) with + reviewer comments, so you learn the recurring feedback before you review. +5. **GitHub data + scripts reference** - the `gh` endpoints and helper scripts. + +All paths are repo-relative. The skill is self-contained: this repository plus +the GitHub CLI (`gh`). Scope: the **audio** modality. + +--- + +## 0. Canonical docs (start here) + +| Source | What it gives you | +|--------|-------------------| +| `nemo_curator/stages/audio/README.md` | The audio stage **developer guide** (1000+ lines): CPU vs GPU stages, `process` / `process_batch`, `batch_size`, call chains, per-stage memory characteristics, end-to-end FLEURS & ALM traces, new-stage checklist. Read first. | +| Audio curation developer guide (slides): <https://docs.google.com/presentation/d/15lJHyAoRTbNFDWq8UJ6czMaUFck3WYPCRaMYPfO_qew/edit> | Design intent and pipeline overviews behind the audio modality (the deck the README operationalizes). | +| `nemo_curator/tasks/audio_task.py` | `AudioTask(Task[dict])` + `_AttrDict` task model (one manifest entry per task; dict keys exposed as attributes for `validate_input`). | +| `nemo_curator/stages/base.py` | `ProcessingStage` contract: `inputs()`/`outputs()`, `process`/`process_batch`, `setup_on_node`/`setup`/`teardown`, `validate_input`. | +| `CONTRIBUTING.md` | DCO sign-off, PR process, required tests, coverage threshold, copyright header. | +| `.cursor/rules/*.mdc` | Always-on engineering contracts (table below). | +| `tutorials/audio/README.md` + per-pipeline READMEs | Runnable reference pipelines (alm, audio_pretrain, callhome_diar, fleurs, readspeech, single_speaker_filter, tagging). | +| `benchmarking/README.md`, `benchmarking/AUDIO_PROFILING.md`, `benchmarking/ALM_BENCHMARK.md` | How audio/ALM benchmarks are wired and the perf expectations to hold a PR to. | +| `tests/stages/audio/` | Test layout that mirrors `nemo_curator/stages/audio/`; the coverage bar. | + +### `.cursor/rules/` contracts + +| Rule | Covers | +|------|--------| +| `.cursor/rules/processing-stage-patterns.mdc` | `ProcessingStage` subclassing, `inputs()`/`outputs()`, `process()`, lifecycle hooks | +| `.cursor/rules/composite-stage-patterns.mdc` | `CompositeStage` decomposition (e.g. a reader = discovery + shard fan-out) | +| `.cursor/rules/task-patterns.mdc` | `Task` / typed task subclasses, metadata flow | +| `.cursor/rules/executors.mdc` | Xenna / Ray Data backends and when to use each | +| `.cursor/rules/resources-configuration.mdc` | `Resources`, `batch_size`, GPU allocation | +| `.cursor/rules/pipeline-structure.mdc` | Pipeline assembly | +| `.cursor/rules/modality-structure.mdc` | Where modality code/tests/tutorials live | +| `.cursor/rules/coding-standards.mdc` | Ruff, copyright header, loguru, pytest, Python versions | + +--- + +## 1. Audio code map (diff ground truth) + +`nemo_curator/stages/audio/` is organized as: + +| Area | Path(s) | +|------|---------| +| Task model | `nemo_curator/tasks/audio_task.py` | +| Base contract | `nemo_curator/stages/base.py` | +| Common CPU stages (duration, value filter) | `nemo_curator/stages/audio/common.py` | +| I/O (convert to DocumentBatch, segment extraction) | `nemo_curator/stages/audio/io/convert.py`, `nemo_curator/stages/audio/io/extract_segments.py` | +| Preprocessing | `nemo_curator/stages/audio/preprocessing/mono_conversion.py`, `nemo_curator/stages/audio/preprocessing/concatenation.py` | +| Inference - ASR | `nemo_curator/stages/audio/inference/asr/asr_nemo.py` | +| Inference - VAD | `nemo_curator/stages/audio/inference/vad/whisperx_vad.py` | +| Inference - speaker diarization | `nemo_curator/stages/audio/inference/speaker_diarization/sortformer.py`, `nemo_curator/stages/audio/inference/speaker_diarization/pyannote.py` | +| Filtering (band, SigMOS, UTMOS) | `nemo_curator/stages/audio/filtering/band.py`, `nemo_curator/stages/audio/filtering/sigmos.py`, `nemo_curator/stages/audio/filtering/utmos.py` | +| Segmentation | `nemo_curator/stages/audio/segmentation/vad_segmentation.py`, `nemo_curator/stages/audio/segmentation/speaker_separation.py` | +| Tagging (inference + text) | `nemo_curator/stages/audio/tagging/inference/nemo_asr_align.py`, `nemo_curator/stages/audio/tagging/merge_alignment_diarization.py`, `nemo_curator/stages/audio/tagging/resample_audio.py`, `nemo_curator/stages/audio/tagging/split.py`, `nemo_curator/stages/audio/tagging/text/itn.py` | +| Postprocessing | `nemo_curator/stages/audio/postprocessing/timestamp_mapper.py` | +| Metrics / throughput | `nemo_curator/stages/audio/metrics/wer.py`, `nemo_curator/stages/audio/metrics/bandwidth.py`, `nemo_curator/stages/audio/metrics/squim.py` | +| Datasets (FLEURS, ReadSpeech) | `nemo_curator/stages/audio/datasets/fleurs/create_initial_manifest.py`, `nemo_curator/stages/audio/datasets/readspeech/create_initial_manifest.py`, `nemo_curator/stages/audio/datasets/file_utils.py` | +| ALM (audio-language model data) | `nemo_curator/stages/audio/alm/alm_data_builder.py`, `nemo_curator/stages/audio/alm/alm_data_overlap.py`, `nemo_curator/stages/audio/alm/pretrain/pipeline.py`, `nemo_curator/stages/audio/alm/pretrain/io.py` | +| Composite pipelines | `nemo_curator/stages/audio/advanced_pipelines/audio_data_filter/audio_data_filter.py` | + +Backends (how stages run): `nemo_curator/backends/` (`base.py`, `xenna/`, +`ray_data/`, `ray_actor_pool/`); pipeline assembly: +`nemo_curator/pipeline/pipeline.py`; task types: `nemo_curator/tasks/`. + +--- + +## 2. Review lenses (each with code + doc references) + +Apply each lens to the diff and cite exact `path:line` evidence. When a change +violates a contract, link it by name. The **References** under each lens are the +canonical sources to cite. + +### 2.1 Stage contracts and reuse + +- An audio stage subclasses `ProcessingStage[AudioTask, AudioTask]` (or the + appropriate in/out task types) with explicit `inputs()` / `outputs()`, a + meaningful `name`, and appropriate `resources` / `batch_size` (set as class + attributes / dataclass fields, never the read-only `_name` / `_resources` / + `_batch_size` properties). Field order follows the base convention: `name` + first, then params, then `resources`, then `batch_size` last. +- GPU/model work (ASR, VAD, diarization, tagging inference) belongs in + `process_batch()` with a real `batch_size`; `__init__` stays lightweight (it + runs on the driver). Batch-only stages raise `NotImplementedError` from + `process()` (the convention used by dedup stages). +- Reuse existing audio readers, manifest writers, and shared model abstractions + instead of bespoke orchestration. Model a tarred/sharded reader as a + `CompositeStage` = a discovery stage + a shard-reader fan-out. +- If `inputs()` does not declare a required key (e.g. `audio_filepath_key`, + `duration_key`), runtime validation won't check it, and a direct + `task.data[key]` access then raises an opaque `KeyError` instead of a clear + validation error. Declare required keys. +- Fan-out shard readers need deterministic child IDs and must propagate + `_metadata` / `_stage_perf`; declare `ray_stage_spec()` with + `IS_FANOUT_STAGE: True`. First-stage shard discovery from an empty input + should constrain placement (e.g. one worker per node) so discovery isn't + duplicated across workers. +- **References:** `nemo_curator/stages/base.py`; `nemo_curator/tasks/audio_task.py`; `nemo_curator/stages/audio/common.py`; `nemo_curator/stages/audio/io/convert.py`; `nemo_curator/stages/audio/datasets/fleurs/create_initial_manifest.py`; rules `processing-stage-patterns.mdc`, `composite-stage-patterns.mdc`, `task-patterns.mdc`; README sections "Writing a CPU stage" / "Writing a GPU stage". + +### 2.2 Setup / teardown lifecycle and audio dependencies + +- `setup_on_node()` = once-per-node prep (model download / shared cache / output + truncation). `setup()` = per-worker init (model load, GPU allocation). + `teardown()` frees GPU memory. Don't conflate them; prefer `setup_on_node` for + anything that must run exactly once (avoids `_setup_done`-style guards). +- Import heavy/optional audio deps (NeMo, vLLM, whisperx, pyannote, soundfile, + model-specific utils) lazily inside `setup()` or behind a guarded import, + never unconditionally at module top level. A top-level `import <optional_pkg>` + breaks `import nemo_curator.stages.audio...` on any install without that extra + (including CPU-only / Mac/ARM). +- Declare the full audio runtime dependency set in the appropriate + `pyproject.toml` optional extra (e.g. a CUDA extra); add an import smoke test. + Don't rely on an external Docker image to supply deps a reusable stage needs. +- Avoid hard version pins (`pkg==x.y.z`) and global overrides for libraries + shared across modalities (`transformers`, `huggingface-hub`, `accelerate`); + prefer compatible ranges. +- Expose model knobs like `cache_dir` and `trust_remote_code` as parameters + rather than hardcoding them. +- **References:** `nemo_curator/stages/audio/inference/asr/asr_nemo.py` (download-on-node / load-on-worker split, `cache_dir`); `nemo_curator/stages/audio/inference/vad/whisperx_vad.py`; `nemo_curator/stages/audio/inference/speaker_diarization/sortformer.py`, `pyannote.py`; `nemo_curator/stages/audio/filtering/{utmos,sigmos,band}.py`; `pyproject.toml`; rules `resources-configuration.mdc`; README "Stage lifecycle". + +### 2.3 Secret-safe logging + +- Never log a full resolved config; audio tutorials commonly accept an HF token + / cloud credentials. Redact secret-like keys before logging (token, password, + secret, credentials, `*_token`/`*_secret`), or log only a non-secret subset, + and add a test asserting the secret value never appears in logged output. +- **References:** `tutorials/audio/*/` (token/credential inputs); any stage that + logs configuration; rules `coding-standards.mdc` (loguru). + +### 2.4 Waveform/tensor memory and manifest serialization + +- Drop waveform arrays / audio tensors from `task.data` once a stage no longer + needs them. A leaked numpy `ndarray` / torch tensor that reaches a JSON + manifest writer raises `TypeError: Object of type ndarray is not JSON + serializable` and can fail an entire shard/batch, or cause OOM at scale. Strip + by waveform key plus a duck-typed guard (`.shape`/`.dtype`), and wrap + `json.dumps` so a stray non-serializable value reports the offending key. +- ALM stages that replace the whole dict must `task.data.clear()` + + `task.data.update(result)` to preserve the `_AttrDict` type. Remember + `_AttrDict.__setattr__` routes writes into the dict, so never stamp private + fields onto a task's data dict. +- Guard audio I/O edge cases: `tarfile.extractfile()` returns `None` for + non-regular members even after `isfile()` - check before `.read()`. Handle + resampling / channel-count / dtype mismatches explicitly. +- **References:** `nemo_curator/stages/audio/io/convert.py`; `nemo_curator/stages/audio/io/extract_segments.py`; `nemo_curator/stages/audio/alm/alm_data_builder.py`; `nemo_curator/stages/audio/preprocessing/{mono_conversion,concatenation}.py`; `nemo_curator/tasks/audio_task.py`; README "Memory characteristics". + +### 2.5 Tarred / sharded audio I/O + +- Keep infrastructure assumptions (specific object-store URLs, auth env vars, + cluster-local paths) out of the generic reader/writer. Make remote access + pluggable: accept local paths, ordinary cloud URIs, or a user-provided + opener/resolver, and isolate environment-specific support behind a small, + documented, optional component. For piped/streamed downloads use failure-aware + flags + retries so truncated streams fail loudly. +- Cache the fsspec filesystem object in `setup_on_node` rather than re-creating + it per `process()` call (a new HTTP connection per manifest entry is costly on + S3/GCS at scale). +- Manifest/shard writers that loop per task inside `process_batch()` reopen the + output repeatedly; group by shard and open each output once per batch. Preserve + `.done`/checkpoint markers so runs are resumable. +- **References:** `nemo_curator/stages/audio/io/extract_segments.py`; `nemo_curator/stages/audio/datasets/{fleurs,readspeech}/create_initial_manifest.py`; `nemo_curator/stages/audio/datasets/file_utils.py`; `nemo_curator/stages/audio/alm/pretrain/io.py`; rules `composite-stage-patterns.mdc`. + +### 2.6 Sample-rate and metadata propagation + +- Resampling/mono-conversion/segmentation stages must update the sample-rate and + duration keys they change and propagate `_metadata` / `_stage_perf`; downstream + stages that assume a fixed sample rate must declare it in `inputs()`. +- **References:** `nemo_curator/stages/audio/preprocessing/mono_conversion.py`; `nemo_curator/stages/audio/tagging/resample_audio.py`; `nemo_curator/stages/audio/segmentation/vad_segmentation.py`; `nemo_curator/stages/audio/postprocessing/timestamp_mapper.py`. + +### 2.7 Streaming, throughput, benchmarking + +- Streaming execution with per-stage resource configuration is the default model + for scalable audio pipelines. +- Emit stage metrics through the supported metrics path, JSON-shaped. For + multi-node audio runs include worker/node/GPU identity and latency + distributions (p50/p95/p99) plus audio-seconds-per-wall-second, not just + totals/averages. +- Standalone audio benchmark scripts belong in the `benchmarking/` flow with + config entries and comparable parameters - not freestanding. +- **References:** `nemo_curator/stages/audio/metrics/{wer,bandwidth,squim}.py`; `benchmarking/AUDIO_PROFILING.md`, `benchmarking/ALM_BENCHMARK.md`, `benchmarking/README.md`; rules `executors.mdc`. + +### 2.8 Tutorials and docs + +- Audio tutorials (`tutorials/audio/`) must be runnable, minimal, and use public + APIs only - no private APIs, no machine-local absolute paths, no committed + noisy notebook output, no executor mismatch between a notebook and its CLI. +- Don't duplicate pipeline-construction logic between a tutorial entrypoint and + an `examples/` script; extract one shared builder both call. +- Every config key documented for users must actually reach a stage; validate + config-to-stage mapping and fail (or warn) on unused keys. +- Don't commit generated analysis/scratch docs into the source tree. +- **References:** `tutorials/audio/README.md`; `tutorials/audio/{alm,audio_pretrain,callhome_diar,fleurs,readspeech,single_speaker_filter,tagging}/README.md`; `nemo_curator/stages/audio/README.md`. + +### 2.9 Tests and standards (CONTRIBUTING + coding-standards) + +- Source changes need accompanying tests; `tests/stages/audio/` mirrors + `nemo_curator/stages/audio/`. The project enforces 80% coverage under + `nemo_curator/`. GPU-only audio tests use `@pytest.mark.gpu`; reuse fixtures + under `tests/fixtures/audio/` and sample data under `tests/data/audio/`. +- Every non-empty Python file carries the NVIDIA Apache-2.0 copyright header. +- Ruff clean at 119-char lines; type annotations on functions; loguru for logs. +- Every commit is DCO signed-off. +- **References:** `CONTRIBUTING.md`; rules `coding-standards.mdc`, `modality-structure.mdc`; `tests/stages/audio/`. + +### 2.10 PR size and reviewability + +- Large multi-stage audio PRs are hard to review and defend. Prefer splitting by + stage / module / tutorial / benchmark. If a single PR must remain, organize it + as if sliced: no duplicated builder logic, clear module ownership, a reviewable + commit sequence. + +### Smaller recurring nits + +- Missing copyright headers on new files. +- Top-level imports that should be lazy/guarded (very common on audio model modules). +- Long inline prompt/config strings that should be module constants. +- Dataclass fields validated in `__post_init__` that could just be typed. +- Dead conditionals / redundant truthiness checks. +- Pytests added for tutorials that don't warrant them. +- Debug-only knobs (e.g. `max_utterances_per_shard`) presented as features - document them as debug throttles. +- `trust_remote_code=True` hardcoded on a model load - expose as a parameter. + +--- + +## 3. Severity mapping + +| Severity | Meaning | +|----------|---------| +| P0 | Merge blocker: crash/data-loss on a valid audio config, secret leaked to logs, import breaks a supported install | +| P1 | Fix before merge: undeclared audio deps, missing required tests/coverage, env-specific I/O in reusable reader, PR too large | +| P2 | Should fix: duplicated pipeline construction, inefficient writer batching, incomplete throughput metrics, documented-but-unwired config | +| P3 | Nice to have: temporary model/compat workaround needing test+comment, debug knob docs | + +--- + +## 4. Pre-review corpus: learn from post-#1608 audio PRs + +PR [#1608](https://github.com/NVIDIA-NeMo/Curator/pull/1608) (`AudioBatch -> +AudioTask` redesign) reset the audio stage contracts. Reviewer feedback on audio +PRs **after** #1608 is the best predictor of what to flag next. Before reviewing +a new PR - especially the first time you review audio in this repo - build the +consolidated corpus of that feedback: + +```bash +# 1) discover audio PRs after #1608 (open + closed/merged) and pull their +# reviews + inline comments into .curator-pr-review/audio-corpus/ +.cursor/skills/review-curator-audio-pr/scripts/pull_audio_pr_corpus.sh --since 1608 + +# 2) render one consolidated, reviewer-comment corpus +.cursor/skills/review-curator-audio-pr/scripts/build_corpus.py +``` + +`pull_audio_pr_corpus.sh` lists every PR with number > `--since` (PR numbers are +monotonic in time, so number > 1608 == opened after #1608), keeps the ones whose +changed files touch audio paths, and pulls each one's reviews, inline comments, +and issue comments. `build_corpus.py` writes +`.curator-pr-review/audio-corpus/audio_pr_corpus_<date>.md`: one section per +audio PR (number, title, state, author, link) with every reviewer comment +verbatim, anchored to `path:line`, plus a recurring-themes tally. + +Use the corpus to (a) recognize patterns reviewers repeatedly raise (the lenses +in section 2 came from exactly this), and (b) check whether the PR in front of +you repeats a mistake already called out elsewhere. It is read-only context; it +never auto-posts anything. Re-run periodically to keep it current. + +--- + +## 5. GitHub data + scripts reference + +`scripts/ensure_repo.sh [CLONE_DIR]` - reuse an existing Curator checkout or +shallow-clone one; prints `CURATOR_REPO=<path>` on its last line. + +`scripts/pr_review_pull.sh <N> [--outdir DIR] [--repo OWNER/REPO]` pulls into the +scratch outdir: + +| File | Endpoint | +|------|----------| +| `pr<N>_gh_latest.json` | `gh pr view <N> --json ...` | +| `pr<N>_reviews_latest.json` | `repos/:owner/:repo/pulls/<N>/reviews` | +| `pr<N>_review_comments_latest.json` | `repos/:owner/:repo/pulls/<N>/comments` (inline) | +| `pr<N>_issue_comments_latest.json` | `repos/:owner/:repo/issues/<N>/comments` | +| `pr<N>_files_latest.json` | `repos/:owner/:repo/pulls/<N>/files` | +| `pr<N>_commits_latest.json` | `repos/:owner/:repo/pulls/<N>/commits` | +| `pr<N>_review_threads_latest.json` | GraphQL `pullRequest.reviewThreads` (isResolved/isOutdated) | + +The REST inline-comments endpoint does not expose resolve/outdate state; the +GraphQL thread payload does. The builder joins them by comment `databaseId` +(== REST `id`), falling back to a `(path, body-prefix)` match when an older +thread dump lacks `databaseId`. + +`scripts/build_digest.py <N> [--outdir DIR] [--today YYYY-MM-DD] [--prev-head SHA] [--baseline-ts TS]` +renders the working digest + the prior-open-threads context file (you add your +own findings). + +`scripts/pull_audio_pr_corpus.sh [--since N] [--outdir DIR] [--repo OWNER/REPO] [--limit N]` +and `scripts/build_corpus.py [--outdir DIR] [--today YYYY-MM-DD]` build the +post-#1608 corpus (section 4). + +Default outdir: `.curator-pr-review/` (scratch; gitignored, safe to delete). diff --git a/.cursor/skills/review-curator-audio-pr/recurring-themes.md b/.cursor/skills/review-curator-audio-pr/recurring-themes.md deleted file mode 100644 index 38b5c826e9..0000000000 --- a/.cursor/skills/review-curator-audio-pr/recurring-themes.md +++ /dev/null @@ -1,142 +0,0 @@ -# Review lenses - recurring themes in Curator audio PRs - -Apply each lens to the diff and cite exact `path:line` evidence. These extend -the always-on rules in `.cursor/rules/` and the audio guide at -`nemo_curator/stages/audio/README.md`; when a change violates a rule, link it by -name. - -## Stage contracts and reuse - -- An audio stage subclasses `ProcessingStage[InputTask, OutputTask]` with - explicit `inputs()` / `outputs()`, a meaningful `name`, and appropriate - `resources` / `batch_size` (set as class attributes / dataclass fields, never - as the read-only `_name` / `_resources` / `_batch_size` properties). -- GPU/model work (ASR, VAD, diarization, tagging inference) belongs in - `process_batch()` with a real `batch_size`; `__init__` stays lightweight (it - runs on the driver). -- Reuse existing audio readers, manifest writers, and shared model abstractions - instead of bespoke orchestration. Model a tarred/sharded reader as a - `CompositeStage` = a discovery stage plus a shard-reader fan-out. -- If `inputs()` does not declare a required key (e.g. `waveform_key`, - `sample_rate_key`), runtime validation will not check it, and a direct - `task.data[key]` access then raises an opaque `KeyError` instead of a clear - validation error. Declare required keys. -- Fan-out shard readers need deterministic child IDs and must propagate - `_metadata` / `_stage_perf`. First-stage shard discovery from an empty input - should constrain placement (e.g. one worker per node) so discovery is not - duplicated across every worker. - -## Setup / teardown lifecycle and audio dependencies - -- `setup_on_node()` = once-per-node prep (model download / shared cache). - `setup()` = per-worker init (model load, GPU allocation). `teardown()` frees - GPU memory. Don't conflate them. -- Import heavy/optional audio deps (vLLM, NeMo, model-specific utils, - soundfile/lhotse, fasttext) lazily inside `setup()` or behind a guarded - import, never unconditionally at module top level, when they ship only with an - optional extra. A top-level `import <optional_pkg>` breaks - `import nemo_curator.stages.audio...` on any install without that extra - (including CPU-only / Mac/ARM). -- Declare the full audio runtime dependency set in the appropriate - `pyproject.toml` optional extra (e.g. an audio CUDA extra); add an import - smoke test under that extra. Don't rely on an external Docker image to supply - dependencies a reusable audio stage needs. -- Avoid hard version pins (`pkg==x.y.z`) and global overrides for libraries - shared across modalities (`transformers`, `huggingface-hub`, `accelerate`); - prefer compatible ranges so other extras don't break. - -## Secret-safe logging - -- Never log a full resolved config; audio tutorials commonly accept an HF token - / cloud credentials. Redact secret-like keys before logging (token, password, - secret, credentials, `*_token`/`*_secret`), or log only a non-secret subset, - and add a test asserting the secret value never appears in logged output. - -## Waveform/tensor memory and manifest serialization - -- Drop waveform arrays / audio tensors from `task.data` once a stage no longer - needs them. A leaked numpy `ndarray` or torch tensor that reaches a JSON - manifest writer raises `TypeError: Object of type ndarray is not JSON - serializable` and can fail the entire shard/batch, or cause OOM at scale. - Strip by waveform key plus a duck-typed guard (`.shape`/`.dtype`), and wrap - `json.dumps` so a stray non-serializable value reports the offending key. -- Guard audio I/O edge cases: `tarfile.extractfile()` returns `None` for - non-regular members even after `isfile()` - check before `.read()`. Handle - resampling / channel-count / dtype mismatches explicitly. - -## Tarred / sharded audio I/O - -- Keep infrastructure assumptions (specific object-store URLs, auth env vars, - cluster-local paths) out of the generic reader/writer. Make remote access - pluggable: accept local paths, ordinary cloud URIs, or a user-provided - opener/resolver, and isolate any environment-specific support behind a small, - documented, optional component. For piped/streamed downloads use - failure-aware flags + retries so truncated streams fail loudly. -- Manifest/shard writers that loop per task inside `process_batch()` reopen the - output file repeatedly; group by shard and open each output once per batch. - Preserve `.done`/checkpoint markers so runs are resumable. - -## Streaming, throughput, benchmarking - -- Streaming execution with per-stage resource configuration is the default - model for scalable audio pipelines (see `executors.mdc`). -- Emit stage metrics through the supported metrics path (`metrics/`), JSON- - shaped. For multi-node audio runs include worker/node/GPU identity and latency - distributions (p50/p95/p99) plus audio-seconds-per-wall-second, not just - totals/averages. -- Standalone audio benchmark scripts belong in the `benchmarking/` flow - (`AUDIO_PROFILING.md`, `ALM_BENCHMARK.md`) with config entries and comparable - parameters - not freestanding. - -## Tutorials and docs - -- Audio tutorials (`tutorials/audio/`) must be runnable, minimal, and use public - APIs only - no private APIs, no machine-local absolute paths, no committed - noisy notebook output, no executor mismatch between a notebook and its CLI. - Prefer `uv` for environment setup if the tutorial standard does. -- Don't duplicate pipeline-construction logic between a tutorial entrypoint and - an `examples/` script; extract one shared builder both call. -- Every config key documented for users must actually reach a stage; validate - config-to-stage mapping and fail (or warn) on unused keys. -- Don't commit generated analysis/scratch docs into the source tree. - -## Tests and standards (CONTRIBUTING + coding-standards) - -- Source changes need accompanying tests; `tests/stages/audio/` mirrors - `nemo_curator/stages/audio/`. The project enforces 80% coverage under - `nemo_curator/`. GPU-only audio tests use `@pytest.mark.gpu`; reuse fixtures - under `tests/fixtures/audio/` and sample data under `tests/data/audio/`. -- Every non-empty Python file carries the NVIDIA Apache-2.0 copyright header. -- Ruff clean at 119-char lines; type annotations on functions; loguru for logs. -- Every commit is DCO signed-off. - -## PR size and reviewability - -- Large multi-stage audio PRs are hard to review and defend. Prefer splitting by - stage / module / tutorial / benchmark (e.g. reader+writer plumbing, then - inference stages, then postprocessing, then tutorial/config). If a single PR - must remain, organize it as if sliced: no duplicated builder logic, clear - module ownership, a reviewable commit sequence. - -## Smaller recurring nits - -- Missing copyright headers on new files. -- Top-level imports that should be lazy/guarded (very common on audio model - modules). -- Long inline prompt/config strings that should be module constants. -- Dataclass fields validated in `__post_init__` that could just be typed. -- Dead conditionals / redundant truthiness checks. -- Pytests added for tutorials that don't warrant them. -- Debug-only knobs (e.g. `max_utterances_per_shard`) presented as features - - document them as debug throttles. -- `trust_remote_code=True` hardcoded on a model load - expose as a parameter so - security-conscious users can opt out. - -## Severity mapping - -| Severity | Meaning | -|----------|---------| -| P0 | Merge blocker: crash/data-loss on a valid audio config, secret leaked to logs, import breaks a supported install | -| P1 | Fix before merge: undeclared audio deps, missing required tests/coverage, env-specific I/O in reusable reader, PR too large | -| P2 | Should fix: duplicated pipeline construction, inefficient writer batching, incomplete throughput metrics, documented-but-unwired config | -| P3 | Nice to have: temporary model/compat workaround needing test+comment, debug knob docs | diff --git a/.cursor/skills/review-curator-audio-pr/reference.md b/.cursor/skills/review-curator-audio-pr/reference.md deleted file mode 100644 index 42aac6f78f..0000000000 --- a/.cursor/skills/review-curator-audio-pr/reference.md +++ /dev/null @@ -1,105 +0,0 @@ -# Reference - NeMo Curator audio PR review - -All paths are relative to the repository root. The skill is self-contained: it -relies only on this repository plus the GitHub CLI (`gh`). Scope: the **audio** -modality. - -## Primary audio guide - -- `nemo_curator/stages/audio/README.md` - audio stage contracts, `process` vs - `process_batch`, lazy/optional imports, metrics. Read this first for any - audio review. - -## Canonical contracts (`.cursor/rules/`) - -Always-on rules describing how stages and pipelines must be written. Cite them -when a PR deviates. - -| Rule | Covers | -|------|--------| -| `processing-stage-patterns.mdc` | `ProcessingStage` subclassing, `inputs()`/`outputs()`, `process()`, lifecycle hooks | -| `composite-stage-patterns.mdc` | `CompositeStage` decomposition (e.g. a reader = discovery + shard fan-out) | -| `task-patterns.mdc` | `Task` / typed task subclasses, metadata flow | -| `executors.mdc` | Xenna / Ray Data backends and when to use each | -| `resources-configuration.mdc` | `Resources`, `batch_size`, GPU allocation | -| `pipeline-structure.mdc` | Pipeline assembly | -| `modality-structure.mdc` | Where modality code/tests/tutorials live | -| `coding-standards.mdc` | Ruff, copyright header, loguru, pytest, Python versions | - -## Audio stage map (diff ground truth) - -`nemo_curator/stages/audio/` is organized as: - -| Area | Path | -|------|------| -| Base contract | `nemo_curator/stages/base.py` | -| I/O (tarred/sharded readers, manifest writers) | `nemo_curator/stages/audio/io/` | -| Preprocessing | `nemo_curator/stages/audio/preprocessing/` | -| Inference: ASR / VAD / speaker diarization | `nemo_curator/stages/audio/inference/{asr,vad,speaker_diarization}/` | -| Filtering (band, SigMOS, etc.) | `nemo_curator/stages/audio/filtering/` | -| Segmentation (speaker separation) | `nemo_curator/stages/audio/segmentation/` | -| Tagging (inference + text) | `nemo_curator/stages/audio/tagging/` | -| Postprocessing | `nemo_curator/stages/audio/postprocessing/` | -| Metrics / throughput | `nemo_curator/stages/audio/metrics/` | -| Datasets (fleurs, readspeech) | `nemo_curator/stages/audio/datasets/` | -| ALM (audio-language pretrain) | `nemo_curator/stages/audio/alm/` | -| Composite pipelines | `nemo_curator/stages/audio/advanced_pipelines/` | - -Backends (how stages run): `nemo_curator/backends/` (`base.py`, `xenna/`, -`ray_data/`, `ray_actor_pool/`); pipeline assembly: -`nemo_curator/pipeline/pipeline.py`; task types: `nemo_curator/tasks/`. - -## Audio tutorials - -`tutorials/audio/` - `alm/`, `audio_pretrain/`, `callhome_diar/`, `fleurs/`, -`readspeech/`, `single_speaker_filter/`, `tagging/`. Tutorials must be runnable, -use public APIs only, and not duplicate pipeline-building logic that belongs in -a shared builder. - -## Audio performance / benchmarking - -- `benchmarking/README.md` - how benchmarks are wired and run. -- `benchmarking/AUDIO_PROFILING.md` - audio benchmark / perf expectations. -- `benchmarking/ALM_BENCHMARK.md` - ALM pipeline benchmark. - -Standalone audio benchmark scripts should be integrated into this flow, not left -freestanding. - -## Audio tests - -`tests/stages/audio/` mirrors `nemo_curator/stages/audio/` (io, inference, -filtering, segmentation, tagging, postprocessing, preprocessing, metrics, -datasets, alm, advanced_pipelines). Audio fixtures live under -`tests/fixtures/audio/` and `tests/data/audio/`. GPU-only tests are marked -`@pytest.mark.gpu`. - -## Contribution rules - -- `CONTRIBUTING.md` - DCO sign-off (`git commit -s`), PR process, test - requirements, coverage threshold, copyright header. - -## GitHub data via `gh` - -`scripts/pr_review_pull.sh <N>` pulls these into the scratch outdir: - -| File | Endpoint | -|------|----------| -| `pr<N>_gh_latest.json` | `gh pr view <N> --json ...` | -| `pr<N>_reviews_latest.json` | `repos/:owner/:repo/pulls/<N>/reviews` | -| `pr<N>_review_comments_latest.json` | `repos/:owner/:repo/pulls/<N>/comments` (inline) | -| `pr<N>_issue_comments_latest.json` | `repos/:owner/:repo/issues/<N>/comments` | -| `pr<N>_files_latest.json` | `repos/:owner/:repo/pulls/<N>/files` | -| `pr<N>_commits_latest.json` | `repos/:owner/:repo/pulls/<N>/commits` | -| `pr<N>_review_threads_latest.json` | GraphQL `pullRequest.reviewThreads` (isResolved/isOutdated) | - -The REST inline-comments endpoint does not expose resolve/outdate state; the -GraphQL thread payload does. The builder joins them by comment `databaseId` -(== REST `id`), falling back to a `(path, body-prefix)` match when an older -thread dump lacks `databaseId`. - -## Scripts - -- `scripts/pr_review_pull.sh <N> [--outdir DIR] [--repo OWNER/REPO]` - fetch. -- `scripts/build_digest.py <N> [--outdir DIR] [--today YYYY-MM-DD] [--prev-head SHA] [--baseline-ts TS]` - render the working digest + the prior-open-threads context file (you add your own findings). - -Default outdir: `.curator-pr-review/` (scratch; safe to delete or gitignore). diff --git a/.cursor/skills/review-curator-audio-pr/scripts/README.md b/.cursor/skills/review-curator-audio-pr/scripts/README.md index d270082407..9b96ff1a68 100644 --- a/.cursor/skills/review-curator-audio-pr/scripts/README.md +++ b/.cursor/skills/review-curator-audio-pr/scripts/README.md @@ -1,12 +1,22 @@ # Scripts - review-curator-audio-pr -Helpers for a **reviewer** reviewing someone else's audio Curator PR. They -pull the PR and summarise existing review activity as context; you write the -actual review findings. +Helpers for a **reviewer** reviewing someone else's audio Curator PR. They pull +the PR (and prior audio review history) and summarise it as context; you write +the actual review findings. All require the GitHub CLI (`gh`) authenticated +against github.com, and write to a scratch directory (default +`.curator-pr-review/`) that is safe to delete; do not commit its contents. -Both require the GitHub CLI (`gh`) authenticated against github.com. They write -to a scratch directory (default `.curator-pr-review/`) that is safe to delete; -do not commit its contents. +## ensure_repo.sh + +```bash +eval "$(.cursor/skills/review-curator-audio-pr/scripts/ensure_repo.sh | tail -1)" # CURATOR_REPO=<path> +``` + +Avoids cloning the whole repo unnecessarily: reuses an existing checkout if you +are already inside one or there is one under the current directory tree +(bounded by `MAXDEPTH`, default 4); only then `git clone --depth 1` (no full +history) from <https://github.com/NVIDIA-NeMo/Curator>. Prints +`CURATOR_REPO=<absolute path>` on the last line. ## pr_review_pull.sh @@ -36,5 +46,22 @@ still-open threads other reviewers left, so you don't duplicate them). The thread join uses comment `databaseId` when present, else a (path, body-prefix) fallback, so both GraphQL thread-dump shapes classify -correctly. The scripts fetch any PR; the audio focus is applied in the review -lenses (see ../recurring-themes.md). +correctly. + +## pull_audio_pr_corpus.sh + build_corpus.py (pre-review learning) + +```bash +.cursor/skills/review-curator-audio-pr/scripts/pull_audio_pr_corpus.sh --since 1608 +.cursor/skills/review-curator-audio-pr/scripts/build_corpus.py +``` + +`pull_audio_pr_corpus.sh` lists every PR with number > `--since` (PR numbers are +monotonic in time, so this is "opened after that PR"), keeps the ones touching +audio paths, and pulls each one's reviews + inline + issue comments into +`.curator-pr-review/audio-corpus/`. `build_corpus.py` consolidates them into +`audio_pr_corpus_<date>.md`: one section per audio PR with every reviewer comment +verbatim (anchored to path:line) plus a recurring-themes tally. Read-only +context; it posts nothing. See `../knowledge-sources.md` section 4. + +The single-PR scripts fetch any PR; the audio focus is applied by the review +lenses in `../knowledge-sources.md`. diff --git a/.cursor/skills/review-curator-audio-pr/scripts/build_corpus.py b/.cursor/skills/review-curator-audio-pr/scripts/build_corpus.py new file mode 100755 index 0000000000..2df94e3e78 --- /dev/null +++ b/.cursor/skills/review-curator-audio-pr/scripts/build_corpus.py @@ -0,0 +1,159 @@ +#!/usr/bin/env python3 +"""Consolidate the audio PR corpus pulled by pull_audio_pr_corpus.sh. + +Reads the per-PR JSON in the corpus dir and writes one markdown file with every +reviewer comment (verbatim, anchored to path:line) grouped by PR, plus a +recurring-themes keyword tally. Read-only context for pre-review; posts nothing. + +Usage: build_corpus.py [--outdir DIR] [--today YYYY-MM-DD] +""" +from __future__ import annotations + +import argparse +import datetime as dt +import json +import re +from pathlib import Path + +BOT_LOGINS = {"greptile-apps[bot]", "copy-pr-bot[bot]", "github-actions[bot]"} + +# (label, regex) recurring-theme buckets, mirroring knowledge-sources.md lenses. +THEMES = [ + ("setup/teardown lifecycle", r"setup_on_node|\bsetup\(|teardown|_setup_done"), + ("optional/lazy imports", r"top[- ]level import|lazy import|import .* fails|optional (dep|extra)"), + ("dependency declaration/pins", r"pyproject|optional[- ]?group|==|version pin|requirement"), + ("stage contract inputs/outputs", r"inputs\(\)|outputs\(\)|validate_input|NotImplementedError"), + ("batch_size / process_batch", r"batch_size|process_batch"), + ("memory / serialization", r"ndarray|json\.dumps|serializ|waveform|tensor|OOM|memory"), + ("fsspec / cloud I/O", r"fsspec|url_to_fs|s3|gcs|http"), + ("secrets / logging", r"token|secret|credential|password|redact"), + ("tests / coverage", r"\btest|coverage|pytest|fixture"), + ("copyright / lint", r"copyright|header|ruff|lint"), + ("naming / convention", r"naming|rename|convention|AudioTask|AudioBatch"), + ("trust_remote_code", r"trust_remote_code"), +] + + +def load(p: Path): + return json.loads(p.read_text()) if p.exists() else [] + + +def shorten(s: str, n: int = 1200) -> str: + s = (s or "").strip() + return s if len(s) <= n else s[:n] + " […]" + + +def main() -> None: + ap = argparse.ArgumentParser() + ap.add_argument("--outdir", default=".curator-pr-review/audio-corpus") + ap.add_argument("--today", default=dt.date.today().isoformat()) + args = ap.parse_args() + + outdir = Path(args.outdir) + nums_file = outdir / "_audio_pr_numbers.txt" + if not nums_file.exists(): + raise SystemExit(f"no {nums_file}; run pull_audio_pr_corpus.sh first") + numbers = [int(x) for x in nums_file.read_text().split() if x.strip()] + numbers.sort(reverse=True) + + date_us = args.today.replace("-", "_") + theme_counts = {label: 0 for label, _ in THEMES} + theme_rx = [(label, re.compile(rx, re.I)) for label, rx in THEMES] + + out: list[str] = [] + out.append(f"# Audio PR review corpus (post-#1608) - {args.today}\n") + out.append( + "Consolidated reviewer feedback on audio PRs opened after #1608 " + "(open + closed/merged). Read-only pre-review context: recognise the " + "patterns reviewers repeatedly raise, and check the PR in front of you " + "against them. Bot reviewers are marked `[bot]`.\n" + ) + out.append(f"Audio PRs in corpus: **{len(numbers)}** " + f"({', '.join('#' + str(n) for n in numbers)})\n") + + per_pr_sections: list[str] = [] + total_comments = 0 + for n in numbers: + gh = load(outdir / f"pr{n}_gh.json") + if isinstance(gh, list): + gh = gh[0] if gh else {} + reviews = load(outdir / f"pr{n}_reviews.json") + rcomments = load(outdir / f"pr{n}_review_comments.json") + icomments = load(outdir / f"pr{n}_issue_comments.json") + + author = (gh.get("author") or {}).get("login", "?") + state = gh.get("state", "?") + title = gh.get("title", "") + url = gh.get("url", f"https://github.com/NVIDIA-NeMo/Curator/pull/{n}") + + sec: list[str] = [] + sec.append(f"## PR #{n} - {title}\n") + sec.append(f"- state: **{state}** author: @{author} " + f"created: {gh.get('createdAt','?')} link: {url}\n") + + # Review summaries (non-empty bodies only). + rev_bodies = [r for r in reviews if (r.get("body") or "").strip()] + if rev_bodies: + sec.append("### Review summaries\n") + for r in rev_bodies: + login = (r.get("user") or {}).get("login", "?") + bot = " `[bot]`" if login in BOT_LOGINS else "" + sec.append(f"- **@{login}{bot}** [{r.get('state','')}] " + f"{r.get('submitted_at','')}:\n") + sec.append(f" > {shorten(r.get('body'))}\n") + for label, rx in theme_rx: + if rx.search(r.get("body") or ""): + theme_counts[label] += 1 + + # Inline comments grouped by file. + by_file: dict[str, list] = {} + for c in rcomments: + by_file.setdefault(c.get("path", "?"), []).append(c) + if by_file: + sec.append("### Inline review comments\n") + for path in sorted(by_file): + sec.append(f"#### `{path}`\n") + for c in sorted(by_file[path], + key=lambda x: (x.get("line") or x.get("original_line") or 0)): + login = (c.get("user") or {}).get("login", "?") + bot = " `[bot]`" if login in BOT_LOGINS else "" + line = c.get("line") or c.get("original_line") or "?" + body = c.get("body") or "" + total_comments += 1 + sec.append(f"- **@{login}{bot}** line {line} " + f"([link]({c.get('html_url','')})):\n") + sec.append(f" > {shorten(body)}\n") + for label, rx in theme_rx: + if rx.search(body): + theme_counts[label] += 1 + + # Top-level issue comments (skip pure CI noise). + human_ic = [c for c in icomments + if (c.get("user") or {}).get("login") not in BOT_LOGINS + and (c.get("body") or "").strip()] + if human_ic: + sec.append("### Discussion (top-level)\n") + for c in human_ic: + login = (c.get("user") or {}).get("login", "?") + sec.append(f"- **@{login}** {c.get('created_at','')}: " + f"{shorten(c.get('body'), 600)}\n") + + per_pr_sections.append("\n".join(sec)) + + # Theme tally up top. + out.append("## Recurring themes (comment hits across the corpus)\n") + out.append("| Theme | Comments mentioning it |") + out.append("|-------|------------------------|") + for label, _ in THEMES: + out.append(f"| {label} | {theme_counts[label]} |") + out.append(f"\nTotal inline review comments scanned: **{total_comments}**\n") + out.append("---\n") + out.extend(per_pr_sections) + + outpath = outdir / f"audio_pr_corpus_{date_us}.md" + outpath.write_text("\n".join(out) + "\n", encoding="utf-8") + print(f"wrote {outpath} ({outpath.stat().st_size} bytes; {len(numbers)} PRs)") + + +if __name__ == "__main__": + main() diff --git a/.cursor/skills/review-curator-audio-pr/scripts/ensure_repo.sh b/.cursor/skills/review-curator-audio-pr/scripts/ensure_repo.sh new file mode 100755 index 0000000000..4b9c914c39 --- /dev/null +++ b/.cursor/skills/review-curator-audio-pr/scripts/ensure_repo.sh @@ -0,0 +1,57 @@ +#!/usr/bin/env bash +# Make a local NVIDIA-NeMo/Curator checkout available WITHOUT cloning the whole +# repo unnecessarily. +# +# Strategy (cheapest first): +# 1. If we are already inside a Curator checkout, use it. +# 2. Else search the current directory tree (bounded depth) for an existing +# Curator checkout and reuse it. +# 3. Else shallow-clone (--depth 1, no full history) into ./Curator. +# +# Prints diagnostics to stderr and, on the LAST stdout line: +# CURATOR_REPO=<absolute path to the checkout> +# +# Usage: ensure_repo.sh [CLONE_DIR] (default CLONE_DIR=Curator) +# MAXDEPTH=<n> bound the downward search (default 4) +# +# Clone source: https://github.com/NVIDIA-NeMo/Curator +set -euo pipefail + +REPO_URL="https://github.com/NVIDIA-NeMo/Curator.git" +CLONE_DIR="${1:-Curator}" +MAXDEPTH="${MAXDEPTH:-4}" +MARKER="nemo_curator/stages/audio/README.md" + +is_curator() { [[ -f "$1/${MARKER}" && -d "$1/.cursor/rules" ]]; } +abspath() { (cd "$1" && pwd); } + +# 1) Already inside a Curator checkout? +if top="$(git rev-parse --show-toplevel 2>/dev/null)" && is_curator "${top}"; then + echo "ensure_repo: already inside Curator checkout (${top}); not cloning." >&2 + echo "CURATOR_REPO=${top}"; exit 0 +fi +if is_curator "."; then + echo "ensure_repo: current directory is a Curator checkout; not cloning." >&2 + echo "CURATOR_REPO=$(abspath .)"; exit 0 +fi + +# 2) Existing checkout somewhere under the current tree? +while IFS= read -r marker; do + cand="${marker%/${MARKER}}" + if is_curator "${cand}"; then + echo "ensure_repo: reusing existing checkout at ${cand}; not cloning." >&2 + echo "CURATOR_REPO=$(abspath "${cand}")"; exit 0 + fi +done < <(find . -maxdepth "${MAXDEPTH}" -path "*/${MARKER}" 2>/dev/null) + +# 3) Reuse a prior clone dir if present, else shallow-clone. +if is_curator "${CLONE_DIR}"; then + echo "ensure_repo: reusing ${CLONE_DIR}; not cloning." >&2 + echo "CURATOR_REPO=$(abspath "${CLONE_DIR}")"; exit 0 +fi + +command -v git >/dev/null || { echo "ensure_repo: git not found" >&2; exit 2; } +echo "ensure_repo: no local checkout found; shallow-cloning ${REPO_URL} -> ${CLONE_DIR}" >&2 +git clone --depth 1 "${REPO_URL}" "${CLONE_DIR}" +is_curator "${CLONE_DIR}" || { echo "ensure_repo: clone did not produce a valid checkout" >&2; exit 1; } +echo "CURATOR_REPO=$(abspath "${CLONE_DIR}")" diff --git a/.cursor/skills/review-curator-audio-pr/scripts/pull_audio_pr_corpus.sh b/.cursor/skills/review-curator-audio-pr/scripts/pull_audio_pr_corpus.sh new file mode 100755 index 0000000000..50ccb82271 --- /dev/null +++ b/.cursor/skills/review-curator-audio-pr/scripts/pull_audio_pr_corpus.sh @@ -0,0 +1,84 @@ +#!/usr/bin/env bash +# Discover audio-modality PRs opened AFTER a baseline PR (#1608 by default), +# open or closed/merged, and pull each one's reviews + comments into a corpus +# directory. Consolidate with build_corpus.py. +# +# PR numbers are monotonic in time, so "number > --since" == "opened after that +# PR". We then keep only PRs whose changed files touch audio paths. +# +# Usage: pull_audio_pr_corpus.sh [--since N] [--outdir DIR] [--repo OWNER/REPO] [--limit N] +# +# Requires the GitHub CLI (`gh`) authenticated against github.com. +set -euo pipefail + +REPO="NVIDIA-NeMo/Curator" +SINCE=1608 +OUTDIR=".curator-pr-review/audio-corpus" +LIMIT=600 + +while [[ $# -gt 0 ]]; do + case "$1" in + --since) SINCE="$2"; shift 2 ;; + --outdir) OUTDIR="$2"; shift 2 ;; + --repo) REPO="$2"; shift 2 ;; + --limit) LIMIT="$2"; shift 2 ;; + -h|--help) + echo "Usage: pull_audio_pr_corpus.sh [--since N] [--outdir DIR] [--repo OWNER/REPO] [--limit N]"; exit 0 ;; + *) echo "unknown arg: $1" >&2; exit 2 ;; + esac +done + +command -v gh >/dev/null || { echo "error: gh (GitHub CLI) not found" >&2; exit 2; } +mkdir -p "${OUTDIR}" + +# Audio-path test (extended regex over a PR's changed-file list). +AUDIO_RE='^(nemo_curator/stages/audio/|nemo_curator/tasks/audio_task\.py|tutorials/audio/|tests/stages/audio/|tests/tasks/test_audio|benchmarking/.*([Aa]udio|ALM|alm))' + +echo "=== corpus discovery: ${REPO} PRs > #${SINCE} (state=all, limit=${LIMIT}) ===" >&2 +gh pr list --repo "${REPO}" --state all --limit "${LIMIT}" \ + --json number,title,state,author,createdAt,updatedAt,url \ + > "${OUTDIR}/_all_prs.json" + +# Candidate PR numbers (> SINCE), newest first. +mapfile -t CANDIDATES < <(python3 - "${OUTDIR}/_all_prs.json" "${SINCE}" <<'PY' +import json, sys +prs = json.load(open(sys.argv[1])); since = int(sys.argv[2]) +nums = sorted((p["number"] for p in prs if p["number"] > since), reverse=True) +print("\n".join(str(n) for n in nums)) +PY +) +echo "candidates after #${SINCE}: ${#CANDIDATES[@]}" >&2 + +AUDIO_NUMS=() +for n in "${CANDIDATES[@]}"; do + files_json="${OUTDIR}/pr${n}_files.json" + gh api --paginate "repos/${REPO}/pulls/${n}/files" --jq '[.[].filename]' \ + > "${files_json}" 2>/dev/null || { echo " pr${n}: files fetch failed, skip" >&2; continue; } + if AUDIO_RE="${AUDIO_RE}" python3 - "${files_json}" <<'PY' +import json, os, re, sys +files = json.load(open(sys.argv[1])) +rx = re.compile(os.environ["AUDIO_RE"]) +sys.exit(0 if any(rx.search(f) for f in files) else 1) +PY + then + AUDIO_NUMS+=("${n}") + echo " pr${n}: AUDIO" >&2 + else + rm -f "${files_json}" + fi +done + +echo "audio PRs after #${SINCE}: ${#AUDIO_NUMS[@]}" >&2 +printf '%s\n' "${AUDIO_NUMS[@]}" > "${OUTDIR}/_audio_pr_numbers.txt" + +for n in "${AUDIO_NUMS[@]}"; do + echo "--- pulling pr${n} reviews/comments ---" >&2 + gh pr view "${n}" --repo "${REPO}" \ + --json number,title,state,author,createdAt,updatedAt,mergedAt,closedAt,url,body \ + > "${OUTDIR}/pr${n}_gh.json" + gh api --paginate "repos/${REPO}/pulls/${n}/reviews" > "${OUTDIR}/pr${n}_reviews.json" + gh api --paginate "repos/${REPO}/pulls/${n}/comments" > "${OUTDIR}/pr${n}_review_comments.json" + gh api --paginate "repos/${REPO}/issues/${n}/comments" > "${OUTDIR}/pr${n}_issue_comments.json" +done + +echo "AUDIO_PR_CORPUS_PULL_DONE outdir=${OUTDIR} audio_prs=${#AUDIO_NUMS[@]}" >&2 From b24b6472f8de332a5e09b70a5b235543cf7cde4d Mon Sep 17 00:00:00 2001 From: "aaftaabv@gmail.com" <aaftaabv@gmail.com> Date: Fri, 5 Jun 2026 15:13:55 +0530 Subject: [PATCH 05/10] Link published fern audio docs and exact dev-guide slide in knowledge sources Audit gap: the knowledge guide cited READMEs, .cursor/rules, and the slides deck but not the published audio documentation set under fern/versions/v26.04/pages/curate-audio. Add a fern doc map (concept -> doc -> code), wire fern refs into the memory/throughput/tutorial lenses, include the readspeech notebook, and pin the slides URL to the exact slide anchor. Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com> --- .../knowledge-sources.md | 36 ++++++++++++++++--- 1 file changed, 32 insertions(+), 4 deletions(-) diff --git a/.cursor/skills/review-curator-audio-pr/knowledge-sources.md b/.cursor/skills/review-curator-audio-pr/knowledge-sources.md index cb76080537..93690f8587 100644 --- a/.cursor/skills/review-curator-audio-pr/knowledge-sources.md +++ b/.cursor/skills/review-curator-audio-pr/knowledge-sources.md @@ -22,7 +22,7 @@ the GitHub CLI (`gh`). Scope: the **audio** modality. | Source | What it gives you | |--------|-------------------| | `nemo_curator/stages/audio/README.md` | The audio stage **developer guide** (1000+ lines): CPU vs GPU stages, `process` / `process_batch`, `batch_size`, call chains, per-stage memory characteristics, end-to-end FLEURS & ALM traces, new-stage checklist. Read first. | -| Audio curation developer guide (slides): <https://docs.google.com/presentation/d/15lJHyAoRTbNFDWq8UJ6czMaUFck3WYPCRaMYPfO_qew/edit> | Design intent and pipeline overviews behind the audio modality (the deck the README operationalizes). | +| Audio curation developer guide (slides): <https://docs.google.com/presentation/d/15lJHyAoRTbNFDWq8UJ6czMaUFck3WYPCRaMYPfO_qew/edit?slide=id.g3be823d0e3d_0_0#slide=id.g3be823d0e3d_0_0> | Design intent and pipeline overviews behind the audio modality (the deck the README operationalizes). | | `nemo_curator/tasks/audio_task.py` | `AudioTask(Task[dict])` + `_AttrDict` task model (one manifest entry per task; dict keys exposed as attributes for `validate_input`). | | `nemo_curator/stages/base.py` | `ProcessingStage` contract: `inputs()`/`outputs()`, `process`/`process_batch`, `setup_on_node`/`setup`/`teardown`, `validate_input`. | | `CONTRIBUTING.md` | DCO sign-off, PR process, required tests, coverage threshold, copyright header. | @@ -44,6 +44,34 @@ the GitHub CLI (`gh`). Scope: the **audio** modality. | `.cursor/rules/modality-structure.mdc` | Where modality code/tests/tutorials live | | `.cursor/rules/coding-standards.mdc` | Ruff, copyright header, loguru, pytest, Python versions | +### Published audio documentation (fern docs site) + +The user-facing audio docs live under `fern/versions/<ver>/pages/` (latest +versioned snapshot: **v26.04**; `fern/versions/main/` holds in-progress docs). +These are the canonical "what the docs promise" reference - cite them when a PR's +behavior, config, or API drifts from documented behavior. Paths below are +relative to `fern/versions/v26.04/pages/`. + +| Audio concept / stage | Fern doc(s) | Code it documents | +|------------------------|-------------|-------------------| +| Overview / get started | `get-started/audio.mdx`, `curate-audio/index.mdx` | - | +| Concepts (task model, pipelines, metrics) | `about/concepts/audio/{index,audio-task,curation-pipeline,asr-pipeline,alm-pipeline,manifests-ingest,quality-metrics,text-integration}.mdx` | `nemo_curator/tasks/audio_task.py` | +| Load data / manifests | `curate-audio/load-data/{index,fleurs-dataset,custom-manifests,local-files}.mdx` | `nemo_curator/stages/audio/datasets/`, `.../io/` | +| ASR inference | `curate-audio/process-data/asr-inference/{index,nemo-models}.mdx` | `nemo_curator/stages/audio/inference/asr/asr_nemo.py` | +| Audio analysis (duration, format) | `curate-audio/process-data/audio-analysis/{index,duration-calculation,format-validation}.mdx` | `nemo_curator/stages/audio/common.py` | +| Quality filtering (band, VAD, SigMOS, UTMOS, speaker separation, preprocessing, data-filter) | `curate-audio/process-data/quality-filtering/{index,band-filter,vad,sigmos,utmos,speaker-separation,preprocessing,audio-data-filter-stage}.mdx` | `.../filtering/`, `.../segmentation/`, `.../preprocessing/`, `.../advanced_pipelines/` | +| Quality assessment (WER, duration) | `curate-audio/process-data/quality-assessment/{index,wer-filtering,duration-filtering}.mdx` | `nemo_curator/stages/audio/metrics/wer.py`, `.../common.py` | +| ALM (data builder, overlap filtering) | `curate-audio/process-data/alm/{index,data-builder,overlap-filtering}.mdx` | `nemo_curator/stages/audio/alm/` | +| Text integration / tagging | `curate-audio/process-data/text-integration/index.mdx` | `nemo_curator/stages/audio/tagging/` | +| Save / export | `curate-audio/save-export.mdx` | `nemo_curator/stages/audio/io/convert.py` | +| Tutorials | `curate-audio/tutorials/{index,beginner,readspeech,alm}.mdx` | `tutorials/audio/` | +| API reference | `api-reference/tasks/audio-task.mdx`, `api-reference/processing-stage.mdx` | `nemo_curator/tasks/audio_task.py`, `nemo_curator/stages/base.py` | +| Infrastructure (memory, per-stage runtime, backends, GPU, resumable, monitoring) | `reference/infrastructure/{memory-management,per-stage-runtime,execution-backends,gpu-processing,resumable-processing,monitoring}.mdx` | `nemo_curator/backends/` | +| Release notes / migration (`AudioBatch -> AudioTask`, #1608) | `about/release-notes/{index,migration-guide,migration-faq}.mdx` | see PR #1608 | + +When a PR changes audio behavior, check whether the matching fern page above +needs updating in the same PR - docs drift is a common review finding. + --- ## 1. Audio code map (diff ground truth) @@ -152,7 +180,7 @@ canonical sources to cite. - Guard audio I/O edge cases: `tarfile.extractfile()` returns `None` for non-regular members even after `isfile()` - check before `.read()`. Handle resampling / channel-count / dtype mismatches explicitly. -- **References:** `nemo_curator/stages/audio/io/convert.py`; `nemo_curator/stages/audio/io/extract_segments.py`; `nemo_curator/stages/audio/alm/alm_data_builder.py`; `nemo_curator/stages/audio/preprocessing/{mono_conversion,concatenation}.py`; `nemo_curator/tasks/audio_task.py`; README "Memory characteristics". +- **References:** `nemo_curator/stages/audio/io/convert.py`; `nemo_curator/stages/audio/io/extract_segments.py`; `nemo_curator/stages/audio/alm/alm_data_builder.py`; `nemo_curator/stages/audio/preprocessing/{mono_conversion,concatenation}.py`; `nemo_curator/tasks/audio_task.py`; README "Memory characteristics"; fern `reference/infrastructure/{memory-management,per-stage-runtime}.mdx`. ### 2.5 Tarred / sharded audio I/O @@ -187,7 +215,7 @@ canonical sources to cite. totals/averages. - Standalone audio benchmark scripts belong in the `benchmarking/` flow with config entries and comparable parameters - not freestanding. -- **References:** `nemo_curator/stages/audio/metrics/{wer,bandwidth,squim}.py`; `benchmarking/AUDIO_PROFILING.md`, `benchmarking/ALM_BENCHMARK.md`, `benchmarking/README.md`; rules `executors.mdc`. +- **References:** `nemo_curator/stages/audio/metrics/{wer,bandwidth,squim}.py`; `benchmarking/AUDIO_PROFILING.md`, `benchmarking/ALM_BENCHMARK.md`, `benchmarking/README.md`; rules `executors.mdc`; fern `reference/infrastructure/{execution-backends,per-stage-runtime,monitoring,gpu-processing}.mdx`. ### 2.8 Tutorials and docs @@ -199,7 +227,7 @@ canonical sources to cite. - Every config key documented for users must actually reach a stage; validate config-to-stage mapping and fail (or warn) on unused keys. - Don't commit generated analysis/scratch docs into the source tree. -- **References:** `tutorials/audio/README.md`; `tutorials/audio/{alm,audio_pretrain,callhome_diar,fleurs,readspeech,single_speaker_filter,tagging}/README.md`; `nemo_curator/stages/audio/README.md`. +- **References:** `tutorials/audio/README.md`; `tutorials/audio/{alm,audio_pretrain,callhome_diar,fleurs,readspeech,single_speaker_filter,tagging}/README.md`; `tutorials/audio/readspeech/readspeech_tutorial.ipynb`; `nemo_curator/stages/audio/README.md`; fern `curate-audio/tutorials/{index,beginner,readspeech,alm}.mdx`. ### 2.9 Tests and standards (CONTRIBUTING + coding-standards) From 3858d0eaa9e19768cb734371227c3b69a6463dc8 Mon Sep 17 00:00:00 2001 From: "aaftaabv@gmail.com" <aaftaabv@gmail.com> Date: Fri, 5 Jun 2026 15:21:44 +0530 Subject: [PATCH 06/10] Lead the review with a detailed PR overview before any findings build_digest.py now emits a What this PR does (overview) section first: author description, an areas-touched table, and a plain-language summary placeholder, ahead of PR state, existing reviews/comments, and findings. SKILL.md and templates.md require presenting this overview before any review comments (new Step 6). Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com> --- .../skills/review-curator-audio-pr/SKILL.md | 37 +++++++++----- .../scripts/build_digest.py | 49 +++++++++++++++++++ .../review-curator-audio-pr/templates.md | 29 +++++++++-- 3 files changed, 99 insertions(+), 16 deletions(-) diff --git a/.cursor/skills/review-curator-audio-pr/SKILL.md b/.cursor/skills/review-curator-audio-pr/SKILL.md index 30078000b0..f308b2ef30 100644 --- a/.cursor/skills/review-curator-audio-pr/SKILL.md +++ b/.cursor/skills/review-curator-audio-pr/SKILL.md @@ -47,16 +47,21 @@ with *"build the post-#1608 audio review corpus first, then review PR <N>"* ## What you produce -Your review = a set of **findings** (P0-P3), each tied to a `path:line` on the -PR's current head, that you post as PR review comments. The skill writes two -helper files into a scratch directory (default `.curator-pr-review/`, override -with `--outdir`) to support that: - -1. `curator_pr<N>_fresh_review_<YYYY_MM_DD>.md` - **your working digest**: PR +You produce two things, **in this order**: (1) a **detailed overview of what the +PR does and why** - so the reader understands the change before any critique - +then (2) your **findings** (P0-P3), each tied to a `path:line` on the PR's +current head, that you post as PR review comments. Never lead with review +comments; always explain the PR first. + +The skill writes two helper files into a scratch directory (default +`.curator-pr-review/`, override with `--outdir`) to support that: + +1. `curator_pr<N>_fresh_review_<YYYY_MM_DD>.md` - **your working digest**: leads + with a **"What this PR does (overview)"** section (the author's description, + an areas-touched table, and a plain-language summary you write), then PR state, commits, changed-file table, and every existing review/comment grouped - by file with OPEN / OUTDATED / RESOLVED status, plus a placeholder for your - findings. Read it to understand the PR and to see what other reviewers already - raised so you don't repeat them; record your findings at the bottom. + by file with OPEN / OUTDATED / RESOLVED status, then a placeholder for your + findings. Fill in the overview first, then record your findings at the bottom. 2. `curator_pr<N>_github_comment_queue_<YYYY_MM_DD>.md` - **prior open threads**: a condensed list of the threads other reviewers left that are still unresolved on the current head. Scan it before writing your own comments. @@ -74,7 +79,8 @@ review comments. The review lenses and every doc/code reference live in - [ ] 3. Read the diff - [ ] 4. Review the changed code through the audio review lenses - [ ] 5. Generate the digest + prior-threads file for context -- [ ] 6. Write up your findings (P0-P3) and post them as review comments +- [ ] 6. Write the detailed PR overview (present this first) +- [ ] 7. Write up your findings (P0-P3) and post them after the overview ``` ### Step 0 - Locate or shallow-clone the repo @@ -140,7 +146,16 @@ status: **OPEN** (still unresolved), **OUTDATED** (pre-dates the current head), **RESOLVED**. Use the OPEN list to avoid re-raising points another reviewer already made. -### Step 6 - Write up and post your findings +### Step 6 - Summarize the PR in detail (present this first) + +Before any critique, write the **"What this PR does (overview)"** section of the +digest so a reader understands the change end to end: the problem it solves; the +main audio stages/files it adds or modifies (use the "Areas touched" table the +builder generates); key design decisions; new dependencies, config, or APIs; and +the blast radius (what could regress). Lead your review with this overview - the +author and other reviewers read it before any findings. + +### Step 7 - Write up and post your findings Record each issue as a finding, classified by severity, and post them as a PR review (inline comments for specific lines, a top-level summary for the overall diff --git a/.cursor/skills/review-curator-audio-pr/scripts/build_digest.py b/.cursor/skills/review-curator-audio-pr/scripts/build_digest.py index 8c129103e1..923323d102 100755 --- a/.cursor/skills/review-curator-audio-pr/scripts/build_digest.py +++ b/.cursor/skills/review-curator-audio-pr/scripts/build_digest.py @@ -91,6 +91,24 @@ def p(parts, *lines): parts.extend(lines) +def area_of(path): + """Group a changed file into a coarse audio area for the overview table.""" + parts = path.split("/") + if path.startswith("nemo_curator/stages/audio/") and len(parts) >= 4: + return f"stages/audio/{parts[3]}" + if path.startswith("nemo_curator/tasks/"): + return "tasks" + if path.startswith("tutorials/audio/") and len(parts) >= 3: + return f"tutorials/audio/{parts[2]}" + if path.startswith("tests/"): + return "tests" + if path.startswith("benchmarking/"): + return "benchmarking" + if path.startswith("fern/"): + return "docs (fern)" + return parts[0] if parts else path + + def build(pr, outdir, today, baseline_ts, prev_head): def latest(kind): return outdir / f"pr{pr}_{kind}_latest.json" @@ -130,6 +148,37 @@ def base(kind): p(d, f"Previous reviewed head: `{prev_head}`", "") p(d, "") + # ----- PR OVERVIEW (what the PR is about; present BEFORE any review) ----- + author = (gh.get("author") or {}).get("login", "?") + p(d, "## What this PR does (overview)", "") + p(d, f"**{gh.get('title','')}** - by @{author} " + f"({gh.get('state')}, +{gh.get('additions')}/-{gh.get('deletions')} " + f"across {gh.get('changedFiles')} files, {len(commits)} commits)", "") + + p(d, "### Author's description", "") + pr_body = (gh.get("body") or "").strip() + p(d, pr_body if pr_body else "_(no PR description provided)_", "") + + p(d, "### Areas touched", "") + areas = {} + for f in files: + e = areas.setdefault(area_of(f["filename"]), {"n": 0, "add": 0, "dele": 0}) + e["n"] += 1 + e["add"] += f.get("additions", 0) + e["dele"] += f.get("deletions", 0) + p(d, "| Area | files | +/- |", "|---|---|---|") + for a in sorted(areas): + e = areas[a] + p(d, f"| {a} | {e['n']} | +{e['add']} -{e['dele']} |") + p(d, "") + + p(d, "### Plain-language summary (reviewer writes this BEFORE any findings)", "") + p(d, "_In your own words, explain in detail what this PR changes and why: the " + "problem it solves, the main audio stages/files it adds or modifies, key " + "design decisions, new dependencies/config/APIs, and the blast radius " + "(what could regress). Present this overview first; only then move to the " + "findings below._", "") + p(d, "## PR state at review time", "") p(d, "| Field | Value |", "|---|---|") p(d, f"| state | {gh.get('state')} |") diff --git a/.cursor/skills/review-curator-audio-pr/templates.md b/.cursor/skills/review-curator-audio-pr/templates.md index f789a09aad..3e0d1b6669 100644 --- a/.cursor/skills/review-curator-audio-pr/templates.md +++ b/.cursor/skills/review-curator-audio-pr/templates.md @@ -2,14 +2,17 @@ You are the reviewer. `scripts/build_digest.py` generates files A and B as context; you author the findings (section C) and post them as your review. -Replace `<...>` placeholders. Keep the structure stable across review passes so -a re-review diffs cleanly and only surfaces what changed. +The digest **leads with a PR overview** that you must complete and present +before any review comments. Replace `<...>` placeholders. Keep the structure +stable across review passes so a re-review diffs cleanly and only surfaces what +changed. ## A. Working digest (generated, then you append findings) -File: `curator_pr<N>_fresh_review_<YYYY_MM_DD>.md`. The generated part gives you -PR state, the diff, and all existing review activity. Add your own **Findings** -section at the end - that is your review. +File: `curator_pr<N>_fresh_review_<YYYY_MM_DD>.md`. It **leads with a PR +overview** (what the change does and why), then PR state, the diff, and all +existing review activity. Complete the overview's plain-language summary first +and present it before any findings; add your **Findings** section at the end. ```markdown # Curator PR <N> Fresh Review - <YYYY-MM-DD> @@ -20,6 +23,22 @@ Current PR head reviewed: `<headRefOid>` Base recorded by GitHub metadata: `<baseRefOid>` +## What this PR does (overview) + +**<title>** - by @<author> (<state>, +<a>/-<d> across <n> files, <k> commits) + +### Author's description +<verbatim PR body> + +### Areas touched +| Area | files | +/- | +|---|---|---| +| stages/audio/<area> | <n> | +<a> -<d> | + +### Plain-language summary (write this BEFORE any findings) +<In your own words: what the PR changes and why, the main stages/files, key +design decisions, new deps/config/APIs, and blast radius.> + ## PR state at review time | Field | Value | From 209332fe93bff9bc021c1d2bee4f86a7a61546c6 Mon Sep 17 00:00:00 2001 From: "aaftaabv@gmail.com" <aaftaabv@gmail.com> Date: Fri, 5 Jun 2026 15:24:16 +0530 Subject: [PATCH 07/10] Order workflow so the PR overview is written after reading knowledge sources Make reading knowledge-sources.md an explicit Step 4 (canonical docs + code map + lenses) that precedes the detailed PR overview (Step 6); lens application moves into the findings step (Step 7). Overview placeholders in build_digest.py and templates.md note the summary is written after grounding in the knowledge sources. Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com> --- .../skills/review-curator-audio-pr/SKILL.md | 74 ++++++++++++------- .../scripts/build_digest.py | 3 +- .../review-curator-audio-pr/templates.md | 5 +- 3 files changed, 52 insertions(+), 30 deletions(-) diff --git a/.cursor/skills/review-curator-audio-pr/SKILL.md b/.cursor/skills/review-curator-audio-pr/SKILL.md index f308b2ef30..46430b5fd2 100644 --- a/.cursor/skills/review-curator-audio-pr/SKILL.md +++ b/.cursor/skills/review-curator-audio-pr/SKILL.md @@ -34,10 +34,11 @@ you already have and only shallow-clones A human reviewer invokes this skill with a prompt like: > **"Review audio Curator PR <N> with the review-curator-audio-pr skill. Find or -> shallow-clone the repo, pull the PR and prior review activity, apply the audio -> review lenses, and give me a review digest plus my own findings (P0-P3) with -> exact `path:line` evidence and concrete fixes, ready to post as PR comments. -> Skip anything other reviewers already raised."** +> shallow-clone the repo, pull the PR and prior review activity, read the +> knowledge sources, then give me a detailed overview of what the PR does +> followed by my own findings (P0-P3) with exact `path:line` evidence and +> concrete fixes, ready to post as PR comments. Skip anything other reviewers +> already raised."** Variants: *"re-review PR <N> and show only what changed since the last review"*, *"what should I flag on the diarization PR <N>?"*, *"triage open audio PRs and @@ -48,10 +49,12 @@ with *"build the post-#1608 audio review corpus first, then review PR <N>"* ## What you produce You produce two things, **in this order**: (1) a **detailed overview of what the -PR does and why** - so the reader understands the change before any critique - -then (2) your **findings** (P0-P3), each tied to a `path:line` on the PR's -current head, that you post as PR review comments. Never lead with review -comments; always explain the PR first. +PR does and why** - written only after you have read the knowledge sources (the +canonical docs and review lenses), so the reader understands the change before +any critique - then (2) your **findings** (P0-P3), each tied to a `path:line` on +the PR's current head, that you post as PR review comments. Never lead with +review comments; ground yourself in the knowledge sources, then explain the PR, +then give findings. The skill writes two helper files into a scratch directory (default `.curator-pr-review/`, override with `--outdir`) to support that: @@ -77,10 +80,10 @@ review comments. The review lenses and every doc/code reference live in - [ ] 1. Identify the PR and confirm it touches audio paths - [ ] 2. Pull fresh GitHub data (gh) - [ ] 3. Read the diff -- [ ] 4. Review the changed code through the audio review lenses +- [ ] 4. Read the knowledge sources (canonical docs + review lenses) to ground yourself - [ ] 5. Generate the digest + prior-threads file for context -- [ ] 6. Write the detailed PR overview (present this first) -- [ ] 7. Write up your findings (P0-P3) and post them after the overview +- [ ] 6. Write the detailed PR overview - only after steps 3-4 - present this first +- [ ] 7. Review through the lenses, write up findings (P0-P3), and post them after the overview ``` ### Step 0 - Locate or shallow-clone the repo @@ -125,15 +128,22 @@ gh pr diff <N> --repo NVIDIA-NeMo/Curator # works on a shallow clone shallow clone from step 0. Always review the actual diff, never the comment text alone. Use `main` as the baseline for "how the audio stages already do this". -### Step 4 - Review through the audio lenses +### Step 4 - Read the knowledge sources (ground yourself first) -Apply the lenses in [knowledge-sources.md](knowledge-sources.md) section 2 - -each lens links the audio code, README section, and `.cursor/rules` contract it -governs: stage contracts, setup/teardown lifecycle, audio optional-dependency -hygiene (vLLM, NeMo, model utils), secret-safe logging, waveform/tensor memory -and manifest serialization, tarred/sharded audio I/O, sample-rate and metadata -propagation, streaming/throughput, tutorials/docs, tests/coverage, and PR -reviewability. Cite the linked source by name whenever the PR deviates. +Before you explain or judge anything, read +[knowledge-sources.md](knowledge-sources.md) so your overview and findings are +grounded in the canonical material, not guesswork: + +- **section 0** - canonical docs: the audio stage developer guide + (`nemo_curator/stages/audio/README.md`), the developer-guide slides, the + published fern audio docs, `.cursor/rules/*.mdc`, and `CONTRIBUTING.md`. +- **section 1** - the audio code map, so you know where each changed file sits. +- **section 2** - the review lenses (you apply these in step 7). + +Open the specific sources the diff touches (e.g. the fern page and the code for +the stage being changed). First time reviewing audio in this repo, optionally +build the post-#1608 reviewer-comment corpus (section 4) to see what reviewers +repeatedly raise. Only after this grounding do you write the overview (step 6). ### Step 5 - Generate the context files @@ -148,19 +158,29 @@ already made. ### Step 6 - Summarize the PR in detail (present this first) -Before any critique, write the **"What this PR does (overview)"** section of the -digest so a reader understands the change end to end: the problem it solves; the +Now that you have read the diff (step 3) and the knowledge sources (step 4), +write the **"What this PR does (overview)"** section of the digest so a reader +understands the change end to end: the problem it solves; the main audio stages/files it adds or modifies (use the "Areas touched" table the builder generates); key design decisions; new dependencies, config, or APIs; and the blast radius (what could regress). Lead your review with this overview - the author and other reviewers read it before any findings. -### Step 7 - Write up and post your findings - -Record each issue as a finding, classified by severity, and post them as a PR -review (inline comments for specific lines, a top-level summary for the overall -verdict). Each finding cites the exact `path:line-range` on the current head and -proposes a concrete fix. +### Step 7 - Review through the lenses, then write up and post your findings + +Apply the lenses in [knowledge-sources.md](knowledge-sources.md) section 2 to the +diff - each lens links the audio code, README section, and `.cursor/rules` +contract it governs: stage contracts, setup/teardown lifecycle, audio +optional-dependency hygiene (vLLM, NeMo, model utils), secret-safe logging, +waveform/tensor memory and manifest serialization, tarred/sharded audio I/O, +sample-rate and metadata propagation, streaming/throughput, tutorials/docs, +tests/coverage, and PR reviewability. Cite the linked source by name whenever +the PR deviates. + +Record each issue as a finding, classified by severity, and post them **after +the overview** as a PR review (inline comments for specific lines, a top-level +summary for the overall verdict). Each finding cites the exact `path:line-range` +on the current head and proposes a concrete fix. - **P0** - merge blocker: data loss / crash on a valid audio config (e.g. a manifest writer that crashes on a kept waveform), a secret (HF token) leaked diff --git a/.cursor/skills/review-curator-audio-pr/scripts/build_digest.py b/.cursor/skills/review-curator-audio-pr/scripts/build_digest.py index 923323d102..64ccba1fe0 100755 --- a/.cursor/skills/review-curator-audio-pr/scripts/build_digest.py +++ b/.cursor/skills/review-curator-audio-pr/scripts/build_digest.py @@ -173,7 +173,8 @@ def base(kind): p(d, "") p(d, "### Plain-language summary (reviewer writes this BEFORE any findings)", "") - p(d, "_In your own words, explain in detail what this PR changes and why: the " + p(d, "_After reading the knowledge sources (knowledge-sources.md), explain in " + "detail, in your own words, what this PR changes and why: the " "problem it solves, the main audio stages/files it adds or modifies, key " "design decisions, new dependencies/config/APIs, and the blast radius " "(what could regress). Present this overview first; only then move to the " diff --git a/.cursor/skills/review-curator-audio-pr/templates.md b/.cursor/skills/review-curator-audio-pr/templates.md index 3e0d1b6669..a9cc9548f9 100644 --- a/.cursor/skills/review-curator-audio-pr/templates.md +++ b/.cursor/skills/review-curator-audio-pr/templates.md @@ -36,8 +36,9 @@ Base recorded by GitHub metadata: `<baseRefOid>` | stages/audio/<area> | <n> | +<a> -<d> | ### Plain-language summary (write this BEFORE any findings) -<In your own words: what the PR changes and why, the main stages/files, key -design decisions, new deps/config/APIs, and blast radius.> +<In your own words, after reading the knowledge sources: what the PR changes and +why, the main stages/files, key design decisions, new deps/config/APIs, and +blast radius.> ## PR state at review time From f8ae3bd3b43c03e390ececdb7dfccd71d1794a80 Mon Sep 17 00:00:00 2001 From: "aaftaabv@gmail.com" <aaftaabv@gmail.com> Date: Tue, 9 Jun 2026 17:57:28 +0530 Subject: [PATCH 08/10] review-curator-audio-pr: audio-only enforcement, incremental corpus pulls, doc cosmetics Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com> --- .../skills/review-curator-audio-pr/SKILL.md | 18 ++++--- .../knowledge-sources.md | 9 ++-- .../review-curator-audio-pr/scripts/README.md | 15 ++++-- .../scripts/build_digest.py | 16 ++++++ .../scripts/pr_review_pull.sh | 21 +++++++- .../scripts/pull_audio_pr_corpus.sh | 50 ++++++++++++++++--- .../review-curator-audio-pr/templates.md | 16 ++++-- 7 files changed, 117 insertions(+), 28 deletions(-) diff --git a/.cursor/skills/review-curator-audio-pr/SKILL.md b/.cursor/skills/review-curator-audio-pr/SKILL.md index 46430b5fd2..241a61a29e 100644 --- a/.cursor/skills/review-curator-audio-pr/SKILL.md +++ b/.cursor/skills/review-curator-audio-pr/SKILL.md @@ -19,10 +19,11 @@ Curator PR.** Given a PR number, it helps you understand the change, see what other reviewers already said, and write your own review. It does not help a PR author reply to reviewers - it helps you *produce* the review. -Scope: the **audio** modality - code under `nemo_curator/stages/audio/`, -`tutorials/audio/`, audio tests under `tests/stages/audio/`, and audio -benchmarks. For non-audio PRs the stage-contract lenses still apply, but the -audio-specific guidance below will not. +Scope: the **audio** modality only - code under `nemo_curator/stages/audio/`, +`nemo_curator/tasks/audio_task.py`, `tutorials/audio/`, audio tests under +`tests/stages/audio/`, and audio benchmarks. This skill is **audio-only**: if a +PR touches no audio path, the pull step refuses to run (step 2) - review it +with a modality-appropriate tool instead. Requirements: the GitHub CLI (`gh`) authenticated against `github.com`, and `git`. You do **not** need to pre-clone the repo - step 0 reuses any checkout @@ -101,10 +102,11 @@ If you are already in the checkout that contains this skill, you can skip this. ### Step 1 - Identify the PR -`gh pr view <N> --repo NVIDIA-NeMo/Curator`. Confirm the diff touches audio -paths (`nemo_curator/stages/audio/`, `nemo_curator/tasks/audio_task.py`, -`tutorials/audio/`, `tests/stages/audio/`, audio benchmarks); if it does not, -this audio skill is the wrong lens. +`gh pr view <N> --repo NVIDIA-NeMo/Curator`. This skill is **audio-only**: the +pull step (step 2) aborts if the PR touches no audio path +(`nemo_curator/stages/audio/`, `nemo_curator/tasks/audio_task.py`, +`tutorials/audio/`, `tests/stages/audio/`, audio benchmarks). If it aborts, the +PR is out of scope for this skill, so stop and do not review it here. ### Step 2 - Pull fresh GitHub data diff --git a/.cursor/skills/review-curator-audio-pr/knowledge-sources.md b/.cursor/skills/review-curator-audio-pr/knowledge-sources.md index 93690f8587..e2ef795727 100644 --- a/.cursor/skills/review-curator-audio-pr/knowledge-sources.md +++ b/.cursor/skills/review-curator-audio-pr/knowledge-sources.md @@ -291,7 +291,9 @@ consolidated corpus of that feedback: `pull_audio_pr_corpus.sh` lists every PR with number > `--since` (PR numbers are monotonic in time, so number > 1608 == opened after #1608), keeps the ones whose changed files touch audio paths, and pulls each one's reviews, inline comments, -and issue comments. `build_corpus.py` writes +and issue comments. It is incremental - reruns skip PRs already on disk and only +fetch new ones (use `--refresh` to re-pull, e.g. to refresh open PRs). +`build_corpus.py` writes `.curator-pr-review/audio-corpus/audio_pr_corpus_<date>.md`: one section per audio PR (number, title, state, author, link) with every reviewer comment verbatim, anchored to `path:line`, plus a recurring-themes tally. @@ -330,8 +332,9 @@ thread dump lacks `databaseId`. renders the working digest + the prior-open-threads context file (you add your own findings). -`scripts/pull_audio_pr_corpus.sh [--since N] [--outdir DIR] [--repo OWNER/REPO] [--limit N]` -and `scripts/build_corpus.py [--outdir DIR] [--today YYYY-MM-DD]` build the +`scripts/pull_audio_pr_corpus.sh [--since N] [--outdir DIR] [--repo OWNER/REPO] [--limit N] [--refresh]` +(incremental - skips PRs already on disk; `--refresh` re-pulls) and +`scripts/build_corpus.py [--outdir DIR] [--today YYYY-MM-DD]` build the post-#1608 corpus (section 4). Default outdir: `.curator-pr-review/` (scratch; gitignored, safe to delete). diff --git a/.cursor/skills/review-curator-audio-pr/scripts/README.md b/.cursor/skills/review-curator-audio-pr/scripts/README.md index 9b96ff1a68..e210b0c062 100644 --- a/.cursor/skills/review-curator-audio-pr/scripts/README.md +++ b/.cursor/skills/review-curator-audio-pr/scripts/README.md @@ -27,6 +27,8 @@ history) from <https://github.com/NVIDIA-NeMo/Curator>. Prints Pulls six REST endpoints (`pr view`, `reviews`, inline `comments`, issue `comments`, `files`, `commits`) plus the GraphQL review threads into `pr<N>_*_latest.json` (and timestamped snapshots for delta analysis). +**Audio-only:** it pulls metadata + files first and aborts (exit 3) if the PR +touches no audio path, so the skill never reviews a non-audio PR. ## build_digest.py @@ -46,7 +48,7 @@ still-open threads other reviewers left, so you don't duplicate them). The thread join uses comment `databaseId` when present, else a (path, body-prefix) fallback, so both GraphQL thread-dump shapes classify -correctly. +correctly. It also re-checks the audio-only guard and aborts on a non-audio PR. ## pull_audio_pr_corpus.sh + build_corpus.py (pre-review learning) @@ -58,10 +60,15 @@ correctly. `pull_audio_pr_corpus.sh` lists every PR with number > `--since` (PR numbers are monotonic in time, so this is "opened after that PR"), keeps the ones touching audio paths, and pulls each one's reviews + inline + issue comments into -`.curator-pr-review/audio-corpus/`. `build_corpus.py` consolidates them into +`.curator-pr-review/audio-corpus/`. It is **incremental**: PRs already pulled are +skipped (and non-audio candidates are remembered in `_non_audio_prs.txt`), so a +rerun only fetches PRs new since last time. Pass `--refresh` to re-pull +everything (e.g. to pick up new comments on still-open PRs). +`build_corpus.py` consolidates them into `audio_pr_corpus_<date>.md`: one section per audio PR with every reviewer comment verbatim (anchored to path:line) plus a recurring-themes tally. Read-only context; it posts nothing. See `../knowledge-sources.md` section 4. -The single-PR scripts fetch any PR; the audio focus is applied by the review -lenses in `../knowledge-sources.md`. +The single-PR scripts are **audio-only**: both `pr_review_pull.sh` and +`build_digest.py` abort on a PR that touches no audio path. The audio review +lenses live in `../knowledge-sources.md`. diff --git a/.cursor/skills/review-curator-audio-pr/scripts/build_digest.py b/.cursor/skills/review-curator-audio-pr/scripts/build_digest.py index 64ccba1fe0..7b920acd18 100755 --- a/.cursor/skills/review-curator-audio-pr/scripts/build_digest.py +++ b/.cursor/skills/review-curator-audio-pr/scripts/build_digest.py @@ -19,10 +19,18 @@ import argparse import datetime as dt import json +import re from pathlib import Path BODY_KEY_LEN = 120 +# This skill is audio-only; a PR must touch at least one of these paths. +AUDIO_RE = re.compile( + r"^(nemo_curator/stages/audio/|nemo_curator/tasks/audio_task\.py|" + r"tutorials/audio/|tests/stages/audio/|tests/tasks/test_audio|" + r"benchmarking/.*([Aa]udio|ALM|alm))" +) + def load(path: Path) -> object: return json.loads(path.read_text()) @@ -119,6 +127,14 @@ def latest(kind): issue_comments = load(latest("issue_comments")) files = load(latest("files")) commits = load(latest("commits")) + + if files and not any(AUDIO_RE.search(f.get("filename", "")) for f in files): + raise SystemExit( + f"PR {pr} touches no audio path; review-curator-audio-pr is audio-only. " + "Aborting (audio paths: nemo_curator/stages/audio/, " + "nemo_curator/tasks/audio_task.py, tutorials/audio/, " + "tests/stages/audio/, audio benchmarks)." + ) threads_path = latest("review_threads") idx = build_thread_index(load(threads_path)) if threads_path.exists() else build_thread_index({}) diff --git a/.cursor/skills/review-curator-audio-pr/scripts/pr_review_pull.sh b/.cursor/skills/review-curator-audio-pr/scripts/pr_review_pull.sh index f14f649ed3..9092688f89 100755 --- a/.cursor/skills/review-curator-audio-pr/scripts/pr_review_pull.sh +++ b/.cursor/skills/review-curator-audio-pr/scripts/pr_review_pull.sh @@ -48,14 +48,31 @@ GH_FIELDS="number,title,state,isDraft,mergeable,mergeStateStatus,headRefName,hea pull_endpoint "pr ${PR} metadata" "${OUTDIR}/pr${PR}_gh_${TS}.json" \ gh pr view "${PR}" --repo "${REPO}" --json "${GH_FIELDS}" +pull_endpoint "pulls/${PR}/files" "${OUTDIR}/pr${PR}_files_${TS}.json" \ + gh api --paginate "repos/${REPO}/pulls/${PR}/files" + +# This skill is audio-only: abort before pulling anything else if the PR's +# changed files touch no audio path. +AUDIO_RE='^(nemo_curator/stages/audio/|nemo_curator/tasks/audio_task\.py|tutorials/audio/|tests/stages/audio/|tests/tasks/test_audio|benchmarking/.*([Aa]udio|ALM|alm))' +if ! AUDIO_RE="${AUDIO_RE}" python3 - "${OUTDIR}/pr${PR}_files_${TS}.json" <<'PY' +import json, os, re, sys +files = [f.get("filename", "") for f in json.load(open(sys.argv[1]))] +rx = re.compile(os.environ["AUDIO_RE"]) +sys.exit(0 if any(rx.search(f) for f in files) else 1) +PY +then + echo "error: PR ${PR} touches no audio path; review-curator-audio-pr is audio-only." >&2 + echo " audio paths: nemo_curator/stages/audio/, nemo_curator/tasks/audio_task.py," >&2 + echo " tutorials/audio/, tests/stages/audio/, audio benchmarks. Aborting." >&2 + exit 3 +fi + pull_endpoint "pulls/${PR}/reviews" "${OUTDIR}/pr${PR}_reviews_${TS}.json" \ gh api --paginate "repos/${REPO}/pulls/${PR}/reviews" pull_endpoint "pulls/${PR}/comments (inline)" "${OUTDIR}/pr${PR}_review_comments_${TS}.json" \ gh api --paginate "repos/${REPO}/pulls/${PR}/comments" pull_endpoint "issues/${PR}/comments (top-level)" "${OUTDIR}/pr${PR}_issue_comments_${TS}.json" \ gh api --paginate "repos/${REPO}/issues/${PR}/comments" -pull_endpoint "pulls/${PR}/files" "${OUTDIR}/pr${PR}_files_${TS}.json" \ - gh api --paginate "repos/${REPO}/pulls/${PR}/files" pull_endpoint "pulls/${PR}/commits" "${OUTDIR}/pr${PR}_commits_${TS}.json" \ gh api --paginate "repos/${REPO}/pulls/${PR}/commits" diff --git a/.cursor/skills/review-curator-audio-pr/scripts/pull_audio_pr_corpus.sh b/.cursor/skills/review-curator-audio-pr/scripts/pull_audio_pr_corpus.sh index 50ccb82271..a88cdb8659 100755 --- a/.cursor/skills/review-curator-audio-pr/scripts/pull_audio_pr_corpus.sh +++ b/.cursor/skills/review-curator-audio-pr/scripts/pull_audio_pr_corpus.sh @@ -6,7 +6,11 @@ # PR numbers are monotonic in time, so "number > --since" == "opened after that # PR". We then keep only PRs whose changed files touch audio paths. # -# Usage: pull_audio_pr_corpus.sh [--since N] [--outdir DIR] [--repo OWNER/REPO] [--limit N] +# Incremental by default: PRs already pulled into OUTDIR are skipped, so reruns +# only fetch PRs that are new since last time. Pass --refresh to re-pull +# everything (e.g. to pick up new comments on open PRs). +# +# Usage: pull_audio_pr_corpus.sh [--since N] [--outdir DIR] [--repo OWNER/REPO] [--limit N] [--refresh] # # Requires the GitHub CLI (`gh`) authenticated against github.com. set -euo pipefail @@ -15,15 +19,17 @@ REPO="NVIDIA-NeMo/Curator" SINCE=1608 OUTDIR=".curator-pr-review/audio-corpus" LIMIT=600 +REFRESH=0 while [[ $# -gt 0 ]]; do case "$1" in - --since) SINCE="$2"; shift 2 ;; - --outdir) OUTDIR="$2"; shift 2 ;; - --repo) REPO="$2"; shift 2 ;; - --limit) LIMIT="$2"; shift 2 ;; + --since) SINCE="$2"; shift 2 ;; + --outdir) OUTDIR="$2"; shift 2 ;; + --repo) REPO="$2"; shift 2 ;; + --limit) LIMIT="$2"; shift 2 ;; + --refresh) REFRESH=1; shift ;; -h|--help) - echo "Usage: pull_audio_pr_corpus.sh [--since N] [--outdir DIR] [--repo OWNER/REPO] [--limit N]"; exit 0 ;; + echo "Usage: pull_audio_pr_corpus.sh [--since N] [--outdir DIR] [--repo OWNER/REPO] [--limit N] [--refresh]"; exit 0 ;; *) echo "unknown arg: $1" >&2; exit 2 ;; esac done @@ -31,6 +37,13 @@ done command -v gh >/dev/null || { echo "error: gh (GitHub CLI) not found" >&2; exit 2; } mkdir -p "${OUTDIR}" +# Cache of candidate PRs already classified as non-audio, so reruns don't +# re-fetch their file lists. Audio PRs are remembered by their pr<N>_gh.json. +NONAUDIO_CACHE="${OUTDIR}/_non_audio_prs.txt" +touch "${NONAUDIO_CACHE}" +declare -A IS_NONAUDIO=() +while IFS= read -r x; do [[ -n "${x}" ]] && IS_NONAUDIO["${x}"]=1; done < "${NONAUDIO_CACHE}" + # Audio-path test (extended regex over a PR's changed-file list). AUDIO_RE='^(nemo_curator/stages/audio/|nemo_curator/tasks/audio_task\.py|tutorials/audio/|tests/stages/audio/|tests/tasks/test_audio|benchmarking/.*([Aa]udio|ALM|alm))' @@ -51,6 +64,15 @@ echo "candidates after #${SINCE}: ${#CANDIDATES[@]}" >&2 AUDIO_NUMS=() for n in "${CANDIDATES[@]}"; do + # Incremental: reuse prior classification unless --refresh. + if [[ ${REFRESH} -eq 0 ]]; then + if [[ -f "${OUTDIR}/pr${n}_gh.json" ]]; then + AUDIO_NUMS+=("${n}"); echo " pr${n}: AUDIO (on disk, skip)" >&2; continue + fi + if [[ -n "${IS_NONAUDIO[$n]:-}" ]]; then + echo " pr${n}: non-audio (cached, skip)" >&2; continue + fi + fi files_json="${OUTDIR}/pr${n}_files.json" gh api --paginate "repos/${REPO}/pulls/${n}/files" --jq '[.[].filename]' \ > "${files_json}" 2>/dev/null || { echo " pr${n}: files fetch failed, skip" >&2; continue; } @@ -65,13 +87,26 @@ PY echo " pr${n}: AUDIO" >&2 else rm -f "${files_json}" + if [[ -z "${IS_NONAUDIO[$n]:-}" ]]; then + echo "${n}" >> "${NONAUDIO_CACHE}"; IS_NONAUDIO["${n}"]=1 + fi fi done echo "audio PRs after #${SINCE}: ${#AUDIO_NUMS[@]}" >&2 printf '%s\n' "${AUDIO_NUMS[@]}" > "${OUTDIR}/_audio_pr_numbers.txt" +pulled=0 +skipped=0 for n in "${AUDIO_NUMS[@]}"; do + # Incremental: skip PRs whose data is already on disk unless --refresh. + if [[ ${REFRESH} -eq 0 && -f "${OUTDIR}/pr${n}_gh.json" \ + && -f "${OUTDIR}/pr${n}_reviews.json" \ + && -f "${OUTDIR}/pr${n}_review_comments.json" \ + && -f "${OUTDIR}/pr${n}_issue_comments.json" ]]; then + echo "--- pr${n}: already on disk, skip (use --refresh to re-pull) ---" >&2 + skipped=$((skipped + 1)); continue + fi echo "--- pulling pr${n} reviews/comments ---" >&2 gh pr view "${n}" --repo "${REPO}" \ --json number,title,state,author,createdAt,updatedAt,mergedAt,closedAt,url,body \ @@ -79,6 +114,7 @@ for n in "${AUDIO_NUMS[@]}"; do gh api --paginate "repos/${REPO}/pulls/${n}/reviews" > "${OUTDIR}/pr${n}_reviews.json" gh api --paginate "repos/${REPO}/pulls/${n}/comments" > "${OUTDIR}/pr${n}_review_comments.json" gh api --paginate "repos/${REPO}/issues/${n}/comments" > "${OUTDIR}/pr${n}_issue_comments.json" + pulled=$((pulled + 1)) done -echo "AUDIO_PR_CORPUS_PULL_DONE outdir=${OUTDIR} audio_prs=${#AUDIO_NUMS[@]}" >&2 +echo "AUDIO_PR_CORPUS_PULL_DONE outdir=${OUTDIR} audio_prs=${#AUDIO_NUMS[@]} pulled=${pulled} skipped=${skipped} refresh=${REFRESH}" >&2 diff --git a/.cursor/skills/review-curator-audio-pr/templates.md b/.cursor/skills/review-curator-audio-pr/templates.md index a9cc9548f9..16af214cb0 100644 --- a/.cursor/skills/review-curator-audio-pr/templates.md +++ b/.cursor/skills/review-curator-audio-pr/templates.md @@ -14,7 +14,7 @@ overview** (what the change does and why), then PR state, the diff, and all existing review activity. Complete the overview's plain-language summary first and present it before any findings; add your **Findings** section at the end. -```markdown +````markdown # Curator PR <N> Fresh Review - <YYYY-MM-DD> Review target: https://github.com/NVIDIA-NeMo/Curator/pull/<N> @@ -94,7 +94,7 @@ Total: <n> comments across <m> threads. ## Verdict <APPROVE / COMMENT / REQUEST CHANGES, and the blockers if any> -``` +```` ## B. Prior open threads (generated context) @@ -104,7 +104,7 @@ adding your own. This is **context, not your output** - you are not replying to these as the author; you read them to avoid duplicating and to check whether the author addressed them. -```markdown +````markdown # Curator PR <N> Open Review Threads - <YYYY-MM-DD> Current PR head: `<headRefOid>` @@ -121,7 +121,7 @@ Link: <html_url> ## Stale (outdated/resolved) threads - [<OUTDATED/RESOLVED>] `<path>:<line>` @<login> (<html_url>): <one-line body> -``` +```` ## C. Writing your review comments @@ -145,6 +145,14 @@ Example review comment: > Please also add a test under `tests/stages/audio/io/` covering > `keep_waveform=True`. +When you have an exact fix, attach a GitHub `suggestion` block so the author can +apply it in one click (anchor the comment to the line(s) it replaces): + +```suggestion + record = {k: v for k, v in record.items() if k != "waveform"} + line = json.dumps(record) +``` + Then post the set as one PR review: inline comments for line-specific findings, a top-level summary carrying the overall verdict (APPROVE / COMMENT / REQUEST CHANGES). From 5a930a7b8a4c6b74eed8089fa354a18f49c5e363 Mon Sep 17 00:00:00 2001 From: "aaftaabv@gmail.com" <aaftaabv@gmail.com> Date: Tue, 9 Jun 2026 18:17:46 +0530 Subject: [PATCH 09/10] Require post-#1608 audio corpus build before every PR review. Make corpus pull + build_corpus mandatory in the skill workflow so reviewers cannot skip cross-PR pattern grounding when reviewing audio Curator PRs. Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com> --- .../skills/review-curator-audio-pr/SKILL.md | 69 ++++++++++++------- .../knowledge-sources.md | 21 +++--- .../review-curator-audio-pr/scripts/README.md | 8 ++- 3 files changed, 63 insertions(+), 35 deletions(-) diff --git a/.cursor/skills/review-curator-audio-pr/SKILL.md b/.cursor/skills/review-curator-audio-pr/SKILL.md index 241a61a29e..df15ee36fb 100644 --- a/.cursor/skills/review-curator-audio-pr/SKILL.md +++ b/.cursor/skills/review-curator-audio-pr/SKILL.md @@ -3,10 +3,11 @@ name: review-curator-audio-pr description: >- Review someone else's NVIDIA-NeMo/Curator audio-modality pull request. This is a reviewer's tool: given a PR number, it locates (or shallow-clones) the repo, - pulls the PR, diffs the changed audio code, applies Curator's audio stage - contracts and contribution standards, shows what other reviewers have already - raised (so you don't duplicate), and helps you produce a structured set of - review findings (P0-P3) to post as review comments. Use when you are assigned + pulls the PR, builds the post-#1608 audio review corpus, diffs the changed + audio code, applies Curator's audio stage contracts and contribution standards, + shows what other reviewers have already raised (so you don't duplicate), and + helps you produce a structured set of review findings (P0-P3) to post as + review comments. Use when you are assigned or pick up an audio Curator PR and need to review it (e.g. "review audio PR 1967", "do a review of this Curator audio PR", "what should I flag on PR 1898"). It is NOT for a PR author responding to a review. @@ -35,17 +36,18 @@ you already have and only shallow-clones A human reviewer invokes this skill with a prompt like: > **"Review audio Curator PR <N> with the review-curator-audio-pr skill. Find or -> shallow-clone the repo, pull the PR and prior review activity, read the -> knowledge sources, then give me a detailed overview of what the PR does +> shallow-clone the repo, pull the PR and prior review activity, build the +> post-#1608 audio review corpus, read the knowledge sources, then give me a +> detailed overview of what the PR does > followed by my own findings (P0-P3) with exact `path:line` evidence and > concrete fixes, ready to post as PR comments. Skip anything other reviewers > already raised."** Variants: *"re-review PR <N> and show only what changed since the last review"*, *"what should I flag on the diarization PR <N>?"*, *"triage open audio PRs and -review the smallest one first"*. First-time in a repo, you can prime context -with *"build the post-#1608 audio review corpus first, then review PR <N>"* -(see [knowledge-sources.md](knowledge-sources.md) section 4). +review the smallest one first"*. Every review run must build (or refresh) the +post-#1608 audio review corpus before findings (see +[knowledge-sources.md](knowledge-sources.md) section 4). ## What you produce @@ -80,11 +82,12 @@ review comments. The review lenses and every doc/code reference live in - [ ] 0. Locate or shallow-clone the Curator repo - [ ] 1. Identify the PR and confirm it touches audio paths - [ ] 2. Pull fresh GitHub data (gh) -- [ ] 3. Read the diff -- [ ] 4. Read the knowledge sources (canonical docs + review lenses) to ground yourself -- [ ] 5. Generate the digest + prior-threads file for context -- [ ] 6. Write the detailed PR overview - only after steps 3-4 - present this first -- [ ] 7. Review through the lenses, write up findings (P0-P3), and post them after the overview +- [ ] 3. Build the post-#1608 audio review corpus (required) +- [ ] 4. Read the diff +- [ ] 5. Read the knowledge sources (canonical docs + review lenses + corpus) to ground yourself +- [ ] 6. Generate the digest + prior-threads file for context +- [ ] 7. Write the detailed PR overview - only after steps 4-5 - present this first +- [ ] 8. Review through the lenses, write up findings (P0-P3), and post them after the overview ``` ### Step 0 - Locate or shallow-clone the repo @@ -119,7 +122,24 @@ Pulls six REST endpoints (`pr view`, `reviews`, inline `comments`, issue `isResolved` / `isOutdated` flags the REST inline endpoint omits. You pull this to see prior review activity, not because you own the PR. -### Step 3 - Read the diff +### Step 3 - Build the post-#1608 audio review corpus (required) + +Before you explain or judge anything, build the consolidated reviewer-comment +corpus so you can spot cross-PR repeated mistakes. Do **not** skip this step, +even if a prior corpus file exists on disk - rerun the pull (incremental by +default) and render a fresh consolidated file for today's review: + +```bash +.cursor/skills/review-curator-audio-pr/scripts/pull_audio_pr_corpus.sh --since 1608 +.cursor/skills/review-curator-audio-pr/scripts/build_corpus.py --today <YYYY-MM-DD> +``` + +Confirm `.curator-pr-review/audio-corpus/audio_pr_corpus_<date>.md` exists and +read it before writing findings. If the pull scripts fail, stop and fix the +failure - do not proceed with a review that skipped the corpus. See +[knowledge-sources.md](knowledge-sources.md) section 4 for details. + +### Step 4 - Read the diff ```bash gh pr diff <N> --repo NVIDIA-NeMo/Curator # works on a shallow clone @@ -130,7 +150,7 @@ gh pr diff <N> --repo NVIDIA-NeMo/Curator # works on a shallow clone shallow clone from step 0. Always review the actual diff, never the comment text alone. Use `main` as the baseline for "how the audio stages already do this". -### Step 4 - Read the knowledge sources (ground yourself first) +### Step 5 - Read the knowledge sources (ground yourself first) Before you explain or judge anything, read [knowledge-sources.md](knowledge-sources.md) so your overview and findings are @@ -140,14 +160,15 @@ grounded in the canonical material, not guesswork: (`nemo_curator/stages/audio/README.md`), the developer-guide slides, the published fern audio docs, `.cursor/rules/*.mdc`, and `CONTRIBUTING.md`. - **section 1** - the audio code map, so you know where each changed file sits. -- **section 2** - the review lenses (you apply these in step 7). +- **section 2** - the review lenses (you apply these in step 8). +- **section 4 + corpus file** - the post-#1608 reviewer-comment corpus from step 3. Open the specific sources the diff touches (e.g. the fern page and the code for -the stage being changed). First time reviewing audio in this repo, optionally -build the post-#1608 reviewer-comment corpus (section 4) to see what reviewers -repeatedly raise. Only after this grounding do you write the overview (step 6). +the stage being changed). Check the corpus for recurring themes and whether this +PR repeats feedback already raised on other audio PRs. Only after this grounding +do you write the overview (step 7). -### Step 5 - Generate the context files +### Step 6 - Generate the context files ```bash .cursor/skills/review-curator-audio-pr/scripts/build_digest.py <N> --today <YYYY-MM-DD> @@ -158,9 +179,9 @@ status: **OPEN** (still unresolved), **OUTDATED** (pre-dates the current head), **RESOLVED**. Use the OPEN list to avoid re-raising points another reviewer already made. -### Step 6 - Summarize the PR in detail (present this first) +### Step 7 - Summarize the PR in detail (present this first) -Now that you have read the diff (step 3) and the knowledge sources (step 4), +Now that you have read the diff (step 4) and the knowledge sources (step 5), write the **"What this PR does (overview)"** section of the digest so a reader understands the change end to end: the problem it solves; the main audio stages/files it adds or modifies (use the "Areas touched" table the @@ -168,7 +189,7 @@ builder generates); key design decisions; new dependencies, config, or APIs; and the blast radius (what could regress). Lead your review with this overview - the author and other reviewers read it before any findings. -### Step 7 - Review through the lenses, then write up and post your findings +### Step 8 - Review through the lenses, then write up and post your findings Apply the lenses in [knowledge-sources.md](knowledge-sources.md) section 2 to the diff - each lens links the audio code, README section, and `.cursor/rules` diff --git a/.cursor/skills/review-curator-audio-pr/knowledge-sources.md b/.cursor/skills/review-curator-audio-pr/knowledge-sources.md index e2ef795727..5624b0d189 100644 --- a/.cursor/skills/review-curator-audio-pr/knowledge-sources.md +++ b/.cursor/skills/review-curator-audio-pr/knowledge-sources.md @@ -8,8 +8,9 @@ and has five parts: 2. **Review lenses** - what to check, each linked to the audio code, README section, and `.cursor/rules` contract it governs. 3. **Severity mapping** - P0-P3. -4. **Pre-review corpus** - pull every audio PR after #1608 (open + closed) with - reviewer comments, so you learn the recurring feedback before you review. +4. **Pre-review corpus (required)** - pull every audio PR after #1608 (open + + closed) with reviewer comments and read the consolidated file before you + write findings. 5. **GitHub data + scripts reference** - the `gh` endpoints and helper scripts. All paths are repo-relative. The skill is self-contained: this repository plus @@ -271,13 +272,14 @@ canonical sources to cite. --- -## 4. Pre-review corpus: learn from post-#1608 audio PRs +## 4. Pre-review corpus (required): learn from post-#1608 audio PRs PR [#1608](https://github.com/NVIDIA-NeMo/Curator/pull/1608) (`AudioBatch -> AudioTask` redesign) reset the audio stage contracts. Reviewer feedback on audio -PRs **after** #1608 is the best predictor of what to flag next. Before reviewing -a new PR - especially the first time you review audio in this repo - build the -consolidated corpus of that feedback: +PRs **after** #1608 is the best predictor of what to flag next. **Before every +audio PR review**, build (or refresh) the consolidated corpus of that feedback. +Do not skip this step, even when a prior `audio_pr_corpus_*.md` already exists +on disk - rerun the pull (incremental by default) and render today's file: ```bash # 1) discover audio PRs after #1608 (open + closed/merged) and pull their @@ -301,7 +303,8 @@ verbatim, anchored to `path:line`, plus a recurring-themes tally. Use the corpus to (a) recognize patterns reviewers repeatedly raise (the lenses in section 2 came from exactly this), and (b) check whether the PR in front of you repeats a mistake already called out elsewhere. It is read-only context; it -never auto-posts anything. Re-run periodically to keep it current. +never auto-posts anything. If the corpus scripts fail or the consolidated file +is missing, stop the review and fix the failure before writing findings. --- @@ -334,7 +337,7 @@ own findings). `scripts/pull_audio_pr_corpus.sh [--since N] [--outdir DIR] [--repo OWNER/REPO] [--limit N] [--refresh]` (incremental - skips PRs already on disk; `--refresh` re-pulls) and -`scripts/build_corpus.py [--outdir DIR] [--today YYYY-MM-DD]` build the -post-#1608 corpus (section 4). +`scripts/build_corpus.py [--outdir DIR] [--today YYYY-MM-DD]` **must** run before +every review to build the post-#1608 corpus (section 4; SKILL.md step 3). Default outdir: `.curator-pr-review/` (scratch; gitignored, safe to delete). diff --git a/.cursor/skills/review-curator-audio-pr/scripts/README.md b/.cursor/skills/review-curator-audio-pr/scripts/README.md index e210b0c062..dd40c15ef0 100644 --- a/.cursor/skills/review-curator-audio-pr/scripts/README.md +++ b/.cursor/skills/review-curator-audio-pr/scripts/README.md @@ -50,11 +50,15 @@ The thread join uses comment `databaseId` when present, else a (path, body-prefix) fallback, so both GraphQL thread-dump shapes classify correctly. It also re-checks the audio-only guard and aborts on a non-audio PR. -## pull_audio_pr_corpus.sh + build_corpus.py (pre-review learning) +## pull_audio_pr_corpus.sh + build_corpus.py (required pre-review corpus) + +Run these **before every audio PR review** (SKILL.md step 3). Do not skip even +when `.curator-pr-review/audio-corpus/` already has data - the pull is +incremental and `build_corpus.py` must render today's consolidated file. ```bash .cursor/skills/review-curator-audio-pr/scripts/pull_audio_pr_corpus.sh --since 1608 -.cursor/skills/review-curator-audio-pr/scripts/build_corpus.py +.cursor/skills/review-curator-audio-pr/scripts/build_corpus.py --today <YYYY-MM-DD> ``` `pull_audio_pr_corpus.sh` lists every PR with number > `--since` (PR numbers are From cfd4994495cfb608a66e181d82ca27f4a63ff3c0 Mon Sep 17 00:00:00 2001 From: "aaftaabv@gmail.com" <aaftaabv@gmail.com> Date: Tue, 9 Jun 2026 22:02:39 +0530 Subject: [PATCH 10/10] review-curator-audio-pr: document how to provide a gh-authenticated environment Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com> --- .../skills/review-curator-audio-pr/SKILL.md | 25 ++++++++++++++++--- 1 file changed, 22 insertions(+), 3 deletions(-) diff --git a/.cursor/skills/review-curator-audio-pr/SKILL.md b/.cursor/skills/review-curator-audio-pr/SKILL.md index df15ee36fb..4adb8ff4c1 100644 --- a/.cursor/skills/review-curator-audio-pr/SKILL.md +++ b/.cursor/skills/review-curator-audio-pr/SKILL.md @@ -26,9 +26,23 @@ Scope: the **audio** modality only - code under `nemo_curator/stages/audio/`, PR touches no audio path, the pull step refuses to run (step 2) - review it with a modality-appropriate tool instead. -Requirements: the GitHub CLI (`gh`) authenticated against `github.com`, and -`git`. You do **not** need to pre-clone the repo - step 0 reuses any checkout -you already have and only shallow-clones +Requirements: an environment with `git` and the GitHub CLI (`gh`) **already +authenticated** against `github.com`. Everything the skill does goes through +`gh`/`git`, so a working `gh` is the only credential you need. Verify with `gh +auth status`; if it is not green, give it an authenticated environment one of +these ways before invoking the skill: + +- **Interactive:** run `gh auth login` (choose GitHub.com, HTTPS) once; the token + is cached for future runs. +- **Token (CI / headless / sandbox):** export `GH_TOKEN` (or `GITHUB_TOKEN`) with + a PAT that has `repo` scope - e.g. `export GH_TOKEN=ghp_...` - then `gh auth + status` should pass. This is the way to hand a remote/agent runner a + `gh`-configured environment. +- **Existing config:** if you already have a `~/.config/gh/hosts.yml` (or a host + with `gh` set up), run the skill from that shell/user so it inherits the auth. + +You do **not** need to pre-clone the repo - step 0 reuses any checkout you +already have and only shallow-clones [NVIDIA-NeMo/Curator](https://github.com/NVIDIA-NeMo/Curator) if none is found. ## The reviewer prompt @@ -49,6 +63,11 @@ review the smallest one first"*. Every review run must build (or refresh) the post-#1608 audio review corpus before findings (see [knowledge-sources.md](knowledge-sources.md) section 4). +Run the prompt in an environment where `gh auth status` passes. For a +remote/headless agent, say so explicitly, e.g. *"`gh` is authenticated via +`GH_TOKEN` in this environment - review audio Curator PR <N> ..."* (see +Requirements above for how to provide a `gh`-configured environment). + ## What you produce You produce two things, **in this order**: (1) a **detailed overview of what the