From 15fbe1d0b14229604e8aab2744ea968ce29e41fb Mon Sep 17 00:00:00 2001
From: "aaftaabv@gmail.com" <aaftaabv@gmail.com>
Date: Fri, 5 Jun 2026 12:45:39 +0530
Subject: [PATCH 01/10] Add review-curator-pr Cursor skill

Add a self-contained Cursor Agent Skill under .cursor/skills/review-curator-pr
that helps contributors and maintainers review NVIDIA-NeMo/Curator pull
requests consistently.

The skill:
- pulls fresh PR data via the GitHub CLI (REST endpoints + GraphQL review
  threads, which carry isResolved/isOutdated);
- generates a structured review digest and a paste-ready open-comment queue;
- applies review lenses built on the repo's own .cursor/rules contracts and
  CONTRIBUTING.md standards (stage contracts, setup lifecycle, dependency
  hygiene, secret-safe logging, tests/coverage, PR reviewability).

Includes two helper scripts (scripts/pr_review_pull.sh, scripts/build_digest.py)
that write to a scratch .curator-pr-review/ directory, which is gitignored.

Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>
---
 .cursor/skills/review-curator-pr/SKILL.md     | 140 +++++++++
 .../review-curator-pr/recurring-themes.md     | 127 ++++++++
 .cursor/skills/review-curator-pr/reference.md |  86 ++++++
 .../review-curator-pr/scripts/README.md       |  32 ++
 .../review-curator-pr/scripts/build_digest.py | 279 ++++++++++++++++++
 .../scripts/pr_review_pull.sh                 |  92 ++++++
 .cursor/skills/review-curator-pr/templates.md | 134 +++++++++
 .gitignore                                    |   3 +
 8 files changed, 893 insertions(+)
 create mode 100644 .cursor/skills/review-curator-pr/SKILL.md
 create mode 100644 .cursor/skills/review-curator-pr/recurring-themes.md
 create mode 100644 .cursor/skills/review-curator-pr/reference.md
 create mode 100644 .cursor/skills/review-curator-pr/scripts/README.md
 create mode 100755 .cursor/skills/review-curator-pr/scripts/build_digest.py
 create mode 100755 .cursor/skills/review-curator-pr/scripts/pr_review_pull.sh
 create mode 100644 .cursor/skills/review-curator-pr/templates.md
diff --git a/.cursor/skills/review-curator-pr/SKILL.md b/.cursor/skills/review-curator-pr/SKILL.md
new file mode 100644
index 0000000000..cf6a7eb657
--- /dev/null
+++ b/.cursor/skills/review-curator-pr/SKILL.md
@@ -0,0 +1,140 @@
+---
+name: review-curator-pr
+description: >-
+  Review NVIDIA-NeMo/Curator pull requests against the project's stage
+  contracts and contribution standards. Pulls fresh PR data with the GitHub
+  CLI, diffs the changed code, applies Curator's documented review lenses, and
+  emits a structured review digest plus a paste-ready comment queue. Use when a
+  contributor or maintainer asks to review, re-review, or triage a Curator PR
+  (e.g. "review PR 1967", "what's still open on PR 1898", "triage this PR"), or
+  to draft replies to reviewer comments.
+---
+
+# Review NeMo Curator PRs
+
+Produce a defensible, reproducible review of an `NVIDIA-NeMo/Curator` PR from a
+fresh GitHub pull plus the code diff. The audience is the PR author (who must
+respond to reviewers and land the change) and maintainers triaging the queue.
+
+Requirements: the GitHub CLI (`gh`) authenticated against `github.com`, and a
+local checkout of this repository (the one containing this skill).
+
+## Two deliverables (always)
+
+Each review run writes two Markdown files into a scratch directory (default
+`.curator-pr-review/`, override with `--outdir`):
+
+1. `curator_pr<N>_fresh_review_<YYYY_MM_DD>.md` - full digest: PR state,
+   commits, changed-file table, reviews, every inline comment grouped by file
+   with OPEN / OUTDATED / RESOLVED status, and issue comments.
+2. `curator_pr<N>_github_comment_queue_<YYYY_MM_DD>.md` - only the OPEN root
+   threads that need an author reply or code change, paste-ready, with stale
+   threads listed at the bottom for traceability.
+
+See [templates.md](templates.md) for the exact section layout. Keeping the
+structure stable makes diffs across review passes mechanical, so a re-review
+only surfaces what changed.
+
+## Workflow
+
+```
+- [ ] 1. Identify the PR number
+- [ ] 2. Pull fresh GitHub data (gh)
+- [ ] 3. Check out the PR head and diff against the base
+- [ ] 4. Read the changed code through the review lenses
+- [ ] 5. Generate the digest + comment queue
+- [ ] 6. Classify findings P0-P3 and (optionally) draft replies
+```
+
+### Step 1 - Identify the PR
+
+Confirm the target: `gh pr view <N> --repo NVIDIA-NeMo/Curator`. If a prior
+review digest for this PR exists, read it first and record its head SHA as the
+baseline so commits and comments can be marked NEW vs already-seen.
+
+### Step 2 - Pull fresh GitHub data
+
+Run the pull script (writes `pr<N>_*_latest.json` plus timestamped snapshots so
+prior pulls are preserved for delta analysis):
+
+```bash
+.cursor/skills/review-curator-pr/scripts/pr_review_pull.sh <N>
+```
+
+It pulls six REST endpoints (`pr view`, `reviews`, inline `comments`, issue
+`comments`, `files`, `commits`) and the GraphQL review threads, which carry the
+`isResolved` / `isOutdated` flags the REST inline endpoint omits.
+
+### Step 3 - Diff the code
+
+```bash
+gh pr checkout <N>                       # or: git fetch origin pull/<N>/head:pr-<N>
+git diff origin/main...HEAD             # changed files on this PR
+```
+
+Always review the actual diff, never the comment text alone. Use `main` as the
+baseline for "how the project already does this".
+
+### Step 4 - Read through the review lenses
+
+Apply the lenses in [recurring-themes.md](recurring-themes.md): stage
+contracts, setup/teardown lifecycle, dependency and optional-extra hygiene,
+secret-safe logging, memory/serialization of large payloads, streaming and
+performance, tutorials/docs, tests/coverage, and PR size/reviewability.
+
+These build on the project's own always-on rules in `.cursor/rules/`
+(`processing-stage-patterns`, `composite-stage-patterns`, `task-patterns`,
+`executors`, `resources-configuration`, `pipeline-structure`,
+`modality-structure`, `coding-standards`) - treat those as the canonical
+contract and cite them when a change deviates.
+
+### Step 5 - Generate the two files
+
+```bash
+.cursor/skills/review-curator-pr/scripts/build_digest.py <N> --today <YYYY-MM-DD>
+```
+
+The builder joins the pulled JSON and classifies each comment by thread status:
+**OPEN** (actionable), **OUTDATED** (pre-dates the current head), **RESOLVED**.
+The comment queue lists only OPEN root threads.
+
+### Step 6 - Findings and replies
+
+In the digest's findings section, classify each issue by severity:
+
+- **P0** - merge blocker: data loss / crash on a valid config, a secret leaked
+  into logs, or an import that breaks a supported install.
+- **P1** - fix before merge: undeclared runtime dependencies, missing required
+  tests/coverage, environment-specific code in a reusable path, a PR too large
+  to review safely.
+- **P2** - should fix: duplicated logic, inefficient batch handling, incomplete
+  metrics/observability, documented-but-unwired config.
+- **P3** - nice to have: temporary workarounds that need a test + comment,
+  debug-only knobs to document.
+
+When drafting replies, cite the exact `path:line-range` on the current head,
+state whether the point is already addressed (with the commit SHA) or will be,
+and keep each reply self-contained. See [templates.md](templates.md) section C.
+
+## Conventions to enforce (from CONTRIBUTING.md and rules)
+
+- **DCO sign-off** is required on every commit (`git commit -s`).
+- **Tests** must accompany source changes; GPU-only tests use
+  `@pytest.mark.gpu`; the project enforces 80% coverage under `nemo_curator/`,
+  and `tests/` mirrors the `nemo_curator/` layout.
+- **NVIDIA Apache-2.0 copyright header** on every non-empty Python file.
+- **Ruff** lint/format, 119-char lines; **loguru** for logging.
+- Supported Python: 3.10-3.12.
+
+## Knowledge sources
+
+Full index in [reference.md](reference.md). Quick map:
+
+| Need | Source |
+|------|--------|
+| Stage / pipeline contracts | `.cursor/rules/*.mdc`, `nemo_curator/stages/base.py` |
+| Modality guides | `nemo_curator/stages/<modality>/README.md`, `tutorials/<modality>/README.md` |
+| Backends / executors | `nemo_curator/backends/`, `nemo_curator/pipeline/pipeline.py` |
+| Perf expectations | `benchmarking/` |
+| Contribution rules | `CONTRIBUTING.md` |
+| Live PR data | `gh` (see scripts) |
diff --git a/.cursor/skills/review-curator-pr/recurring-themes.md b/.cursor/skills/review-curator-pr/recurring-themes.md
new file mode 100644
index 0000000000..d0599d2276
--- /dev/null
+++ b/.cursor/skills/review-curator-pr/recurring-themes.md
@@ -0,0 +1,127 @@
+# Review lenses - recurring themes in Curator PRs
+
+Apply each lens to the diff and cite exact `path:line` evidence. These extend
+the always-on rules in `.cursor/rules/`; when a change violates a rule, link the
+rule by name. Most points are modality-agnostic; audio examples are
+illustrative.
+
+## Stage contracts and reuse
+
+- A stage subclasses `ProcessingStage[InputTask, OutputTask]` with explicit
+  `inputs()` / `outputs()`, a meaningful `name`, and appropriate `resources` /
+  `batch_size` (set as class attributes / dataclass fields, never as the
+  read-only `_name` / `_resources` / `_batch_size` properties).
+- GPU or vectorizable work belongs in `process_batch()` with a real
+  `batch_size`; `__init__` stays lightweight (it runs on the driver).
+- Reuse existing readers, writers, and shared model abstractions instead of
+  bespoke orchestration. Use `CompositeStage` to present several internal stages
+  as one user-facing stage.
+- If `inputs()` does not declare a required key, runtime validation will not
+  check it, and a direct `task.data[key]` access then raises an opaque
+  `KeyError` instead of a clear validation error. Declare required keys.
+- Fan-out stages (one task -> many) need deterministic child IDs and propagated
+  `_metadata` / `_stage_perf`. First-stage discovery from an empty input should
+  constrain placement so work is not duplicated across every worker.
+
+## Setup / teardown lifecycle and dependencies
+
+- `setup_on_node()` = once-per-node prep (e.g. model downloads / shared cache).
+  `setup()` = per-worker init (model load, GPU allocation). `teardown()` frees
+  resources. Don't conflate them.
+- Import heavy or optional dependencies lazily inside `setup()` (or behind a
+  guarded import), never unconditionally at module top level, when the dependency
+  ships only with an optional extra. A top-level `import <optional_pkg>` breaks
+  `import nemo_curator...` on any install without that extra.
+- Declare the full runtime dependency set in the appropriate `pyproject.toml`
+  optional extra; add an import smoke test under that extra. Don't rely on an
+  external Docker image to supply dependencies a reusable stage needs.
+- Avoid hard version pins (`pkg==x.y.z`) and global overrides for libraries
+  shared across modalities (e.g. `transformers`, `huggingface-hub`,
+  `accelerate`); prefer compatible ranges so other extras don't break.
+
+## Secret-safe logging
+
+- Never log a full resolved config object; it can contain tokens / credentials
+  (e.g. an HF token, access keys). Redact secret-like keys before logging, or
+  log only a non-secret subset, and add a test asserting the secret value never
+  appears in logged output.
+
+## Memory and serialization of large payloads
+
+- Drop large in-memory payloads (waveform arrays, image/video tensors) from
+  `task.data` once a stage no longer needs them. Leaked arrays reach downstream
+  JSON writers and raise `TypeError: Object of type ndarray is not JSON
+  serializable` (which can fail an entire batch) or cause memory blowups. Strip
+  by key plus a duck-typed guard, and wrap `json.dumps` so a stray
+  non-serializable value reports the offending key instead of crashing.
+- Guard external I/O edge cases (e.g. `tarfile.extractfile()` returns `None` for
+  non-regular members even after `isfile()`; check before `.read()`).
+
+## Streaming, performance, benchmarking
+
+- Streaming execution with per-stage resource configuration is the default model
+  for scalable pipelines (see `executors.mdc`).
+- Emit stage metrics through the supported metrics path; keep records
+  JSON-shaped. For multi-node analysis, include worker/node/GPU identity and
+  latency distributions (p50/p95/p99), not just totals/averages.
+- Writers that loop per task inside `process_batch()` reopen output files
+  repeatedly; group by output target and open once per batch.
+- Standalone benchmark scripts belong in the `benchmarking/` flow with config
+  entries and comparable parameters - not freestanding.
+
+## Environment-specific code in reusable paths
+
+- Keep infrastructure assumptions (specific object-store URLs, auth env vars,
+  cluster-specific paths) out of generic reader/writer code. Make remote access
+  pluggable: accept local paths, ordinary cloud URIs, or a user-provided
+  opener/resolver, and isolate any environment-specific support behind a small,
+  documented, optional component.
+
+## Tutorials and docs
+
+- Tutorials must be runnable, minimal, and use public APIs only - no private
+  APIs, no machine-local absolute paths, no committed noisy notebook output, no
+  executor mismatch between a notebook and its CLI. Prefer `uv` for environment
+  setup if the tutorial standard does.
+- Don't duplicate pipeline-construction logic between a tutorial entrypoint and
+  an `examples/` script; extract one shared builder both call.
+- Every config key documented for users must actually reach a stage; validate
+  config-to-stage mapping and fail (or warn) on unused keys.
+- Don't commit generated analysis/scratch docs into the source tree.
+
+## Tests and standards (CONTRIBUTING + coding-standards)
+
+- Source changes need accompanying tests; `tests/` mirrors `nemo_curator/`. The
+  project enforces 80% coverage under `nemo_curator/`. GPU-only tests use
+  `@pytest.mark.gpu`.
+- Every non-empty Python file carries the NVIDIA Apache-2.0 copyright header.
+- Ruff clean at 119-char lines; type annotations on functions; loguru for logs.
+- Every commit is DCO signed-off.
+
+## PR size and reviewability
+
+- Large multi-purpose PRs are hard to review and to defend. Prefer splitting by
+  stage / module / tutorial / benchmark. If a single PR must remain, organize it
+  as if sliced: no duplicated builder logic, clear module ownership, and a
+  reviewable commit sequence.
+
+## Smaller recurring nits
+
+- Missing copyright headers on new files.
+- Top-level imports that should be lazy/guarded.
+- Long inline strings (prompts, configs) that should be module constants.
+- Dataclass fields validated in `__post_init__` that could just be typed.
+- Dead conditionals / redundant truthiness checks.
+- Tests added for tutorials that don't warrant them.
+- Debug-only knobs presented as features - document them as debug throttles.
+- `trust_remote_code=True` (or similar) hardcoded - expose as a parameter so
+  security-conscious users can opt out.
+
+## Severity mapping
+
+| Severity | Meaning |
+|----------|---------|
+| P0 | Merge blocker: crash/data-loss on a valid config, secret leaked to logs, import breaks a supported install |
+| P1 | Fix before merge: undeclared deps, missing required tests/coverage, env-specific code in reusable path, PR too large |
+| P2 | Should fix: duplicated logic, inefficient batching, incomplete metrics, documented-but-unwired config |
+| P3 | Nice to have: temporary workaround needing test+comment, debug knob docs |
diff --git a/.cursor/skills/review-curator-pr/reference.md b/.cursor/skills/review-curator-pr/reference.md
new file mode 100644
index 0000000000..40bdd0c0cc
--- /dev/null
+++ b/.cursor/skills/review-curator-pr/reference.md
@@ -0,0 +1,86 @@
+# Reference - NeMo Curator PR review
+
+All paths are relative to the repository root. The skill is self-contained: it
+relies only on this repository plus the GitHub CLI (`gh`).
+
+## Canonical contracts (`.cursor/rules/`)
+
+These always-on rules are the authoritative description of how stages and
+pipelines must be written. Cite them when a PR deviates.
+
+| Rule | Covers |
+|------|--------|
+| `processing-stage-patterns.mdc` | `ProcessingStage` subclassing, `inputs()`/`outputs()`, `process()`, lifecycle hooks |
+| `composite-stage-patterns.mdc` | `CompositeStage` decomposition |
+| `task-patterns.mdc` | `Task` / typed task subclasses, metadata flow |
+| `executors.mdc` | Xenna / Ray Data backends and when to use each |
+| `resources-configuration.mdc` | `Resources`, `batch_size`, GPU allocation |
+| `pipeline-structure.mdc` | Pipeline assembly |
+| `modality-structure.mdc` | Where modality code/tests/tutorials live |
+| `coding-standards.mdc` | Ruff, copyright header, loguru, pytest, Python versions |
+
+## Framework code (diff ground truth)
+
+- `nemo_curator/stages/base.py` - `ProcessingStage` base class.
+- `nemo_curator/backends/` - `base.py`, `xenna/`, `ray_data/`,
+  `ray_actor_pool/`; adapters define how stages run on each backend.
+- `nemo_curator/pipeline/pipeline.py` - pipeline construction and guards.
+- `nemo_curator/tasks/` - task types.
+
+## Modality guides
+
+| Modality | Stage README | Tutorials |
+|----------|--------------|-----------|
+| audio | `nemo_curator/stages/audio/README.md` | `tutorials/audio/` |
+| interleaved | `nemo_curator/stages/interleaved/README.md` | `tutorials/interleaved/` |
+| text / image / video / math / synthetic | (stage code under `nemo_curator/stages/`) | `tutorials/{text,image,video,math,synthetic}/` |
+
+## Performance / benchmarking
+
+- `benchmarking/README.md` - how benchmarks are wired and run.
+- `benchmarking/AUDIO_PROFILING.md`, `benchmarking/ALM_BENCHMARK.md` - modality
+  perf expectations. Standalone benchmark scripts should be integrated here, not
+  left freestanding.
+
+## Contribution rules
+
+- `CONTRIBUTING.md` - DCO sign-off (`git commit -s`), PR process, test
+  requirements, coverage threshold, copyright header.
+
+## GitHub data via `gh`
+
+`scripts/pr_review_pull.sh <N>` pulls these into the scratch outdir:
+
+| File | Endpoint |
+|------|----------|
+| `pr<N>_gh_latest.json` | `gh pr view <N> --json ...` |
+| `pr<N>_reviews_latest.json` | `repos/:owner/:repo/pulls/<N>/reviews` |
+| `pr<N>_review_comments_latest.json` | `repos/:owner/:repo/pulls/<N>/comments` (inline) |
+| `pr<N>_issue_comments_latest.json` | `repos/:owner/:repo/issues/<N>/comments` |
+| `pr<N>_files_latest.json` | `repos/:owner/:repo/pulls/<N>/files` |
+| `pr<N>_commits_latest.json` | `repos/:owner/:repo/pulls/<N>/commits` |
+| `pr<N>_review_threads_latest.json` | GraphQL `pullRequest.reviewThreads` (isResolved/isOutdated) |
+
+The REST inline-comments endpoint does not expose resolve/outdate state; the
+GraphQL thread payload does. The builder joins them by comment `databaseId`
+(== REST `id`), falling back to a `(path, body-prefix)` match when an older
+thread dump lacks `databaseId`.
+
+GraphQL query used by the pull script:
+
+```graphql
+query($owner:String!,$repo:String!,$pr:Int!){
+  repository(owner:$owner,name:$repo){
+    pullRequest(number:$pr){
+      reviewThreads(first:100){ nodes{
+        id isResolved isOutdated isCollapsed line originalLine path
+        comments(first:50){ nodes{ databaseId } }
+      }}}}}
+```
+
+## Scripts
+
+- `scripts/pr_review_pull.sh <N> [--outdir DIR] [--repo OWNER/REPO]` - fetch.
+- `scripts/build_digest.py <N> [--outdir DIR] [--today YYYY-MM-DD] [--prev-head SHA] [--baseline-ts TS]` - render the digest + comment queue.
+
+Default outdir: `.curator-pr-review/` (scratch; safe to delete or gitignore).
diff --git a/.cursor/skills/review-curator-pr/scripts/README.md b/.cursor/skills/review-curator-pr/scripts/README.md
new file mode 100644
index 0000000000..17f8f23d42
--- /dev/null
+++ b/.cursor/skills/review-curator-pr/scripts/README.md
@@ -0,0 +1,32 @@
+# Scripts - review-curator-pr
+
+Both require the GitHub CLI (`gh`) authenticated against github.com. They write
+to a scratch directory (default `.curator-pr-review/`) that is safe to delete;
+do not commit its contents.
+
+## pr_review_pull.sh
+
+```bash
+.cursor/skills/review-curator-pr/scripts/pr_review_pull.sh <PR_NUMBER> [--outdir DIR] [--repo OWNER/REPO]
+```
+
+Pulls six REST endpoints (`pr view`, `reviews`, inline `comments`, issue
+`comments`, `files`, `commits`) plus the GraphQL review threads into
+`pr<N>_*_latest.json` (and timestamped snapshots for delta analysis).
+
+## build_digest.py
+
+```bash
+.cursor/skills/review-curator-pr/scripts/build_digest.py <PR_NUMBER> [--outdir DIR] [--today YYYY-MM-DD] [--prev-head SHA] [--baseline-ts TS]
+```
+
+Joins the `pr<N>_*_latest.json` files and writes
+`curator_pr<N>_fresh_review_<date>.md` + `curator_pr<N>_github_comment_queue_<date>.md`.
+
+- `--prev-head` sets the SHA in the "commits since prior reviewed head" header.
+- `--baseline-ts <TS>` (a prior pull's timestamp suffix) marks
+  comments/reviews/commits NEW vs already-seen.
+
+The thread join uses comment `databaseId` when present, else a
+(path, body-prefix) fallback, so both GraphQL thread-dump shapes classify
+correctly.
diff --git a/.cursor/skills/review-curator-pr/scripts/build_digest.py b/.cursor/skills/review-curator-pr/scripts/build_digest.py
new file mode 100755
index 0000000000..42f654d4e9
--- /dev/null
+++ b/.cursor/skills/review-curator-pr/scripts/build_digest.py
@@ -0,0 +1,279 @@
+#!/usr/bin/env python3
+"""Build a Curator PR fresh-review digest + open-comment queue.
+
+Usage: build_digest.py <PR_NUMBER> [--outdir DIR] [--today YYYY-MM-DD]
+                        [--prev-head SHA] [--baseline-ts TS]
+
+Reads the pr<N>_*_latest.json files written by pr_review_pull.sh from --outdir
+(default .curator-pr-review) and writes into the same directory:
+  curator_pr<N>_fresh_review_<date>.md          full digest
+  curator_pr<N>_github_comment_queue_<date>.md  open-thread paste queue
+
+The GraphQL thread payload is joined to REST inline comments by `databaseId`
+when present, otherwise by a (path, body-prefix) match, so both thread-dump
+shapes classify correctly. Pass --baseline-ts <TS> (a prior pull's timestamp
+suffix) to mark comments / reviews / commits NEW vs already-seen.
+"""
+from __future__ import annotations
+
+import argparse
+import datetime as dt
+import json
+from pathlib import Path
+
+BODY_KEY_LEN = 120
+
+
+def load(path: Path) -> object:
+    return json.loads(path.read_text())
+
+
+def load_baseline_ids(baseline, key):
+    if baseline is None or not baseline.exists():
+        return set()
+    return {entry[key] for entry in load(baseline) if key in entry}
+
+
+def build_thread_index(threads_payload):
+    """Return {by_dbid, by_pathbody, nthreads}, robust to both GraphQL shapes."""
+    idx = {"by_dbid": {}, "by_pathbody": {}, "nthreads": 0}
+    try:
+        nodes = threads_payload["data"]["repository"]["pullRequest"]["reviewThreads"]["nodes"]
+    except (KeyError, TypeError):
+        return idx
+    idx["nthreads"] = len(nodes)
+    for thread in nodes:
+        cnodes = thread["comments"]["nodes"]
+        path = thread.get("path")
+        for pos, c in enumerate(cnodes):
+            meta = {
+                "thread_id": thread.get("id"),
+                "is_resolved": thread.get("isResolved", False),
+                "is_outdated": thread.get("isOutdated", False),
+                "thread_comment_count": len(cnodes),
+                "is_first_in_thread": pos == 0,
+            }
+            db = c.get("databaseId")
+            if db is not None:
+                idx["by_dbid"][int(db)] = meta
+            body = (c.get("body") or "").strip()
+            if body:
+                idx["by_pathbody"].setdefault((path, body[:BODY_KEY_LEN]), meta)
+    return idx
+
+
+def thread_meta(idx, c):
+    m = idx["by_dbid"].get(c["id"])
+    if m is not None:
+        return m
+    body = (c.get("body") or "").strip()
+    return idx["by_pathbody"].get((c.get("path"), body[:BODY_KEY_LEN]))
+
+
+def shorten(body, n=600):
+    body = (body or "").strip()
+    if len(body) <= n:
+        return body
+    return body[:n].rstrip() + "\n[...truncated...]"
+
+
+def status_of(meta):
+    if meta is None:
+        return "ORPHAN"
+    if meta["is_resolved"]:
+        return "RESOLVED"
+    if meta["is_outdated"]:
+        return "OUTDATED"
+    return "OPEN"
+
+
+def p(parts, *lines):
+    parts.extend(lines)
+
+
+def build(pr, outdir, today, baseline_ts, prev_head):
+    def latest(kind):
+        return outdir / f"pr{pr}_{kind}_latest.json"
+
+    gh = load(latest("gh"))
+    reviews = load(latest("reviews"))
+    review_comments = load(latest("review_comments"))
+    issue_comments = load(latest("issue_comments"))
+    files = load(latest("files"))
+    commits = load(latest("commits"))
+    threads_path = latest("review_threads")
+    idx = build_thread_index(load(threads_path)) if threads_path.exists() else build_thread_index({})
+
+    def base(kind):
+        return outdir / f"pr{pr}_{kind}_{baseline_ts}.json" if baseline_ts else None
+
+    base_comment_ids = load_baseline_ids(base("review_comments"), "id")
+    base_review_ids = load_baseline_ids(base("reviews"), "id")
+    base_issue_ids = load_baseline_ids(base("issue_comments"), "id")
+    base_commit_shas = set()
+    bc = base("commits")
+    if bc and bc.exists():
+        base_commit_shas = {c["sha"] for c in load(bc) if "sha" in c}
+
+    head_oid = gh.get("headRefOid", "")
+    base_oid = gh.get("baseRefOid", "")
+    prev8 = (prev_head or "")[:8] or "(none)"
+    date_us = today.replace("-", "_")
+
+    # ----- DIGEST -----
+    d = []
+    p(d, f"# Curator PR {pr} Fresh Review - {today}", "")
+    p(d, f"Review target: https://github.com/NVIDIA-NeMo/Curator/pull/{pr}", "")
+    p(d, f"Current PR head reviewed: `{head_oid}`", "")
+    p(d, f"Base recorded by GitHub metadata: `{base_oid}`", "")
+    if prev_head:
+        p(d, f"Previous reviewed head: `{prev_head}`", "")
+    p(d, "")
+
+    p(d, "## PR state at review time", "")
+    p(d, "| Field | Value |", "|---|---|")
+    p(d, f"| state | {gh.get('state')} |")
+    p(d, f"| isDraft | {gh.get('isDraft')} |")
+    p(d, f"| reviewDecision | {gh.get('reviewDecision') or '(none)'} |")
+    p(d, f"| mergeable / mergeStateStatus | {gh.get('mergeable')} / {gh.get('mergeStateStatus')} |")
+    p(d, f"| changedFiles | {gh.get('changedFiles')} |")
+    p(d, f"| additions / deletions | +{gh.get('additions')} / -{gh.get('deletions')} |")
+    p(d, f"| commits | {len(commits)} |")
+    p(d, f"| inline review comments total | {len(review_comments)} |")
+    p(d, f"| reviews submitted | {len(reviews)} |")
+    p(d, f"| top-level issue comments | {len(issue_comments)} |")
+    p(d, f"| updatedAt | {gh.get('updatedAt')} |", "")
+
+    new_commits = [c for c in commits if c["sha"] not in base_commit_shas] if base_commit_shas else commits
+    p(d, f"## Commits since prior reviewed head `{prev8}` ({len(new_commits)} new)", "")
+    p(d, "```")
+    for c in new_commits:
+        sha = c["sha"][:8]
+        msg = (c["commit"]["message"] or "").splitlines()[0]
+        cdate = (c["commit"]["author"] or {}).get("date", "")
+        p(d, f"{sha}  {msg}  ({cdate[:10]})")
+    p(d, "```", "")
+
+    p(d, "Files changed in current PR head (`git diff <base>..<head>`):", "")
+    p(d, "| File | +/- | status |", "|---|---|---|")
+    for f in sorted(files, key=lambda x: x["filename"]):
+        p(d, f"| `{f['filename']}` | +{f['additions']} -{f['deletions']} | {f['status']} |")
+    p(d, "")
+
+    p(d, "## Reviews", "")
+    review_by_id = {r["id"]: r for r in reviews}
+    for r in sorted(reviews, key=lambda x: x.get("submitted_at") or ""):
+        marker = " (NEW)" if base_review_ids and r["id"] not in base_review_ids else ""
+        commit = (r.get("commit_id") or "")[:8] or "n/a"
+        p(d, f"### #{r['id']} by @{r['user']['login']}  state={r['state']}  "
+             f"commit={commit}  submitted={r.get('submitted_at')}{marker}")
+        body = (r.get("body") or "").strip()
+        if body:
+            p(d, "", shorten(body, 1500))
+        p(d, "")
+
+    p(d, "## Inline review comments", "")
+    p(d, f"Total: {len(review_comments)} comments across {idx['nthreads'] or '?'} threads.", "")
+    by_status = {"OPEN": [], "OUTDATED": [], "RESOLVED": [], "ORPHAN": []}
+    for c in review_comments:
+        by_status[status_of(thread_meta(idx, c))].append(c)
+    for s in ("OPEN", "OUTDATED", "RESOLVED", "ORPHAN"):
+        p(d, f"- {s}: {len(by_status[s])}")
+    p(d, "")
+
+    by_path = {}
+    for c in review_comments:
+        by_path.setdefault(c["path"], []).append(c)
+    for path in sorted(by_path):
+        p(d, f"### `{path}`", "")
+        for c in sorted(by_path[path], key=lambda x: (x.get("line") or x.get("original_line") or 0, x["created_at"])):
+            meta = thread_meta(idx, c)
+            status = status_of(meta)
+            new = " NEW" if base_comment_ids and c["id"] not in base_comment_ids else ""
+            line = c.get("line") or c.get("original_line")
+            review = review_by_id.get(c.get("pull_request_review_id"), {})
+            commit = (c.get("commit_id") or "")[:8]
+            in_reply = c.get("in_reply_to_id")
+            reply_marker = f"  reply->{in_reply}" if in_reply else ""
+            tcount = meta["thread_comment_count"] if meta else 1
+            tpos = " (root)" if meta and meta["is_first_in_thread"] else (" (reply)" if meta else "")
+            p(d, f"- **#{c['id']}** @{c['user']['login']} {c['created_at']}  "
+                 f"line={line} commit={commit} status=**{status}**{new}  "
+                 f"thread_comments={tcount}{tpos} review_state={review.get('state', '?')}{reply_marker}")
+            p(d, f"  url: {c['html_url']}")
+            body = shorten(c.get("body") or "", 700).replace("\n", "\n  > ")
+            p(d, f"  > {body}", "")
+    p(d, "")
+
+    p(d, "## Top-level issue comments", "")
+    for c in sorted(issue_comments, key=lambda x: x["created_at"]):
+        new = " NEW" if base_issue_ids and c["id"] not in base_issue_ids else ""
+        p(d, f"### #{c['id']} by @{c['user']['login']}  {c['created_at']}{new}", "")
+        p(d, shorten(c.get("body") or "", 1500), "")
+
+    digest_path = outdir / f"curator_pr{pr}_fresh_review_{date_us}.md"
+    digest_path.write_text("\n".join(d) + "\n", encoding="utf-8")
+    print(f"wrote {digest_path}  ({digest_path.stat().st_size} bytes)")
+
+    # ----- COMMENT QUEUE (open root threads only) -----
+    q = []
+    p(q, f"# Curator PR {pr} GitHub Comment Queue - {today}", "")
+    p(q, f"Review target: https://github.com/NVIDIA-NeMo/Curator/pull/{pr}", "")
+    p(q, f"Current PR head: `{head_oid}`", "")
+    p(q, "Open inline review threads from the most recent reviewer pass that need an "
+         "author response or a code change. Outdated/resolved threads are listed at "
+         "the bottom for traceability only.", "")
+
+    def is_root(c):
+        return (thread_meta(idx, c) or {}).get("is_first_in_thread", True)
+
+    open_root = [c for c in review_comments if status_of(thread_meta(idx, c)) == "OPEN" and is_root(c)]
+    open_root.sort(key=lambda c: (c["path"], c.get("line") or c.get("original_line") or 0, c["created_at"]))
+    for i, c in enumerate(open_root, 1):
+        meta = thread_meta(idx, c) or {}
+        line = c.get("line") or c.get("original_line")
+        new = " NEW" if base_comment_ids and c["id"] not in base_comment_ids else ""
+        p(q, f"## Comment {i} - {c['path']}:{line} by @{c['user']['login']}{new}", "")
+        p(q, f"Thread: {c['html_url']}", "")
+        if meta.get("thread_comment_count", 1) > 1:
+            p(q, f"Thread has {meta['thread_comment_count']} comments - see digest for replies.", "")
+        p(q, f"File: `{c['path']}`", "")
+        p(q, f"Line: `{line}`", "")
+        p(q, "Reviewer text:", "")
+        p(q, "```text", (c.get("body") or "").strip(), "```", "")
+
+    p(q, "## Stale (outdated/resolved) comments", "")
+    p(q, "These threads pre-date the current head and are outdated/resolved. Listed "
+         "for traceability only - no action needed unless a reviewer reopens.", "")
+    stale = [c for c in review_comments
+             if status_of(thread_meta(idx, c)) in ("OUTDATED", "RESOLVED") and is_root(c)]
+    stale.sort(key=lambda c: (status_of(thread_meta(idx, c)), c["path"], c.get("original_line") or 0))
+    for c in stale:
+        line = c.get("line") or c.get("original_line")
+        body = shorten((c.get("body") or "").strip(), 200).replace("\n", " ")
+        p(q, f"- [{status_of(thread_meta(idx, c))}] `{c['path']}:{line}` "
+             f"@{c['user']['login']} ({c['html_url']}): {body}")
+    p(q, "")
+
+    queue_path = outdir / f"curator_pr{pr}_github_comment_queue_{date_us}.md"
+    queue_path.write_text("\n".join(q) + "\n", encoding="utf-8")
+    print(f"wrote {queue_path}  ({queue_path.stat().st_size} bytes)")
+
+
+def main():
+    ap = argparse.ArgumentParser()
+    ap.add_argument("pr", type=int, help="PR number")
+    ap.add_argument("--outdir", default=".curator-pr-review",
+                    help="directory holding pr<N>_*_latest.json (default .curator-pr-review)")
+    ap.add_argument("--today", default=None, help="ISO date for filenames; default today UTC")
+    ap.add_argument("--baseline-ts", default=None,
+                    help="timestamp suffix of a prior pull to mark NEW vs already-seen")
+    ap.add_argument("--prev-head", default=None, help="prior reviewed head SHA for the commit delta")
+    args = ap.parse_args()
+    today = args.today or dt.datetime.utcnow().date().isoformat()
+    build(args.pr, Path(args.outdir), today, args.baseline_ts, args.prev_head)
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
diff --git a/.cursor/skills/review-curator-pr/scripts/pr_review_pull.sh b/.cursor/skills/review-curator-pr/scripts/pr_review_pull.sh
new file mode 100755
index 0000000000..f14f649ed3
--- /dev/null
+++ b/.cursor/skills/review-curator-pr/scripts/pr_review_pull.sh
@@ -0,0 +1,92 @@
+#!/usr/bin/env bash
+# Fetch GitHub data for an NVIDIA-NeMo/Curator PR review.
+#
+# Usage: pr_review_pull.sh <PR_NUMBER> [--outdir DIR] [--repo OWNER/REPO]
+#
+# Writes pr<N>_*_latest.json (consumed by build_digest.py) plus timestamped
+# snapshots so a prior pull is preserved for delta analysis. Pulls six REST
+# endpoints and the GraphQL review threads (which carry isResolved/isOutdated).
+#
+# Requires the GitHub CLI (`gh`) authenticated against github.com.
+set -euo pipefail
+
+PR=""
+REPO="NVIDIA-NeMo/Curator"
+OUTDIR=".curator-pr-review"
+
+while [[ $# -gt 0 ]]; do
+    case "$1" in
+        --outdir) OUTDIR="$2"; shift 2 ;;
+        --repo)   REPO="$2";   shift 2 ;;
+        -h|--help)
+            echo "Usage: pr_review_pull.sh <PR_NUMBER> [--outdir DIR] [--repo OWNER/REPO]"; exit 0 ;;
+        *) PR="$1"; shift ;;
+    esac
+done
+[[ -n "${PR}" ]] || { echo "error: PR number required" >&2; exit 2; }
+
+command -v gh >/dev/null || { echo "error: gh (GitHub CLI) not found" >&2; exit 2; }
+
+mkdir -p "${OUTDIR}"
+TS="$(date -u +%Y%m%dT%H%M%SZ)"
+LOG="${OUTDIR}/pr${PR}_review_pull_${TS}.log"
+
+{
+    echo "=== Pull start ${TS}  PR=${PR}  repo=${REPO} ==="
+    gh --version | head -1
+} | tee -a "${LOG}"
+
+pull_endpoint() {
+    local label="$1"; shift
+    local outfile="$1"; shift
+    echo "--- ${label} -> ${outfile} ---" | tee -a "${LOG}"
+    "$@" > "${outfile}"
+    printf 'bytes=%s\n\n' "$(stat -c%s "${outfile}" 2>/dev/null || wc -c < "${outfile}")" | tee -a "${LOG}"
+}
+
+GH_FIELDS="number,title,state,isDraft,mergeable,mergeStateStatus,headRefName,headRefOid,baseRefName,baseRefOid,additions,deletions,changedFiles,commits,reviewDecision,reviewRequests,labels,milestone,createdAt,updatedAt,closedAt,mergedAt,author,body,url,statusCheckRollup"
+
+pull_endpoint "pr ${PR} metadata" "${OUTDIR}/pr${PR}_gh_${TS}.json" \
+    gh pr view "${PR}" --repo "${REPO}" --json "${GH_FIELDS}"
+pull_endpoint "pulls/${PR}/reviews" "${OUTDIR}/pr${PR}_reviews_${TS}.json" \
+    gh api --paginate "repos/${REPO}/pulls/${PR}/reviews"
+pull_endpoint "pulls/${PR}/comments (inline)" "${OUTDIR}/pr${PR}_review_comments_${TS}.json" \
+    gh api --paginate "repos/${REPO}/pulls/${PR}/comments"
+pull_endpoint "issues/${PR}/comments (top-level)" "${OUTDIR}/pr${PR}_issue_comments_${TS}.json" \
+    gh api --paginate "repos/${REPO}/issues/${PR}/comments"
+pull_endpoint "pulls/${PR}/files" "${OUTDIR}/pr${PR}_files_${TS}.json" \
+    gh api --paginate "repos/${REPO}/pulls/${PR}/files"
+pull_endpoint "pulls/${PR}/commits" "${OUTDIR}/pr${PR}_commits_${TS}.json" \
+    gh api --paginate "repos/${REPO}/pulls/${PR}/commits"
+
+OWNER="${REPO%%/*}"
+NAME="${REPO##*/}"
+pull_endpoint "graphql reviewThreads" "${OUTDIR}/pr${PR}_review_threads_${TS}.json" \
+    gh api graphql \
+      -f query='query($owner:String!,$repo:String!,$pr:Int!){ repository(owner:$owner,name:$repo){ pullRequest(number:$pr){ reviewThreads(first:100){ nodes{ id isResolved isOutdated isCollapsed line originalLine path comments(first:50){ nodes{ databaseId } } } } } } }' \
+      -f owner="${OWNER}" -f repo="${NAME}" -F pr="${PR}"
+
+for kind in gh reviews review_comments issue_comments files commits review_threads; do
+    cp -f "${OUTDIR}/pr${PR}_${kind}_${TS}.json" "${OUTDIR}/pr${PR}_${kind}_latest.json"
+done
+
+{
+    echo "--- counts ---"
+    for kind in reviews review_comments issue_comments files commits; do
+        f="${OUTDIR}/pr${PR}_${kind}_${TS}.json"
+        n=$(python3 -c "import json,sys; d=json.load(open(sys.argv[1])); print(len(d) if isinstance(d,list) else 1)" "${f}")
+        printf 'pr%s_%s: %s entries\n' "${PR}" "${kind}" "${n}"
+    done
+    echo "--- head SHA / activity ---"
+    python3 - "${OUTDIR}/pr${PR}_gh_latest.json" <<'PY'
+import json, sys
+gh = json.loads(open(sys.argv[1]).read())
+print(f"head_oid:   {gh.get('headRefOid')}")
+print(f"base_oid:   {gh.get('baseRefOid')}")
+print(f"state:      {gh.get('state')}  draft={gh.get('isDraft')}  mergeStatus={gh.get('mergeStateStatus')}  reviewDecision={gh.get('reviewDecision')}")
+print(f"counts:     files={gh.get('changedFiles')} +{gh.get('additions')}/-{gh.get('deletions')}")
+print(f"updated_at: {gh.get('updatedAt')}")
+PY
+} | tee -a "${LOG}"
+
+echo "PR${PR}_REVIEW_PULL_DONE  outdir=${OUTDIR}  log=${LOG}" | tee -a "${LOG}"
diff --git a/.cursor/skills/review-curator-pr/templates.md b/.cursor/skills/review-curator-pr/templates.md
new file mode 100644
index 0000000000..0cc6a86db6
--- /dev/null
+++ b/.cursor/skills/review-curator-pr/templates.md
@@ -0,0 +1,134 @@
+# Output templates
+
+`scripts/build_digest.py` emits both files automatically; these templates show
+the exact structure so hand edits and the added findings/verdict narrative stay
+consistent. Replace `<...>` placeholders. Keep the structure stable across
+review passes so re-reviews diff cleanly.
+
+## A. Fresh review digest
+
+File: `curator_pr<N>_fresh_review_<YYYY_MM_DD>.md`
+
+```markdown
+# Curator PR <N> Fresh Review - <YYYY-MM-DD>
+
+Review target: https://github.com/NVIDIA-NeMo/Curator/pull/<N>
+
+Current PR head reviewed: `<headRefOid>`
+
+Base recorded by GitHub metadata: `<baseRefOid>`
+
+Previous reviewed head: `<prior_head_sha>`   # if a prior review exists
+
+## PR state at review time
+
+| Field | Value |
+|---|---|
+| state | <OPEN/CLOSED/MERGED> |
+| isDraft | <bool> |
+| reviewDecision | <APPROVED/CHANGES_REQUESTED/none> |
+| mergeable / mergeStateStatus | <...> / <...> |
+| changedFiles | <n> |
+| additions / deletions | +<a> / -<d> |
+| commits | <n> |
+| inline review comments total | <n> |
+| reviews submitted | <n> |
+| top-level issue comments | <n> |
+| updatedAt | <iso> |
+
+## Commits since prior reviewed head `<prior8>` (<k> new)
+
+```
+<sha8>  <subject>  (<date>)
+```
+
+Files changed in current PR head (`git diff <base>..<head>`):
+
+| File | +/- | status |
+|---|---|---|
+| `<path>` | +<a> -<d> | <added/modified> |
+
+## Reviews
+
+### #<id> by @<login>  state=<STATE>  commit=<sha8>  submitted=<iso>
+<body or empty>
+
+## Inline review comments
+
+Total: <n> comments across <m> threads.
+
+- OPEN: <n>
+- OUTDATED: <n>
+- RESOLVED: <n>
+- ORPHAN: <n>
+
+### `<path>`
+
+- **#<id>** @<login> <iso>  line=<L> commit=<sha8> status=**<OPEN/OUTDATED/RESOLVED>**  thread_comments=<k> (<root/reply>) review_state=<STATE>
+  url: <html_url>
+  > <body, truncated, each line prefixed with "> ">
+
+## Top-level issue comments
+
+### #<id> by @<login>  <iso>
+<body, truncated>
+
+## Findings (reviewer analysis)
+
+### P0: <title>
+<evidence: path:line on current head> - <why blocker> - <recommended fix + test>
+
+### P1 / P2 / P3: <title>
+<...>
+
+## Verdict
+<merge-ready or not, and the blockers>
+```
+
+## B. GitHub comment queue
+
+File: `curator_pr<N>_github_comment_queue_<YYYY_MM_DD>.md`. Only OPEN root
+threads needing an author response or code change; stale threads at the bottom.
+
+```markdown
+# Curator PR <N> GitHub Comment Queue - <YYYY-MM-DD>
+
+Review target: https://github.com/NVIDIA-NeMo/Curator/pull/<N>
+
+Current PR head: `<headRefOid>`
+
+## Comment <i> - <path>:<line> by @<login>
+
+Thread: <html_url>
+
+File: `<path>`
+
+Line: `<line>`
+
+Reviewer text:
+
+```text
+<verbatim reviewer body>
+```
+
+## Stale (outdated/resolved) comments
+
+- [<OUTDATED/RESOLVED>] `<path>:<line>` @<login> (<html_url>): <one-line body>
+```
+
+## C. Reply drafting style
+
+For each OPEN comment the user wants a paste-ready reply for:
+
+- Acknowledge briefly.
+- State status: already addressed (cite the commit SHA) or will-fix.
+- Cite the exact `path:line-range` on the current head as evidence.
+- Describe the concrete mechanism, not just "fixed".
+- Keep each reply self-contained; the reviewer reads it in isolation.
+
+Example:
+
+> Thanks for flagging. Addressed on the current head (`<sha8>`): the writer now
+> drops the waveform key before serialisation and guards `json.dumps`, so a
+> non-serialisable value reports the offending key instead of failing the batch.
+> Citation: `nemo_curator/stages/audio/io/sharded_manifest_writer.py:96-111`.
diff --git a/.gitignore b/.gitignore
index 1ba2c180fd..f99780f76e 100644
--- a/.gitignore
+++ b/.gitignore
@@ -169,3 +169,6 @@ fern/product-docs/
 
 # PDF parsing data (symlinked to external storage)
 pdf_parsing/
+
+# NeMo Curator PR review skill scratch output
+.curator-pr-review/

From 73e346daf72c4c24b7c5261978fec500dd5bb7b4 Mon Sep 17 00:00:00 2001
From: "aaftaabv@gmail.com" <aaftaabv@gmail.com>
Date: Fri, 5 Jun 2026 12:52:34 +0530
Subject: [PATCH 02/10] Scope review-curator skill to the audio modality

Rename the skill to review-curator-audio-pr and refocus its guidance on
audio-modality PRs (nemo_curator/stages/audio, tutorials/audio, audio
benchmarks, tests/stages/audio).

- SKILL.md: audio scope, step 1 confirms the PR touches audio paths, audio
  review lenses, audio-oriented P0-P3 examples.
- reference.md: audio stage map (io, preprocessing, inference asr/vad/
  speaker_diarization, filtering, segmentation, tagging, postprocessing,
  metrics, datasets, alm, advanced_pipelines), audio tutorials, audio
  benchmarks, audio tests/fixtures; anchored on stages/audio/README.md.
- recurring-themes.md: audio-specific lenses - waveform/tensor memory and
  manifest serialization, tarred/sharded I/O, optional audio deps, sample-rate
  and metadata propagation, throughput.
- scripts/README.md: updated paths.

Helper scripts are unchanged (they fetch any PR); the audio focus lives in the
review lenses.

Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>
---
 .../skills/review-curator-audio-pr/SKILL.md   | 155 ++++++++++++++++++
 .../recurring-themes.md                       | 142 ++++++++++++++++
 .../review-curator-audio-pr/reference.md      | 105 ++++++++++++
 .../scripts/README.md                         |   9 +-
 .../scripts/build_digest.py                   |   0
 .../scripts/pr_review_pull.sh                 |   0
 .../templates.md                              |   0
 .cursor/skills/review-curator-pr/SKILL.md     | 140 ----------------
 .../review-curator-pr/recurring-themes.md     | 127 --------------
 .cursor/skills/review-curator-pr/reference.md |  86 ----------
 10 files changed, 407 insertions(+), 357 deletions(-)
 create mode 100644 .cursor/skills/review-curator-audio-pr/SKILL.md
 create mode 100644 .cursor/skills/review-curator-audio-pr/recurring-themes.md
 create mode 100644 .cursor/skills/review-curator-audio-pr/reference.md
 rename .cursor/skills/{review-curator-pr => review-curator-audio-pr}/scripts/README.md (69%)
 rename .cursor/skills/{review-curator-pr => review-curator-audio-pr}/scripts/build_digest.py (100%)
 rename .cursor/skills/{review-curator-pr => review-curator-audio-pr}/scripts/pr_review_pull.sh (100%)
 rename .cursor/skills/{review-curator-pr => review-curator-audio-pr}/templates.md (100%)
 delete mode 100644 .cursor/skills/review-curator-pr/SKILL.md
 delete mode 100644 .cursor/skills/review-curator-pr/recurring-themes.md
 delete mode 100644 .cursor/skills/review-curator-pr/reference.md

diff --git a/.cursor/skills/review-curator-audio-pr/SKILL.md b/.cursor/skills/review-curator-audio-pr/SKILL.md
new file mode 100644
index 0000000000..646902b0fb
--- /dev/null
+++ b/.cursor/skills/review-curator-audio-pr/SKILL.md
@@ -0,0 +1,155 @@
+---
+name: review-curator-audio-pr
+description: >-
+  Review NVIDIA-NeMo/Curator audio-modality pull requests - audio stages
+  (readers/writers, preprocessing, VAD, ASR, speaker diarization, filtering,
+  segmentation, tagging, ALM), audio tutorials, and audio benchmarks. Pulls
+  fresh PR data with the GitHub CLI, diffs the changed code, applies Curator's
+  audio stage contracts and contribution standards, and emits a structured
+  review digest plus a paste-ready comment queue. Use when a contributor or
+  maintainer asks to review, re-review, or triage an audio Curator PR (e.g.
+  "review audio PR 1967", "what's still open on the diarization PR"), or to
+  draft replies to reviewer comments on an audio PR.
+---
+
+# Review NeMo Curator audio PRs
+
+Produce a defensible, reproducible review of an audio-modality
+`NVIDIA-NeMo/Curator` PR from a fresh GitHub pull plus the code diff. Audience:
+the PR author (who must respond to reviewers and land the change) and
+maintainers triaging the audio queue.
+
+This skill is scoped to the **audio** modality: code under
+`nemo_curator/stages/audio/`, `tutorials/audio/`, audio tests under
+`tests/stages/audio/`, and audio benchmarks. For non-audio PRs, the stage-
+contract lenses still apply but the audio-specific guidance below will not.
+
+Requirements: the GitHub CLI (`gh`) authenticated against `github.com`, and a
+local checkout of this repository (the one containing this skill).
+
+## Two deliverables (always)
+
+Each review run writes two Markdown files into a scratch directory (default
+`.curator-pr-review/`, override with `--outdir`):
+
+1. `curator_pr<N>_fresh_review_<YYYY_MM_DD>.md` - full digest: PR state,
+   commits, changed-file table, reviews, every inline comment grouped by file
+   with OPEN / OUTDATED / RESOLVED status, and issue comments.
+2. `curator_pr<N>_github_comment_queue_<YYYY_MM_DD>.md` - only the OPEN root
+   threads that need an author reply or code change, paste-ready, with stale
+   threads listed at the bottom for traceability.
+
+See [templates.md](templates.md) for the exact section layout. Keeping the
+structure stable makes diffs across review passes mechanical, so a re-review
+only surfaces what changed.
+
+## Workflow
+
+```
+- [ ] 1. Identify the PR and confirm it touches audio paths
+- [ ] 2. Pull fresh GitHub data (gh)
+- [ ] 3. Check out the PR head and diff against the base
+- [ ] 4. Read the changed code through the audio review lenses
+- [ ] 5. Generate the digest + comment queue
+- [ ] 6. Classify findings P0-P3 and (optionally) draft replies
+```
+
+### Step 1 - Identify the PR
+
+`gh pr view <N> --repo NVIDIA-NeMo/Curator`. Confirm the diff touches audio
+paths (`nemo_curator/stages/audio/`, `tutorials/audio/`, `tests/stages/audio/`,
+`benchmarking/AUDIO_PROFILING.md` / `ALM_BENCHMARK.md`); if it does not, this
+audio skill is the wrong lens. If a prior review digest for this PR exists, read
+it first and record its head SHA as the baseline so commits and comments can be
+marked NEW vs already-seen.
+
+### Step 2 - Pull fresh GitHub data
+
+Run the pull script (writes `pr<N>_*_latest.json` plus timestamped snapshots so
+prior pulls are preserved for delta analysis):
+
+```bash
+.cursor/skills/review-curator-audio-pr/scripts/pr_review_pull.sh <N>
+```
+
+It pulls six REST endpoints (`pr view`, `reviews`, inline `comments`, issue
+`comments`, `files`, `commits`) and the GraphQL review threads, which carry the
+`isResolved` / `isOutdated` flags the REST inline endpoint omits.
+
+### Step 3 - Diff the code
+
+```bash
+gh pr checkout <N>                       # or: git fetch origin pull/<N>/head:pr-<N>
+git diff origin/main...HEAD             # changed files on this PR
+```
+
+Always review the actual diff, never the comment text alone. Use `main` as the
+baseline for "how the audio stages already do this".
+
+### Step 4 - Read through the audio review lenses
+
+Apply the lenses in [recurring-themes.md](recurring-themes.md): stage contracts,
+setup/teardown lifecycle, audio optional-dependency hygiene (vLLM, NeMo,
+model utils), secret-safe logging, waveform/tensor memory and manifest
+serialization, tarred/sharded audio I/O, sample-rate and metadata propagation,
+streaming/throughput, tutorials/docs, tests/coverage, and PR reviewability.
+
+These build on the project's always-on rules in `.cursor/rules/`
+(`processing-stage-patterns`, `composite-stage-patterns`, `task-patterns`,
+`executors`, `resources-configuration`, `pipeline-structure`,
+`modality-structure`, `coding-standards`) and the audio stage guide at
+`nemo_curator/stages/audio/README.md`. Treat those as canonical and cite them
+when a change deviates.
+
+### Step 5 - Generate the two files
+
+```bash
+.cursor/skills/review-curator-audio-pr/scripts/build_digest.py <N> --today <YYYY-MM-DD>
+```
+
+The builder joins the pulled JSON and classifies each comment by thread status:
+**OPEN** (actionable), **OUTDATED** (pre-dates the current head), **RESOLVED**.
+The comment queue lists only OPEN root threads.
+
+### Step 6 - Findings and replies
+
+In the digest's findings section, classify each issue by severity:
+
+- **P0** - merge blocker: data loss / crash on a valid audio config (e.g. a
+  manifest writer that crashes on a kept waveform), a secret (HF token) leaked
+  into logs, or an import that breaks a supported install.
+- **P1** - fix before merge: undeclared audio runtime dependencies, missing
+  required tests/coverage, environment-specific I/O in a reusable reader, a PR
+  too large to review safely.
+- **P2** - should fix: duplicated pipeline construction, inefficient per-task
+  writer batching, incomplete throughput metrics, documented-but-unwired config.
+- **P3** - nice to have: temporary model/compat workarounds needing a test +
+  comment, debug-only knobs (e.g. per-shard utterance caps) to document.
+
+When drafting replies, cite the exact `path:line-range` on the current head,
+state whether the point is already addressed (with the commit SHA) or will be,
+and keep each reply self-contained. See [templates.md](templates.md) section C.
+
+## Conventions to enforce (from CONTRIBUTING.md and rules)
+
+- **DCO sign-off** is required on every commit (`git commit -s`).
+- **Tests** must accompany source changes; GPU-only audio tests use
+  `@pytest.mark.gpu`; the project enforces 80% coverage under `nemo_curator/`,
+  and `tests/stages/audio/` mirrors `nemo_curator/stages/audio/`.
+- **NVIDIA Apache-2.0 copyright header** on every non-empty Python file.
+- **Ruff** lint/format, 119-char lines; **loguru** for logging.
+- Supported Python: 3.10-3.12.
+
+## Knowledge sources
+
+Full index in [reference.md](reference.md). Quick map:
+
+| Need | Source |
+|------|--------|
+| Audio stage guide | `nemo_curator/stages/audio/README.md` |
+| Stage / pipeline contracts | `.cursor/rules/*.mdc`, `nemo_curator/stages/base.py` |
+| Audio tutorials | `tutorials/audio/` |
+| Backends / executors | `nemo_curator/backends/`, `nemo_curator/pipeline/pipeline.py` |
+| Audio perf expectations | `benchmarking/AUDIO_PROFILING.md`, `benchmarking/ALM_BENCHMARK.md` |
+| Contribution rules | `CONTRIBUTING.md` |
+| Live PR data | `gh` (see scripts) |
diff --git a/.cursor/skills/review-curator-audio-pr/recurring-themes.md b/.cursor/skills/review-curator-audio-pr/recurring-themes.md
new file mode 100644
index 0000000000..38b5c826e9
--- /dev/null
+++ b/.cursor/skills/review-curator-audio-pr/recurring-themes.md
@@ -0,0 +1,142 @@
+# Review lenses - recurring themes in Curator audio PRs
+
+Apply each lens to the diff and cite exact `path:line` evidence. These extend
+the always-on rules in `.cursor/rules/` and the audio guide at
+`nemo_curator/stages/audio/README.md`; when a change violates a rule, link it by
+name.
+
+## Stage contracts and reuse
+
+- An audio stage subclasses `ProcessingStage[InputTask, OutputTask]` with
+  explicit `inputs()` / `outputs()`, a meaningful `name`, and appropriate
+  `resources` / `batch_size` (set as class attributes / dataclass fields, never
+  as the read-only `_name` / `_resources` / `_batch_size` properties).
+- GPU/model work (ASR, VAD, diarization, tagging inference) belongs in
+  `process_batch()` with a real `batch_size`; `__init__` stays lightweight (it
+  runs on the driver).
+- Reuse existing audio readers, manifest writers, and shared model abstractions
+  instead of bespoke orchestration. Model a tarred/sharded reader as a
+  `CompositeStage` = a discovery stage plus a shard-reader fan-out.
+- If `inputs()` does not declare a required key (e.g. `waveform_key`,
+  `sample_rate_key`), runtime validation will not check it, and a direct
+  `task.data[key]` access then raises an opaque `KeyError` instead of a clear
+  validation error. Declare required keys.
+- Fan-out shard readers need deterministic child IDs and must propagate
+  `_metadata` / `_stage_perf`. First-stage shard discovery from an empty input
+  should constrain placement (e.g. one worker per node) so discovery is not
+  duplicated across every worker.
+
+## Setup / teardown lifecycle and audio dependencies
+
+- `setup_on_node()` = once-per-node prep (model download / shared cache).
+  `setup()` = per-worker init (model load, GPU allocation). `teardown()` frees
+  GPU memory. Don't conflate them.
+- Import heavy/optional audio deps (vLLM, NeMo, model-specific utils,
+  soundfile/lhotse, fasttext) lazily inside `setup()` or behind a guarded
+  import, never unconditionally at module top level, when they ship only with an
+  optional extra. A top-level `import <optional_pkg>` breaks
+  `import nemo_curator.stages.audio...` on any install without that extra
+  (including CPU-only / Mac/ARM).
+- Declare the full audio runtime dependency set in the appropriate
+  `pyproject.toml` optional extra (e.g. an audio CUDA extra); add an import
+  smoke test under that extra. Don't rely on an external Docker image to supply
+  dependencies a reusable audio stage needs.
+- Avoid hard version pins (`pkg==x.y.z`) and global overrides for libraries
+  shared across modalities (`transformers`, `huggingface-hub`, `accelerate`);
+  prefer compatible ranges so other extras don't break.
+
+## Secret-safe logging
+
+- Never log a full resolved config; audio tutorials commonly accept an HF token
+  / cloud credentials. Redact secret-like keys before logging (token, password,
+  secret, credentials, `*_token`/`*_secret`), or log only a non-secret subset,
+  and add a test asserting the secret value never appears in logged output.
+
+## Waveform/tensor memory and manifest serialization
+
+- Drop waveform arrays / audio tensors from `task.data` once a stage no longer
+  needs them. A leaked numpy `ndarray` or torch tensor that reaches a JSON
+  manifest writer raises `TypeError: Object of type ndarray is not JSON
+  serializable` and can fail the entire shard/batch, or cause OOM at scale.
+  Strip by waveform key plus a duck-typed guard (`.shape`/`.dtype`), and wrap
+  `json.dumps` so a stray non-serializable value reports the offending key.
+- Guard audio I/O edge cases: `tarfile.extractfile()` returns `None` for
+  non-regular members even after `isfile()` - check before `.read()`. Handle
+  resampling / channel-count / dtype mismatches explicitly.
+
+## Tarred / sharded audio I/O
+
+- Keep infrastructure assumptions (specific object-store URLs, auth env vars,
+  cluster-local paths) out of the generic reader/writer. Make remote access
+  pluggable: accept local paths, ordinary cloud URIs, or a user-provided
+  opener/resolver, and isolate any environment-specific support behind a small,
+  documented, optional component. For piped/streamed downloads use
+  failure-aware flags + retries so truncated streams fail loudly.
+- Manifest/shard writers that loop per task inside `process_batch()` reopen the
+  output file repeatedly; group by shard and open each output once per batch.
+  Preserve `.done`/checkpoint markers so runs are resumable.
+
+## Streaming, throughput, benchmarking
+
+- Streaming execution with per-stage resource configuration is the default
+  model for scalable audio pipelines (see `executors.mdc`).
+- Emit stage metrics through the supported metrics path (`metrics/`), JSON-
+  shaped. For multi-node audio runs include worker/node/GPU identity and latency
+  distributions (p50/p95/p99) plus audio-seconds-per-wall-second, not just
+  totals/averages.
+- Standalone audio benchmark scripts belong in the `benchmarking/` flow
+  (`AUDIO_PROFILING.md`, `ALM_BENCHMARK.md`) with config entries and comparable
+  parameters - not freestanding.
+
+## Tutorials and docs
+
+- Audio tutorials (`tutorials/audio/`) must be runnable, minimal, and use public
+  APIs only - no private APIs, no machine-local absolute paths, no committed
+  noisy notebook output, no executor mismatch between a notebook and its CLI.
+  Prefer `uv` for environment setup if the tutorial standard does.
+- Don't duplicate pipeline-construction logic between a tutorial entrypoint and
+  an `examples/` script; extract one shared builder both call.
+- Every config key documented for users must actually reach a stage; validate
+  config-to-stage mapping and fail (or warn) on unused keys.
+- Don't commit generated analysis/scratch docs into the source tree.
+
+## Tests and standards (CONTRIBUTING + coding-standards)
+
+- Source changes need accompanying tests; `tests/stages/audio/` mirrors
+  `nemo_curator/stages/audio/`. The project enforces 80% coverage under
+  `nemo_curator/`. GPU-only audio tests use `@pytest.mark.gpu`; reuse fixtures
+  under `tests/fixtures/audio/` and sample data under `tests/data/audio/`.
+- Every non-empty Python file carries the NVIDIA Apache-2.0 copyright header.
+- Ruff clean at 119-char lines; type annotations on functions; loguru for logs.
+- Every commit is DCO signed-off.
+
+## PR size and reviewability
+
+- Large multi-stage audio PRs are hard to review and defend. Prefer splitting by
+  stage / module / tutorial / benchmark (e.g. reader+writer plumbing, then
+  inference stages, then postprocessing, then tutorial/config). If a single PR
+  must remain, organize it as if sliced: no duplicated builder logic, clear
+  module ownership, a reviewable commit sequence.
+
+## Smaller recurring nits
+
+- Missing copyright headers on new files.
+- Top-level imports that should be lazy/guarded (very common on audio model
+  modules).
+- Long inline prompt/config strings that should be module constants.
+- Dataclass fields validated in `__post_init__` that could just be typed.
+- Dead conditionals / redundant truthiness checks.
+- Pytests added for tutorials that don't warrant them.
+- Debug-only knobs (e.g. `max_utterances_per_shard`) presented as features -
+  document them as debug throttles.
+- `trust_remote_code=True` hardcoded on a model load - expose as a parameter so
+  security-conscious users can opt out.
+
+## Severity mapping
+
+| Severity | Meaning |
+|----------|---------|
+| P0 | Merge blocker: crash/data-loss on a valid audio config, secret leaked to logs, import breaks a supported install |
+| P1 | Fix before merge: undeclared audio deps, missing required tests/coverage, env-specific I/O in reusable reader, PR too large |
+| P2 | Should fix: duplicated pipeline construction, inefficient writer batching, incomplete throughput metrics, documented-but-unwired config |
+| P3 | Nice to have: temporary model/compat workaround needing test+comment, debug knob docs |
diff --git a/.cursor/skills/review-curator-audio-pr/reference.md b/.cursor/skills/review-curator-audio-pr/reference.md
new file mode 100644
index 0000000000..b6ccf9ee7a
--- /dev/null
+++ b/.cursor/skills/review-curator-audio-pr/reference.md
@@ -0,0 +1,105 @@
+# Reference - NeMo Curator audio PR review
+
+All paths are relative to the repository root. The skill is self-contained: it
+relies only on this repository plus the GitHub CLI (`gh`). Scope: the **audio**
+modality.
+
+## Primary audio guide
+
+- `nemo_curator/stages/audio/README.md` - audio stage contracts, `process` vs
+  `process_batch`, lazy/optional imports, metrics. Read this first for any
+  audio review.
+
+## Canonical contracts (`.cursor/rules/`)
+
+Always-on rules describing how stages and pipelines must be written. Cite them
+when a PR deviates.
+
+| Rule | Covers |
+|------|--------|
+| `processing-stage-patterns.mdc` | `ProcessingStage` subclassing, `inputs()`/`outputs()`, `process()`, lifecycle hooks |
+| `composite-stage-patterns.mdc` | `CompositeStage` decomposition (e.g. a reader = discovery + shard fan-out) |
+| `task-patterns.mdc` | `Task` / typed task subclasses, metadata flow |
+| `executors.mdc` | Xenna / Ray Data backends and when to use each |
+| `resources-configuration.mdc` | `Resources`, `batch_size`, GPU allocation |
+| `pipeline-structure.mdc` | Pipeline assembly |
+| `modality-structure.mdc` | Where modality code/tests/tutorials live |
+| `coding-standards.mdc` | Ruff, copyright header, loguru, pytest, Python versions |
+
+## Audio stage map (diff ground truth)
+
+`nemo_curator/stages/audio/` is organized as:
+
+| Area | Path |
+|------|------|
+| Base contract | `nemo_curator/stages/base.py` |
+| I/O (tarred/sharded readers, manifest writers) | `nemo_curator/stages/audio/io/` |
+| Preprocessing | `nemo_curator/stages/audio/preprocessing/` |
+| Inference: ASR / VAD / speaker diarization | `nemo_curator/stages/audio/inference/{asr,vad,speaker_diarization}/` |
+| Filtering (band, SigMOS, etc.) | `nemo_curator/stages/audio/filtering/` |
+| Segmentation (speaker separation) | `nemo_curator/stages/audio/segmentation/` |
+| Tagging (inference + text) | `nemo_curator/stages/audio/tagging/` |
+| Postprocessing | `nemo_curator/stages/audio/postprocessing/` |
+| Metrics / throughput | `nemo_curator/stages/audio/metrics/` |
+| Datasets (fleurs, readspeech) | `nemo_curator/stages/audio/datasets/` |
+| ALM (audio-language pretrain) | `nemo_curator/stages/audio/alm/` |
+| Composite pipelines | `nemo_curator/stages/audio/advanced_pipelines/` |
+
+Backends (how stages run): `nemo_curator/backends/` (`base.py`, `xenna/`,
+`ray_data/`, `ray_actor_pool/`); pipeline assembly:
+`nemo_curator/pipeline/pipeline.py`; task types: `nemo_curator/tasks/`.
+
+## Audio tutorials
+
+`tutorials/audio/` - `alm/`, `audio_pretrain/`, `callhome_diar/`, `fleurs/`,
+`readspeech/`, `single_speaker_filter/`, `tagging/`. Tutorials must be runnable,
+use public APIs only, and not duplicate pipeline-building logic that belongs in
+a shared builder.
+
+## Audio performance / benchmarking
+
+- `benchmarking/README.md` - how benchmarks are wired and run.
+- `benchmarking/AUDIO_PROFILING.md` - audio benchmark / perf expectations.
+- `benchmarking/ALM_BENCHMARK.md` - ALM pipeline benchmark.
+
+Standalone audio benchmark scripts should be integrated into this flow, not left
+freestanding.
+
+## Audio tests
+
+`tests/stages/audio/` mirrors `nemo_curator/stages/audio/` (io, inference,
+filtering, segmentation, tagging, postprocessing, preprocessing, metrics,
+datasets, alm, advanced_pipelines). Audio fixtures live under
+`tests/fixtures/audio/` and `tests/data/audio/`. GPU-only tests are marked
+`@pytest.mark.gpu`.
+
+## Contribution rules
+
+- `CONTRIBUTING.md` - DCO sign-off (`git commit -s`), PR process, test
+  requirements, coverage threshold, copyright header.
+
+## GitHub data via `gh`
+
+`scripts/pr_review_pull.sh <N>` pulls these into the scratch outdir:
+
+| File | Endpoint |
+|------|----------|
+| `pr<N>_gh_latest.json` | `gh pr view <N> --json ...` |
+| `pr<N>_reviews_latest.json` | `repos/:owner/:repo/pulls/<N>/reviews` |
+| `pr<N>_review_comments_latest.json` | `repos/:owner/:repo/pulls/<N>/comments` (inline) |
+| `pr<N>_issue_comments_latest.json` | `repos/:owner/:repo/issues/<N>/comments` |
+| `pr<N>_files_latest.json` | `repos/:owner/:repo/pulls/<N>/files` |
+| `pr<N>_commits_latest.json` | `repos/:owner/:repo/pulls/<N>/commits` |
+| `pr<N>_review_threads_latest.json` | GraphQL `pullRequest.reviewThreads` (isResolved/isOutdated) |
+
+The REST inline-comments endpoint does not expose resolve/outdate state; the
+GraphQL thread payload does. The builder joins them by comment `databaseId`
+(== REST `id`), falling back to a `(path, body-prefix)` match when an older
+thread dump lacks `databaseId`.
+
+## Scripts
+
+- `scripts/pr_review_pull.sh <N> [--outdir DIR] [--repo OWNER/REPO]` - fetch.
+- `scripts/build_digest.py <N> [--outdir DIR] [--today YYYY-MM-DD] [--prev-head SHA] [--baseline-ts TS]` - render the digest + comment queue.
+
+Default outdir: `.curator-pr-review/` (scratch; safe to delete or gitignore).
diff --git a/.cursor/skills/review-curator-pr/scripts/README.md b/.cursor/skills/review-curator-audio-pr/scripts/README.md
similarity index 69%
rename from .cursor/skills/review-curator-pr/scripts/README.md
rename to .cursor/skills/review-curator-audio-pr/scripts/README.md
index 17f8f23d42..b729dee15c 100644
--- a/.cursor/skills/review-curator-pr/scripts/README.md
+++ b/.cursor/skills/review-curator-audio-pr/scripts/README.md
@@ -1,4 +1,4 @@
-# Scripts - review-curator-pr
+# Scripts - review-curator-audio-pr
 
 Both require the GitHub CLI (`gh`) authenticated against github.com. They write
 to a scratch directory (default `.curator-pr-review/`) that is safe to delete;
@@ -7,7 +7,7 @@ do not commit its contents.
 ## pr_review_pull.sh
 
 ```bash
-.cursor/skills/review-curator-pr/scripts/pr_review_pull.sh <PR_NUMBER> [--outdir DIR] [--repo OWNER/REPO]
+.cursor/skills/review-curator-audio-pr/scripts/pr_review_pull.sh <PR_NUMBER> [--outdir DIR] [--repo OWNER/REPO]
 ```
 
 Pulls six REST endpoints (`pr view`, `reviews`, inline `comments`, issue
@@ -17,7 +17,7 @@ Pulls six REST endpoints (`pr view`, `reviews`, inline `comments`, issue
 ## build_digest.py
 
 ```bash
-.cursor/skills/review-curator-pr/scripts/build_digest.py <PR_NUMBER> [--outdir DIR] [--today YYYY-MM-DD] [--prev-head SHA] [--baseline-ts TS]
+.cursor/skills/review-curator-audio-pr/scripts/build_digest.py <PR_NUMBER> [--outdir DIR] [--today YYYY-MM-DD] [--prev-head SHA] [--baseline-ts TS]
 ```
 
 Joins the `pr<N>_*_latest.json` files and writes
@@ -29,4 +29,5 @@ Joins the `pr<N>_*_latest.json` files and writes
 
 The thread join uses comment `databaseId` when present, else a
 (path, body-prefix) fallback, so both GraphQL thread-dump shapes classify
-correctly.
+correctly. The scripts fetch any PR; the audio focus is applied in the review
+lenses (see ../recurring-themes.md).
diff --git a/.cursor/skills/review-curator-pr/scripts/build_digest.py b/.cursor/skills/review-curator-audio-pr/scripts/build_digest.py
similarity index 100%
rename from .cursor/skills/review-curator-pr/scripts/build_digest.py
rename to .cursor/skills/review-curator-audio-pr/scripts/build_digest.py
diff --git a/.cursor/skills/review-curator-pr/scripts/pr_review_pull.sh b/.cursor/skills/review-curator-audio-pr/scripts/pr_review_pull.sh
similarity index 100%
rename from .cursor/skills/review-curator-pr/scripts/pr_review_pull.sh
rename to .cursor/skills/review-curator-audio-pr/scripts/pr_review_pull.sh
diff --git a/.cursor/skills/review-curator-pr/templates.md b/.cursor/skills/review-curator-audio-pr/templates.md
similarity index 100%
rename from .cursor/skills/review-curator-pr/templates.md
rename to .cursor/skills/review-curator-audio-pr/templates.md
diff --git a/.cursor/skills/review-curator-pr/SKILL.md b/.cursor/skills/review-curator-pr/SKILL.md
deleted file mode 100644
index cf6a7eb657..0000000000
--- a/.cursor/skills/review-curator-pr/SKILL.md
+++ /dev/null
@@ -1,140 +0,0 @@
----
-name: review-curator-pr
-description: >-
-  Review NVIDIA-NeMo/Curator pull requests against the project's stage
-  contracts and contribution standards. Pulls fresh PR data with the GitHub
-  CLI, diffs the changed code, applies Curator's documented review lenses, and
-  emits a structured review digest plus a paste-ready comment queue. Use when a
-  contributor or maintainer asks to review, re-review, or triage a Curator PR
-  (e.g. "review PR 1967", "what's still open on PR 1898", "triage this PR"), or
-  to draft replies to reviewer comments.
----
-
-# Review NeMo Curator PRs
-
-Produce a defensible, reproducible review of an `NVIDIA-NeMo/Curator` PR from a
-fresh GitHub pull plus the code diff. The audience is the PR author (who must
-respond to reviewers and land the change) and maintainers triaging the queue.
-
-Requirements: the GitHub CLI (`gh`) authenticated against `github.com`, and a
-local checkout of this repository (the one containing this skill).
-
-## Two deliverables (always)
-
-Each review run writes two Markdown files into a scratch directory (default
-`.curator-pr-review/`, override with `--outdir`):
-
-1. `curator_pr<N>_fresh_review_<YYYY_MM_DD>.md` - full digest: PR state,
-   commits, changed-file table, reviews, every inline comment grouped by file
-   with OPEN / OUTDATED / RESOLVED status, and issue comments.
-2. `curator_pr<N>_github_comment_queue_<YYYY_MM_DD>.md` - only the OPEN root
-   threads that need an author reply or code change, paste-ready, with stale
-   threads listed at the bottom for traceability.
-
-See [templates.md](templates.md) for the exact section layout. Keeping the
-structure stable makes diffs across review passes mechanical, so a re-review
-only surfaces what changed.
-
-## Workflow
-
-```
-- [ ] 1. Identify the PR number
-- [ ] 2. Pull fresh GitHub data (gh)
-- [ ] 3. Check out the PR head and diff against the base
-- [ ] 4. Read the changed code through the review lenses
-- [ ] 5. Generate the digest + comment queue
-- [ ] 6. Classify findings P0-P3 and (optionally) draft replies
-```
-
-### Step 1 - Identify the PR
-
-Confirm the target: `gh pr view <N> --repo NVIDIA-NeMo/Curator`. If a prior
-review digest for this PR exists, read it first and record its head SHA as the
-baseline so commits and comments can be marked NEW vs already-seen.
-
-### Step 2 - Pull fresh GitHub data
-
-Run the pull script (writes `pr<N>_*_latest.json` plus timestamped snapshots so
-prior pulls are preserved for delta analysis):
-
-```bash
-.cursor/skills/review-curator-pr/scripts/pr_review_pull.sh <N>
-```
-
-It pulls six REST endpoints (`pr view`, `reviews`, inline `comments`, issue
-`comments`, `files`, `commits`) and the GraphQL review threads, which carry the
-`isResolved` / `isOutdated` flags the REST inline endpoint omits.
-
-### Step 3 - Diff the code
-
-```bash
-gh pr checkout <N>                       # or: git fetch origin pull/<N>/head:pr-<N>
-git diff origin/main...HEAD             # changed files on this PR
-```
-
-Always review the actual diff, never the comment text alone. Use `main` as the
-baseline for "how the project already does this".
-
-### Step 4 - Read through the review lenses
-
-Apply the lenses in [recurring-themes.md](recurring-themes.md): stage
-contracts, setup/teardown lifecycle, dependency and optional-extra hygiene,
-secret-safe logging, memory/serialization of large payloads, streaming and
-performance, tutorials/docs, tests/coverage, and PR size/reviewability.
-
-These build on the project's own always-on rules in `.cursor/rules/`
-(`processing-stage-patterns`, `composite-stage-patterns`, `task-patterns`,
-`executors`, `resources-configuration`, `pipeline-structure`,
-`modality-structure`, `coding-standards`) - treat those as the canonical
-contract and cite them when a change deviates.
-
-### Step 5 - Generate the two files
-
-```bash
-.cursor/skills/review-curator-pr/scripts/build_digest.py <N> --today <YYYY-MM-DD>
-```
-
-The builder joins the pulled JSON and classifies each comment by thread status:
-**OPEN** (actionable), **OUTDATED** (pre-dates the current head), **RESOLVED**.
-The comment queue lists only OPEN root threads.
-
-### Step 6 - Findings and replies
-
-In the digest's findings section, classify each issue by severity:
-
-- **P0** - merge blocker: data loss / crash on a valid config, a secret leaked
-  into logs, or an import that breaks a supported install.
-- **P1** - fix before merge: undeclared runtime dependencies, missing required
-  tests/coverage, environment-specific code in a reusable path, a PR too large
-  to review safely.
-- **P2** - should fix: duplicated logic, inefficient batch handling, incomplete
-  metrics/observability, documented-but-unwired config.
-- **P3** - nice to have: temporary workarounds that need a test + comment,
-  debug-only knobs to document.
-
-When drafting replies, cite the exact `path:line-range` on the current head,
-state whether the point is already addressed (with the commit SHA) or will be,
-and keep each reply self-contained. See [templates.md](templates.md) section C.
-
-## Conventions to enforce (from CONTRIBUTING.md and rules)
-
-- **DCO sign-off** is required on every commit (`git commit -s`).
-- **Tests** must accompany source changes; GPU-only tests use
-  `@pytest.mark.gpu`; the project enforces 80% coverage under `nemo_curator/`,
-  and `tests/` mirrors the `nemo_curator/` layout.
-- **NVIDIA Apache-2.0 copyright header** on every non-empty Python file.
-- **Ruff** lint/format, 119-char lines; **loguru** for logging.
-- Supported Python: 3.10-3.12.
-
-## Knowledge sources
-
-Full index in [reference.md](reference.md). Quick map:
-
-| Need | Source |
-|------|--------|
-| Stage / pipeline contracts | `.cursor/rules/*.mdc`, `nemo_curator/stages/base.py` |
-| Modality guides | `nemo_curator/stages/<modality>/README.md`, `tutorials/<modality>/README.md` |
-| Backends / executors | `nemo_curator/backends/`, `nemo_curator/pipeline/pipeline.py` |
-| Perf expectations | `benchmarking/` |
-| Contribution rules | `CONTRIBUTING.md` |
-| Live PR data | `gh` (see scripts) |
diff --git a/.cursor/skills/review-curator-pr/recurring-themes.md b/.cursor/skills/review-curator-pr/recurring-themes.md
deleted file mode 100644
index d0599d2276..0000000000
--- a/.cursor/skills/review-curator-pr/recurring-themes.md
+++ /dev/null
@@ -1,127 +0,0 @@
-# Review lenses - recurring themes in Curator PRs
-
-Apply each lens to the diff and cite exact `path:line` evidence. These extend
-the always-on rules in `.cursor/rules/`; when a change violates a rule, link the
-rule by name. Most points are modality-agnostic; audio examples are
-illustrative.
-
-## Stage contracts and reuse
-
-- A stage subclasses `ProcessingStage[InputTask, OutputTask]` with explicit
-  `inputs()` / `outputs()`, a meaningful `name`, and appropriate `resources` /
-  `batch_size` (set as class attributes / dataclass fields, never as the
-  read-only `_name` / `_resources` / `_batch_size` properties).
-- GPU or vectorizable work belongs in `process_batch()` with a real
-  `batch_size`; `__init__` stays lightweight (it runs on the driver).
-- Reuse existing readers, writers, and shared model abstractions instead of
-  bespoke orchestration. Use `CompositeStage` to present several internal stages
-  as one user-facing stage.
-- If `inputs()` does not declare a required key, runtime validation will not
-  check it, and a direct `task.data[key]` access then raises an opaque
-  `KeyError` instead of a clear validation error. Declare required keys.
-- Fan-out stages (one task -> many) need deterministic child IDs and propagated
-  `_metadata` / `_stage_perf`. First-stage discovery from an empty input should
-  constrain placement so work is not duplicated across every worker.
-
-## Setup / teardown lifecycle and dependencies
-
-- `setup_on_node()` = once-per-node prep (e.g. model downloads / shared cache).
-  `setup()` = per-worker init (model load, GPU allocation). `teardown()` frees
-  resources. Don't conflate them.
-- Import heavy or optional dependencies lazily inside `setup()` (or behind a
-  guarded import), never unconditionally at module top level, when the dependency
-  ships only with an optional extra. A top-level `import <optional_pkg>` breaks
-  `import nemo_curator...` on any install without that extra.
-- Declare the full runtime dependency set in the appropriate `pyproject.toml`
-  optional extra; add an import smoke test under that extra. Don't rely on an
-  external Docker image to supply dependencies a reusable stage needs.
-- Avoid hard version pins (`pkg==x.y.z`) and global overrides for libraries
-  shared across modalities (e.g. `transformers`, `huggingface-hub`,
-  `accelerate`); prefer compatible ranges so other extras don't break.
-
-## Secret-safe logging
-
-- Never log a full resolved config object; it can contain tokens / credentials
-  (e.g. an HF token, access keys). Redact secret-like keys before logging, or
-  log only a non-secret subset, and add a test asserting the secret value never
-  appears in logged output.
-
-## Memory and serialization of large payloads
-
-- Drop large in-memory payloads (waveform arrays, image/video tensors) from
-  `task.data` once a stage no longer needs them. Leaked arrays reach downstream
-  JSON writers and raise `TypeError: Object of type ndarray is not JSON
-  serializable` (which can fail an entire batch) or cause memory blowups. Strip
-  by key plus a duck-typed guard, and wrap `json.dumps` so a stray
-  non-serializable value reports the offending key instead of crashing.
-- Guard external I/O edge cases (e.g. `tarfile.extractfile()` returns `None` for
-  non-regular members even after `isfile()`; check before `.read()`).
-
-## Streaming, performance, benchmarking
-
-- Streaming execution with per-stage resource configuration is the default model
-  for scalable pipelines (see `executors.mdc`).
-- Emit stage metrics through the supported metrics path; keep records
-  JSON-shaped. For multi-node analysis, include worker/node/GPU identity and
-  latency distributions (p50/p95/p99), not just totals/averages.
-- Writers that loop per task inside `process_batch()` reopen output files
-  repeatedly; group by output target and open once per batch.
-- Standalone benchmark scripts belong in the `benchmarking/` flow with config
-  entries and comparable parameters - not freestanding.
-
-## Environment-specific code in reusable paths
-
-- Keep infrastructure assumptions (specific object-store URLs, auth env vars,
-  cluster-specific paths) out of generic reader/writer code. Make remote access
-  pluggable: accept local paths, ordinary cloud URIs, or a user-provided
-  opener/resolver, and isolate any environment-specific support behind a small,
-  documented, optional component.
-
-## Tutorials and docs
-
-- Tutorials must be runnable, minimal, and use public APIs only - no private
-  APIs, no machine-local absolute paths, no committed noisy notebook output, no
-  executor mismatch between a notebook and its CLI. Prefer `uv` for environment
-  setup if the tutorial standard does.
-- Don't duplicate pipeline-construction logic between a tutorial entrypoint and
-  an `examples/` script; extract one shared builder both call.
-- Every config key documented for users must actually reach a stage; validate
-  config-to-stage mapping and fail (or warn) on unused keys.
-- Don't commit generated analysis/scratch docs into the source tree.
-
-## Tests and standards (CONTRIBUTING + coding-standards)
-
-- Source changes need accompanying tests; `tests/` mirrors `nemo_curator/`. The
-  project enforces 80% coverage under `nemo_curator/`. GPU-only tests use
-  `@pytest.mark.gpu`.
-- Every non-empty Python file carries the NVIDIA Apache-2.0 copyright header.
-- Ruff clean at 119-char lines; type annotations on functions; loguru for logs.
-- Every commit is DCO signed-off.
-
-## PR size and reviewability
-
-- Large multi-purpose PRs are hard to review and to defend. Prefer splitting by
-  stage / module / tutorial / benchmark. If a single PR must remain, organize it
-  as if sliced: no duplicated builder logic, clear module ownership, and a
-  reviewable commit sequence.
-
-## Smaller recurring nits
-
-- Missing copyright headers on new files.
-- Top-level imports that should be lazy/guarded.
-- Long inline strings (prompts, configs) that should be module constants.
-- Dataclass fields validated in `__post_init__` that could just be typed.
-- Dead conditionals / redundant truthiness checks.
-- Tests added for tutorials that don't warrant them.
-- Debug-only knobs presented as features - document them as debug throttles.
-- `trust_remote_code=True` (or similar) hardcoded - expose as a parameter so
-  security-conscious users can opt out.
-
-## Severity mapping
-
-| Severity | Meaning |
-|----------|---------|
-| P0 | Merge blocker: crash/data-loss on a valid config, secret leaked to logs, import breaks a supported install |
-| P1 | Fix before merge: undeclared deps, missing required tests/coverage, env-specific code in reusable path, PR too large |
-| P2 | Should fix: duplicated logic, inefficient batching, incomplete metrics, documented-but-unwired config |
-| P3 | Nice to have: temporary workaround needing test+comment, debug knob docs |
diff --git a/.cursor/skills/review-curator-pr/reference.md b/.cursor/skills/review-curator-pr/reference.md
deleted file mode 100644
index 40bdd0c0cc..0000000000
--- a/.cursor/skills/review-curator-pr/reference.md
+++ /dev/null
@@ -1,86 +0,0 @@
-# Reference - NeMo Curator PR review
-
-All paths are relative to the repository root. The skill is self-contained: it
-relies only on this repository plus the GitHub CLI (`gh`).
-
-## Canonical contracts (`.cursor/rules/`)
-
-These always-on rules are the authoritative description of how stages and
-pipelines must be written. Cite them when a PR deviates.
-
-| Rule | Covers |
-|------|--------|
-| `processing-stage-patterns.mdc` | `ProcessingStage` subclassing, `inputs()`/`outputs()`, `process()`, lifecycle hooks |
-| `composite-stage-patterns.mdc` | `CompositeStage` decomposition |
-| `task-patterns.mdc` | `Task` / typed task subclasses, metadata flow |
-| `executors.mdc` | Xenna / Ray Data backends and when to use each |
-| `resources-configuration.mdc` | `Resources`, `batch_size`, GPU allocation |
-| `pipeline-structure.mdc` | Pipeline assembly |
-| `modality-structure.mdc` | Where modality code/tests/tutorials live |
-| `coding-standards.mdc` | Ruff, copyright header, loguru, pytest, Python versions |
-
-## Framework code (diff ground truth)
-
-- `nemo_curator/stages/base.py` - `ProcessingStage` base class.
-- `nemo_curator/backends/` - `base.py`, `xenna/`, `ray_data/`,
-  `ray_actor_pool/`; adapters define how stages run on each backend.
-- `nemo_curator/pipeline/pipeline.py` - pipeline construction and guards.
-- `nemo_curator/tasks/` - task types.
-
-## Modality guides
-
-| Modality | Stage README | Tutorials |
-|----------|--------------|-----------|
-| audio | `nemo_curator/stages/audio/README.md` | `tutorials/audio/` |
-| interleaved | `nemo_curator/stages/interleaved/README.md` | `tutorials/interleaved/` |
-| text / image / video / math / synthetic | (stage code under `nemo_curator/stages/`) | `tutorials/{text,image,video,math,synthetic}/` |
-
-## Performance / benchmarking
-
-- `benchmarking/README.md` - how benchmarks are wired and run.
-- `benchmarking/AUDIO_PROFILING.md`, `benchmarking/ALM_BENCHMARK.md` - modality
-  perf expectations. Standalone benchmark scripts should be integrated here, not
-  left freestanding.
-
-## Contribution rules
-
-- `CONTRIBUTING.md` - DCO sign-off (`git commit -s`), PR process, test
-  requirements, coverage threshold, copyright header.
-
-## GitHub data via `gh`
-
-`scripts/pr_review_pull.sh <N>` pulls these into the scratch outdir:
-
-| File | Endpoint |
-|------|----------|
-| `pr<N>_gh_latest.json` | `gh pr view <N> --json ...` |
-| `pr<N>_reviews_latest.json` | `repos/:owner/:repo/pulls/<N>/reviews` |
-| `pr<N>_review_comments_latest.json` | `repos/:owner/:repo/pulls/<N>/comments` (inline) |
-| `pr<N>_issue_comments_latest.json` | `repos/:owner/:repo/issues/<N>/comments` |
-| `pr<N>_files_latest.json` | `repos/:owner/:repo/pulls/<N>/files` |
-| `pr<N>_commits_latest.json` | `repos/:owner/:repo/pulls/<N>/commits` |
-| `pr<N>_review_threads_latest.json` | GraphQL `pullRequest.reviewThreads` (isResolved/isOutdated) |
-
-The REST inline-comments endpoint does not expose resolve/outdate state; the
-GraphQL thread payload does. The builder joins them by comment `databaseId`
-(== REST `id`), falling back to a `(path, body-prefix)` match when an older
-thread dump lacks `databaseId`.
-
-GraphQL query used by the pull script:
-
-```graphql
-query($owner:String!,$repo:String!,$pr:Int!){
-  repository(owner:$owner,name:$repo){
-    pullRequest(number:$pr){
-      reviewThreads(first:100){ nodes{
-        id isResolved isOutdated isCollapsed line originalLine path
-        comments(first:50){ nodes{ databaseId } }
-      }}}}}
-```
-
-## Scripts
-
-- `scripts/pr_review_pull.sh <N> [--outdir DIR] [--repo OWNER/REPO]` - fetch.
-- `scripts/build_digest.py <N> [--outdir DIR] [--today YYYY-MM-DD] [--prev-head SHA] [--baseline-ts TS]` - render the digest + comment queue.
-
-Default outdir: `.curator-pr-review/` (scratch; safe to delete or gitignore).

From 4de281f41863aa8bb39443a0bd7e70e7f7119083 Mon Sep 17 00:00:00 2001
From: "aaftaabv@gmail.com" <aaftaabv@gmail.com>
Date: Fri, 5 Jun 2026 14:42:17 +0530
Subject: [PATCH 03/10] Clarify review-curator-audio-pr is a reviewer tool for
 reviewing others PRs

Reframe SKILL.md, templates.md, scripts and generated digest/threads output so it is unambiguous the skill is used by a reviewer to review someone elses audio Curator PR, not by an author responding to reviewers.

Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>
---
 .../skills/review-curator-audio-pr/SKILL.md   | 132 +++++++++---------
 .../review-curator-audio-pr/reference.md      |   2 +-
 .../review-curator-audio-pr/scripts/README.md |  11 +-
 .../scripts/build_digest.py                   |  26 ++--
 .../review-curator-audio-pr/templates.md      | 104 +++++++-------
 5 files changed, 145 insertions(+), 130 deletions(-)

diff --git a/.cursor/skills/review-curator-audio-pr/SKILL.md b/.cursor/skills/review-curator-audio-pr/SKILL.md
index 646902b0fb..345bc07af7 100644
--- a/.cursor/skills/review-curator-audio-pr/SKILL.md
+++ b/.cursor/skills/review-curator-audio-pr/SKILL.md
@@ -1,57 +1,60 @@
 ---
 name: review-curator-audio-pr
 description: >-
-  Review NVIDIA-NeMo/Curator audio-modality pull requests - audio stages
-  (readers/writers, preprocessing, VAD, ASR, speaker diarization, filtering,
-  segmentation, tagging, ALM), audio tutorials, and audio benchmarks. Pulls
-  fresh PR data with the GitHub CLI, diffs the changed code, applies Curator's
-  audio stage contracts and contribution standards, and emits a structured
-  review digest plus a paste-ready comment queue. Use when a contributor or
-  maintainer asks to review, re-review, or triage an audio Curator PR (e.g.
-  "review audio PR 1967", "what's still open on the diarization PR"), or to
-  draft replies to reviewer comments on an audio PR.
+  Review someone else's NVIDIA-NeMo/Curator audio-modality pull request. This is
+  a reviewer's tool: given a PR number, it pulls the PR, diffs the changed audio
+  code, applies Curator's audio stage contracts and contribution standards,
+  shows what other reviewers have already raised (so you don't duplicate), and
+  helps you produce a structured set of review findings (P0-P3) to post as
+  review comments. Use when you are assigned or pick up an audio Curator PR and
+  need to review it (e.g. "review audio PR 1967", "do a review of this Curator
+  audio PR", "what should I flag on PR 1898"). It is NOT for a PR author
+  responding to a review.
 ---
 
 # Review NeMo Curator audio PRs
 
-Produce a defensible, reproducible review of an audio-modality
-`NVIDIA-NeMo/Curator` PR from a fresh GitHub pull plus the code diff. Audience:
-the PR author (who must respond to reviewers and land the change) and
-maintainers triaging the audio queue.
+**Use this when you are the reviewer and need to review someone else's audio
+Curator PR.** Given a PR number, it helps you understand the change, see what
+other reviewers already said, and write your own review. It does not help a PR
+author reply to reviewers - it helps you *produce* the review.
 
-This skill is scoped to the **audio** modality: code under
-`nemo_curator/stages/audio/`, `tutorials/audio/`, audio tests under
-`tests/stages/audio/`, and audio benchmarks. For non-audio PRs, the stage-
-contract lenses still apply but the audio-specific guidance below will not.
+Scope: the **audio** modality - code under `nemo_curator/stages/audio/`,
+`tutorials/audio/`, audio tests under `tests/stages/audio/`, and audio
+benchmarks. For non-audio PRs the stage-contract lenses still apply, but the
+audio-specific guidance below will not.
 
 Requirements: the GitHub CLI (`gh`) authenticated against `github.com`, and a
 local checkout of this repository (the one containing this skill).
 
-## Two deliverables (always)
+## What you produce
 
-Each review run writes two Markdown files into a scratch directory (default
-`.curator-pr-review/`, override with `--outdir`):
+Your review = a set of **findings** (P0-P3), each tied to a `path:line` on the
+PR's current head, that you post as PR review comments. The skill writes two
+helper files into a scratch directory (default `.curator-pr-review/`, override
+with `--outdir`) to support that:
 
-1. `curator_pr<N>_fresh_review_<YYYY_MM_DD>.md` - full digest: PR state,
-   commits, changed-file table, reviews, every inline comment grouped by file
-   with OPEN / OUTDATED / RESOLVED status, and issue comments.
-2. `curator_pr<N>_github_comment_queue_<YYYY_MM_DD>.md` - only the OPEN root
-   threads that need an author reply or code change, paste-ready, with stale
-   threads listed at the bottom for traceability.
+1. `curator_pr<N>_fresh_review_<YYYY_MM_DD>.md` - **your working digest**: PR
+   state, commits, changed-file table, and every existing review/comment grouped
+   by file with OPEN / OUTDATED / RESOLVED status. Read it to understand the PR
+   and to see what other reviewers already raised so you don't repeat them; add
+   your own findings at the bottom.
+2. `curator_pr<N>_github_comment_queue_<YYYY_MM_DD>.md` - **prior open threads**:
+   a condensed list of the threads other reviewers left that are still
+   unresolved on the current head. Scan it before writing your own comments.
 
-See [templates.md](templates.md) for the exact section layout. Keeping the
-structure stable makes diffs across review passes mechanical, so a re-review
-only surfaces what changed.
+See [templates.md](templates.md) for the exact layout, including how to write up
+your findings.
 
 ## Workflow
 
 ```
 - [ ] 1. Identify the PR and confirm it touches audio paths
 - [ ] 2. Pull fresh GitHub data (gh)
-- [ ] 3. Check out the PR head and diff against the base
-- [ ] 4. Read the changed code through the audio review lenses
-- [ ] 5. Generate the digest + comment queue
-- [ ] 6. Classify findings P0-P3 and (optionally) draft replies
+- [ ] 3. Check out the PR head and read the diff
+- [ ] 4. Review the changed code through the audio review lenses
+- [ ] 5. Generate the digest + prior-threads file for context
+- [ ] 6. Write up your findings (P0-P3) and post them as review comments
 ```
 
 ### Step 1 - Identify the PR
@@ -59,14 +62,11 @@ only surfaces what changed.
 `gh pr view <N> --repo NVIDIA-NeMo/Curator`. Confirm the diff touches audio
 paths (`nemo_curator/stages/audio/`, `tutorials/audio/`, `tests/stages/audio/`,
 `benchmarking/AUDIO_PROFILING.md` / `ALM_BENCHMARK.md`); if it does not, this
-audio skill is the wrong lens. If a prior review digest for this PR exists, read
-it first and record its head SHA as the baseline so commits and comments can be
-marked NEW vs already-seen.
+audio skill is the wrong lens.
 
 ### Step 2 - Pull fresh GitHub data
 
-Run the pull script (writes `pr<N>_*_latest.json` plus timestamped snapshots so
-prior pulls are preserved for delta analysis):
+Run the pull script (writes `pr<N>_*_latest.json` plus timestamped snapshots):
 
 ```bash
 .cursor/skills/review-curator-audio-pr/scripts/pr_review_pull.sh <N>
@@ -74,66 +74,70 @@ prior pulls are preserved for delta analysis):
 
 It pulls six REST endpoints (`pr view`, `reviews`, inline `comments`, issue
 `comments`, `files`, `commits`) and the GraphQL review threads, which carry the
-`isResolved` / `isOutdated` flags the REST inline endpoint omits.
+`isResolved` / `isOutdated` flags the REST inline endpoint omits. You pull this
+so you can see prior review activity, not because you own the PR.
 
-### Step 3 - Diff the code
+### Step 3 - Read the diff
 
 ```bash
 gh pr checkout <N>                       # or: git fetch origin pull/<N>/head:pr-<N>
-git diff origin/main...HEAD             # changed files on this PR
+git diff origin/main...HEAD             # the change you are reviewing
 ```
 
 Always review the actual diff, never the comment text alone. Use `main` as the
 baseline for "how the audio stages already do this".
 
-### Step 4 - Read through the audio review lenses
+### Step 4 - Review through the audio lenses
 
 Apply the lenses in [recurring-themes.md](recurring-themes.md): stage contracts,
-setup/teardown lifecycle, audio optional-dependency hygiene (vLLM, NeMo,
-model utils), secret-safe logging, waveform/tensor memory and manifest
-serialization, tarred/sharded audio I/O, sample-rate and metadata propagation,
+setup/teardown lifecycle, audio optional-dependency hygiene (vLLM, NeMo, model
+utils), secret-safe logging, waveform/tensor memory and manifest serialization,
+tarred/sharded audio I/O, sample-rate and metadata propagation,
 streaming/throughput, tutorials/docs, tests/coverage, and PR reviewability.
 
 These build on the project's always-on rules in `.cursor/rules/`
 (`processing-stage-patterns`, `composite-stage-patterns`, `task-patterns`,
 `executors`, `resources-configuration`, `pipeline-structure`,
 `modality-structure`, `coding-standards`) and the audio stage guide at
-`nemo_curator/stages/audio/README.md`. Treat those as canonical and cite them
-when a change deviates.
+`nemo_curator/stages/audio/README.md`. Cite these when the PR deviates.
 
-### Step 5 - Generate the two files
+### Step 5 - Generate the context files
 
 ```bash
 .cursor/skills/review-curator-audio-pr/scripts/build_digest.py <N> --today <YYYY-MM-DD>
 ```
 
-The builder joins the pulled JSON and classifies each comment by thread status:
-**OPEN** (actionable), **OUTDATED** (pre-dates the current head), **RESOLVED**.
-The comment queue lists only OPEN root threads.
+The builder joins the pulled JSON and classifies each existing comment by thread
+status: **OPEN** (still unresolved), **OUTDATED** (pre-dates the current head),
+**RESOLVED**. Use the OPEN list to avoid re-raising points another reviewer
+already made.
 
-### Step 6 - Findings and replies
+### Step 6 - Write up and post your findings
 
-In the digest's findings section, classify each issue by severity:
+Record each issue you found as a finding, classified by severity, and post them
+as a PR review (inline comments for specific lines, a top-level summary for
+overall verdict). Each finding should cite the exact `path:line-range` on the
+current head and propose a concrete fix.
 
 - **P0** - merge blocker: data loss / crash on a valid audio config (e.g. a
   manifest writer that crashes on a kept waveform), a secret (HF token) leaked
   into logs, or an import that breaks a supported install.
-- **P1** - fix before merge: undeclared audio runtime dependencies, missing
-  required tests/coverage, environment-specific I/O in a reusable reader, a PR
-  too large to review safely.
-- **P2** - should fix: duplicated pipeline construction, inefficient per-task
+- **P1** - should fix before merge: undeclared audio runtime dependencies,
+  missing required tests/coverage, environment-specific I/O in a reusable
+  reader, a PR too large to review safely.
+- **P2** - worth fixing: duplicated pipeline construction, inefficient per-task
   writer batching, incomplete throughput metrics, documented-but-unwired config.
 - **P3** - nice to have: temporary model/compat workarounds needing a test +
   comment, debug-only knobs (e.g. per-shard utterance caps) to document.
 
-When drafting replies, cite the exact `path:line-range` on the current head,
-state whether the point is already addressed (with the commit SHA) or will be,
-and keep each reply self-contained. See [templates.md](templates.md) section C.
+See [templates.md](templates.md) section C for how to phrase review comments.
 
-## Conventions to enforce (from CONTRIBUTING.md and rules)
+## What to check the PR against (CONTRIBUTING.md and rules)
 
-- **DCO sign-off** is required on every commit (`git commit -s`).
-- **Tests** must accompany source changes; GPU-only audio tests use
+A compliant audio PR must satisfy these; flag any that are missing:
+
+- **DCO sign-off** on every commit (`git commit -s`).
+- **Tests** accompany source changes; GPU-only audio tests use
   `@pytest.mark.gpu`; the project enforces 80% coverage under `nemo_curator/`,
   and `tests/stages/audio/` mirrors `nemo_curator/stages/audio/`.
 - **NVIDIA Apache-2.0 copyright header** on every non-empty Python file.
@@ -151,5 +155,5 @@ Full index in [reference.md](reference.md). Quick map:
 | Audio tutorials | `tutorials/audio/` |
 | Backends / executors | `nemo_curator/backends/`, `nemo_curator/pipeline/pipeline.py` |
 | Audio perf expectations | `benchmarking/AUDIO_PROFILING.md`, `benchmarking/ALM_BENCHMARK.md` |
-| Contribution rules | `CONTRIBUTING.md` |
+| Contribution rules to check against | `CONTRIBUTING.md` |
 | Live PR data | `gh` (see scripts) |
diff --git a/.cursor/skills/review-curator-audio-pr/reference.md b/.cursor/skills/review-curator-audio-pr/reference.md
index b6ccf9ee7a..42aac6f78f 100644
--- a/.cursor/skills/review-curator-audio-pr/reference.md
+++ b/.cursor/skills/review-curator-audio-pr/reference.md
@@ -100,6 +100,6 @@ thread dump lacks `databaseId`.
 ## Scripts
 
 - `scripts/pr_review_pull.sh <N> [--outdir DIR] [--repo OWNER/REPO]` - fetch.
-- `scripts/build_digest.py <N> [--outdir DIR] [--today YYYY-MM-DD] [--prev-head SHA] [--baseline-ts TS]` - render the digest + comment queue.
+- `scripts/build_digest.py <N> [--outdir DIR] [--today YYYY-MM-DD] [--prev-head SHA] [--baseline-ts TS]` - render the working digest + the prior-open-threads context file (you add your own findings).
 
 Default outdir: `.curator-pr-review/` (scratch; safe to delete or gitignore).
diff --git a/.cursor/skills/review-curator-audio-pr/scripts/README.md b/.cursor/skills/review-curator-audio-pr/scripts/README.md
index b729dee15c..d270082407 100644
--- a/.cursor/skills/review-curator-audio-pr/scripts/README.md
+++ b/.cursor/skills/review-curator-audio-pr/scripts/README.md
@@ -1,5 +1,9 @@
 # Scripts - review-curator-audio-pr
 
+Helpers for a **reviewer** reviewing someone else's audio Curator PR. They
+pull the PR and summarise existing review activity as context; you write the
+actual review findings.
+
 Both require the GitHub CLI (`gh`) authenticated against github.com. They write
 to a scratch directory (default `.curator-pr-review/`) that is safe to delete;
 do not commit its contents.
@@ -20,8 +24,11 @@ Pulls six REST endpoints (`pr view`, `reviews`, inline `comments`, issue
 .cursor/skills/review-curator-audio-pr/scripts/build_digest.py <PR_NUMBER> [--outdir DIR] [--today YYYY-MM-DD] [--prev-head SHA] [--baseline-ts TS]
 ```
 
-Joins the `pr<N>_*_latest.json` files and writes
-`curator_pr<N>_fresh_review_<date>.md` + `curator_pr<N>_github_comment_queue_<date>.md`.
+Joins the `pr<N>_*_latest.json` files and writes two context files:
+`curator_pr<N>_fresh_review_<date>.md` (working digest: PR state, diff, and
+existing reviews/comments with OPEN/OUTDATED/RESOLVED status, plus a placeholder
+for your findings) and `curator_pr<N>_github_comment_queue_<date>.md` (the
+still-open threads other reviewers left, so you don't duplicate them).
 
 - `--prev-head` sets the SHA in the "commits since prior reviewed head" header.
 - `--baseline-ts <TS>` (a prior pull's timestamp suffix) marks
diff --git a/.cursor/skills/review-curator-audio-pr/scripts/build_digest.py b/.cursor/skills/review-curator-audio-pr/scripts/build_digest.py
index 42f654d4e9..8c129103e1 100755
--- a/.cursor/skills/review-curator-audio-pr/scripts/build_digest.py
+++ b/.cursor/skills/review-curator-audio-pr/scripts/build_digest.py
@@ -160,7 +160,7 @@ def base(kind):
         p(d, f"| `{f['filename']}` | +{f['additions']} -{f['deletions']} | {f['status']} |")
     p(d, "")
 
-    p(d, "## Reviews", "")
+    p(d, "## Existing reviews (by other reviewers)", "")
     review_by_id = {r["id"]: r for r in reviews}
     for r in sorted(reviews, key=lambda x: x.get("submitted_at") or ""):
         marker = " (NEW)" if base_review_ids and r["id"] not in base_review_ids else ""
@@ -172,7 +172,7 @@ def base(kind):
             p(d, "", shorten(body, 1500))
         p(d, "")
 
-    p(d, "## Inline review comments", "")
+    p(d, "## Existing inline comments (by other reviewers)", "")
     p(d, f"Total: {len(review_comments)} comments across {idx['nthreads'] or '?'} threads.", "")
     by_status = {"OPEN": [], "OUTDATED": [], "RESOLVED": [], "ORPHAN": []}
     for c in review_comments:
@@ -205,24 +205,32 @@ def base(kind):
             p(d, f"  > {body}", "")
     p(d, "")
 
-    p(d, "## Top-level issue comments", "")
+    p(d, "## Existing issue comments (by other reviewers)", "")
     for c in sorted(issue_comments, key=lambda x: x["created_at"]):
         new = " NEW" if base_issue_ids and c["id"] not in base_issue_ids else ""
         p(d, f"### #{c['id']} by @{c['user']['login']}  {c['created_at']}{new}", "")
         p(d, shorten(c.get("body") or "", 1500), "")
 
+    p(d, "## My findings (your review)", "")
+    p(d, "_Add your findings here as you review. Classify each P0-P3, cite "
+         "path:line on the current head, and propose a concrete fix. See "
+         "templates.md section C._", "")
+    p(d, "## Verdict", "")
+    p(d, "_APPROVE / COMMENT / REQUEST CHANGES + blockers._", "")
+
     digest_path = outdir / f"curator_pr{pr}_fresh_review_{date_us}.md"
     digest_path.write_text("\n".join(d) + "\n", encoding="utf-8")
     print(f"wrote {digest_path}  ({digest_path.stat().st_size} bytes)")
 
     # ----- COMMENT QUEUE (open root threads only) -----
     q = []
-    p(q, f"# Curator PR {pr} GitHub Comment Queue - {today}", "")
+    p(q, f"# Curator PR {pr} Open Review Threads - {today}", "")
     p(q, f"Review target: https://github.com/NVIDIA-NeMo/Curator/pull/{pr}", "")
     p(q, f"Current PR head: `{head_oid}`", "")
-    p(q, "Open inline review threads from the most recent reviewer pass that need an "
-         "author response or a code change. Outdated/resolved threads are listed at "
-         "the bottom for traceability only.", "")
+    p(q, "Threads other reviewers left that are still unresolved on the current "
+         "head. Scan these before adding your own comments so you do not duplicate, "
+         "and check whether the author addressed them. Stale (outdated/resolved) "
+         "threads are listed at the bottom for context only.", "")
 
     def is_root(c):
         return (thread_meta(idx, c) or {}).get("is_first_in_thread", True)
@@ -233,10 +241,10 @@ def is_root(c):
         meta = thread_meta(idx, c) or {}
         line = c.get("line") or c.get("original_line")
         new = " NEW" if base_comment_ids and c["id"] not in base_comment_ids else ""
-        p(q, f"## Comment {i} - {c['path']}:{line} by @{c['user']['login']}{new}", "")
+        p(q, f"## Thread {i} - {c['path']}:{line} by @{c['user']['login']}{new}", "")
         p(q, f"Thread: {c['html_url']}", "")
         if meta.get("thread_comment_count", 1) > 1:
-            p(q, f"Thread has {meta['thread_comment_count']} comments - see digest for replies.", "")
+            p(q, f"Thread has {meta['thread_comment_count']} comments - see the digest for the full thread.", "")
         p(q, f"File: `{c['path']}`", "")
         p(q, f"Line: `{line}`", "")
         p(q, "Reviewer text:", "")
diff --git a/.cursor/skills/review-curator-audio-pr/templates.md b/.cursor/skills/review-curator-audio-pr/templates.md
index 0cc6a86db6..f789a09aad 100644
--- a/.cursor/skills/review-curator-audio-pr/templates.md
+++ b/.cursor/skills/review-curator-audio-pr/templates.md
@@ -1,13 +1,15 @@
 # Output templates
 
-`scripts/build_digest.py` emits both files automatically; these templates show
-the exact structure so hand edits and the added findings/verdict narrative stay
-consistent. Replace `<...>` placeholders. Keep the structure stable across
-review passes so re-reviews diff cleanly.
+You are the reviewer. `scripts/build_digest.py` generates files A and B as
+context; you author the findings (section C) and post them as your review.
+Replace `<...>` placeholders. Keep the structure stable across review passes so
+a re-review diffs cleanly and only surfaces what changed.
 
-## A. Fresh review digest
+## A. Working digest (generated, then you append findings)
 
-File: `curator_pr<N>_fresh_review_<YYYY_MM_DD>.md`
+File: `curator_pr<N>_fresh_review_<YYYY_MM_DD>.md`. The generated part gives you
+PR state, the diff, and all existing review activity. Add your own **Findings**
+section at the end - that is your review.
 
 ```markdown
 # Curator PR <N> Fresh Review - <YYYY-MM-DD>
@@ -18,117 +20,111 @@ Current PR head reviewed: `<headRefOid>`
 
 Base recorded by GitHub metadata: `<baseRefOid>`
 
-Previous reviewed head: `<prior_head_sha>`   # if a prior review exists
-
 ## PR state at review time
 
 | Field | Value |
 |---|---|
 | state | <OPEN/CLOSED/MERGED> |
-| isDraft | <bool> |
 | reviewDecision | <APPROVED/CHANGES_REQUESTED/none> |
-| mergeable / mergeStateStatus | <...> / <...> |
 | changedFiles | <n> |
 | additions / deletions | +<a> / -<d> |
 | commits | <n> |
 | inline review comments total | <n> |
 | reviews submitted | <n> |
-| top-level issue comments | <n> |
 | updatedAt | <iso> |
 
-## Commits since prior reviewed head `<prior8>` (<k> new)
+## Commits (<k>)
 
 ```
 <sha8>  <subject>  (<date>)
 ```
 
-Files changed in current PR head (`git diff <base>..<head>`):
+Files changed (`git diff <base>..<head>`):
 
 | File | +/- | status |
 |---|---|---|
 | `<path>` | +<a> -<d> | <added/modified> |
 
-## Reviews
+## Existing reviews (by other reviewers)
 
 ### #<id> by @<login>  state=<STATE>  commit=<sha8>  submitted=<iso>
 <body or empty>
 
-## Inline review comments
+## Existing inline comments (by other reviewers)
 
 Total: <n> comments across <m> threads.
 
-- OPEN: <n>
+- OPEN: <n>     # still unresolved - do NOT duplicate these
 - OUTDATED: <n>
 - RESOLVED: <n>
-- ORPHAN: <n>
 
 ### `<path>`
 
-- **#<id>** @<login> <iso>  line=<L> commit=<sha8> status=**<OPEN/OUTDATED/RESOLVED>**  thread_comments=<k> (<root/reply>) review_state=<STATE>
+- **#<id>** @<login> <iso>  line=<L> status=**<OPEN/OUTDATED/RESOLVED>**  ...
   url: <html_url>
-  > <body, truncated, each line prefixed with "> ">
-
-## Top-level issue comments
-
-### #<id> by @<login>  <iso>
-<body, truncated>
+  > <body, truncated>
 
-## Findings (reviewer analysis)
+## My findings (your review)
 
 ### P0: <title>
-<evidence: path:line on current head> - <why blocker> - <recommended fix + test>
+<evidence: path:line on current head> - <why it's a blocker> - <concrete fix>
 
 ### P1 / P2 / P3: <title>
 <...>
 
 ## Verdict
-<merge-ready or not, and the blockers>
+<APPROVE / COMMENT / REQUEST CHANGES, and the blockers if any>
 ```
 
-## B. GitHub comment queue
+## B. Prior open threads (generated context)
 
-File: `curator_pr<N>_github_comment_queue_<YYYY_MM_DD>.md`. Only OPEN root
-threads needing an author response or code change; stale threads at the bottom.
+File: `curator_pr<N>_github_comment_queue_<YYYY_MM_DD>.md`. The still-unresolved
+threads other reviewers already opened, so you can scan prior feedback before
+adding your own. This is **context, not your output** - you are not replying to
+these as the author; you read them to avoid duplicating and to check whether the
+author addressed them.
 
 ```markdown
-# Curator PR <N> GitHub Comment Queue - <YYYY-MM-DD>
-
-Review target: https://github.com/NVIDIA-NeMo/Curator/pull/<N>
+# Curator PR <N> Open Review Threads - <YYYY-MM-DD>
 
 Current PR head: `<headRefOid>`
 
-## Comment <i> - <path>:<line> by @<login>
+Threads other reviewers left that are still unresolved on the current head.
 
-Thread: <html_url>
+## Thread <i> - <path>:<line> by @<login>
 
-File: `<path>`
-
-Line: `<line>`
-
-Reviewer text:
+Link: <html_url>
 
 ```text
 <verbatim reviewer body>
 ```
 
-## Stale (outdated/resolved) comments
-
+## Stale (outdated/resolved) threads
 - [<OUTDATED/RESOLVED>] `<path>:<line>` @<login> (<html_url>): <one-line body>
 ```
 
-## C. Reply drafting style
+## C. Writing your review comments
+
+Turn each finding into a review comment you post on the PR. For each:
 
-For each OPEN comment the user wants a paste-ready reply for:
+- Anchor it to the exact `path:line-range` on the current head.
+- Lead with severity (P0-P3) and a one-line statement of the problem.
+- Explain *why* it matters (cite a `.cursor/rules/*.mdc` contract or
+  `stages/audio/README.md` when the change deviates).
+- Propose a concrete fix, ideally with a `suggestion` block.
+- Before posting, check the "OPEN" list in file A - if another reviewer already
+  raised it, don't repeat it (optionally +1 their thread instead).
 
-- Acknowledge briefly.
-- State status: already addressed (cite the commit SHA) or will-fix.
-- Cite the exact `path:line-range` on the current head as evidence.
-- Describe the concrete mechanism, not just "fixed".
-- Keep each reply self-contained; the reviewer reads it in isolation.
+Example review comment:
 
-Example:
+> **P0 (blocker).** `sharded_manifest_writer.py:101` calls `json.dumps(record)`
+> while `record` can still hold the waveform `ndarray` when `keep_waveform=True`,
+> which raises `TypeError: Object of type ndarray is not JSON serializable` and
+> fails the whole shard. Drop the waveform key before serialising and guard
+> `json.dumps` so a stray non-serialisable value names the offending key.
+> Please also add a test under `tests/stages/audio/io/` covering
+> `keep_waveform=True`.
 
-> Thanks for flagging. Addressed on the current head (`<sha8>`): the writer now
-> drops the waveform key before serialisation and guards `json.dumps`, so a
-> non-serialisable value reports the offending key instead of failing the batch.
-> Citation: `nemo_curator/stages/audio/io/sharded_manifest_writer.py:96-111`.
+Then post the set as one PR review: inline comments for line-specific findings,
+a top-level summary carrying the overall verdict (APPROVE / COMMENT / REQUEST
+CHANGES).

From eceb1c992d73bb7aeb965e7babbbcffb66246d2a Mon Sep 17 00:00:00 2001
From: "aaftaabv@gmail.com" <aaftaabv@gmail.com>
Date: Fri, 5 Jun 2026 15:08:54 +0530
Subject: [PATCH 04/10] Add repo auto-detect, reviewer prompt, audio knowledge
 sources, and post-1608 PR corpus

ensure_repo.sh reuses an existing checkout or shallow-clones (no full history). SKILL.md gains an explicit human-reviewer prompt and a repo-locate step. recurring-themes.md and reference.md are merged into knowledge-sources.md, with each review lens linked to the audio code, README section, .cursor/rules contract, and the developer-guide slides. Adds pull_audio_pr_corpus.sh + build_corpus.py to consolidate reviewer feedback from audio PRs after 1608 as pre-review context.

Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>
---
 .../skills/review-curator-audio-pr/SKILL.md   | 127 ++++---
 .../knowledge-sources.md                      | 309 ++++++++++++++++++
 .../recurring-themes.md                       | 142 --------
 .../review-curator-audio-pr/reference.md      | 105 ------
 .../review-curator-audio-pr/scripts/README.md |  43 ++-
 .../scripts/build_corpus.py                   | 159 +++++++++
 .../scripts/ensure_repo.sh                    |  57 ++++
 .../scripts/pull_audio_pr_corpus.sh           |  84 +++++
 8 files changed, 718 insertions(+), 308 deletions(-)
 create mode 100644 .cursor/skills/review-curator-audio-pr/knowledge-sources.md
 delete mode 100644 .cursor/skills/review-curator-audio-pr/recurring-themes.md
 delete mode 100644 .cursor/skills/review-curator-audio-pr/reference.md
 create mode 100755 .cursor/skills/review-curator-audio-pr/scripts/build_corpus.py
 create mode 100755 .cursor/skills/review-curator-audio-pr/scripts/ensure_repo.sh
 create mode 100755 .cursor/skills/review-curator-audio-pr/scripts/pull_audio_pr_corpus.sh

diff --git a/.cursor/skills/review-curator-audio-pr/SKILL.md b/.cursor/skills/review-curator-audio-pr/SKILL.md
index 345bc07af7..30078000b0 100644
--- a/.cursor/skills/review-curator-audio-pr/SKILL.md
+++ b/.cursor/skills/review-curator-audio-pr/SKILL.md
@@ -2,14 +2,14 @@
 name: review-curator-audio-pr
 description: >-
   Review someone else's NVIDIA-NeMo/Curator audio-modality pull request. This is
-  a reviewer's tool: given a PR number, it pulls the PR, diffs the changed audio
-  code, applies Curator's audio stage contracts and contribution standards,
-  shows what other reviewers have already raised (so you don't duplicate), and
-  helps you produce a structured set of review findings (P0-P3) to post as
-  review comments. Use when you are assigned or pick up an audio Curator PR and
-  need to review it (e.g. "review audio PR 1967", "do a review of this Curator
-  audio PR", "what should I flag on PR 1898"). It is NOT for a PR author
-  responding to a review.
+  a reviewer's tool: given a PR number, it locates (or shallow-clones) the repo,
+  pulls the PR, diffs the changed audio code, applies Curator's audio stage
+  contracts and contribution standards, shows what other reviewers have already
+  raised (so you don't duplicate), and helps you produce a structured set of
+  review findings (P0-P3) to post as review comments. Use when you are assigned
+  or pick up an audio Curator PR and need to review it (e.g. "review audio PR
+  1967", "do a review of this Curator audio PR", "what should I flag on PR
+  1898"). It is NOT for a PR author responding to a review.
 ---
 
 # Review NeMo Curator audio PRs
@@ -24,8 +24,26 @@ Scope: the **audio** modality - code under `nemo_curator/stages/audio/`,
 benchmarks. For non-audio PRs the stage-contract lenses still apply, but the
 audio-specific guidance below will not.
 
-Requirements: the GitHub CLI (`gh`) authenticated against `github.com`, and a
-local checkout of this repository (the one containing this skill).
+Requirements: the GitHub CLI (`gh`) authenticated against `github.com`, and
+`git`. You do **not** need to pre-clone the repo - step 0 reuses any checkout
+you already have and only shallow-clones
+[NVIDIA-NeMo/Curator](https://github.com/NVIDIA-NeMo/Curator) if none is found.
+
+## The reviewer prompt
+
+A human reviewer invokes this skill with a prompt like:
+
+> **"Review audio Curator PR <N> with the review-curator-audio-pr skill. Find or
+> shallow-clone the repo, pull the PR and prior review activity, apply the audio
+> review lenses, and give me a review digest plus my own findings (P0-P3) with
+> exact `path:line` evidence and concrete fixes, ready to post as PR comments.
+> Skip anything other reviewers already raised."**
+
+Variants: *"re-review PR <N> and show only what changed since the last review"*,
+*"what should I flag on the diarization PR <N>?"*, *"triage open audio PRs and
+review the smallest one first"*. First-time in a repo, you can prime context
+with *"build the post-#1608 audio review corpus first, then review PR <N>"*
+(see [knowledge-sources.md](knowledge-sources.md) section 4).
 
 ## What you produce
 
@@ -36,70 +54,80 @@ with `--outdir`) to support that:
 
 1. `curator_pr<N>_fresh_review_<YYYY_MM_DD>.md` - **your working digest**: PR
    state, commits, changed-file table, and every existing review/comment grouped
-   by file with OPEN / OUTDATED / RESOLVED status. Read it to understand the PR
-   and to see what other reviewers already raised so you don't repeat them; add
-   your own findings at the bottom.
+   by file with OPEN / OUTDATED / RESOLVED status, plus a placeholder for your
+   findings. Read it to understand the PR and to see what other reviewers already
+   raised so you don't repeat them; record your findings at the bottom.
 2. `curator_pr<N>_github_comment_queue_<YYYY_MM_DD>.md` - **prior open threads**:
    a condensed list of the threads other reviewers left that are still
    unresolved on the current head. Scan it before writing your own comments.
 
-See [templates.md](templates.md) for the exact layout, including how to write up
-your findings.
+See [templates.md](templates.md) for the exact layout and how to phrase your
+review comments. The review lenses and every doc/code reference live in
+[knowledge-sources.md](knowledge-sources.md).
 
 ## Workflow
 
 ```
+- [ ] 0. Locate or shallow-clone the Curator repo
 - [ ] 1. Identify the PR and confirm it touches audio paths
 - [ ] 2. Pull fresh GitHub data (gh)
-- [ ] 3. Check out the PR head and read the diff
+- [ ] 3. Read the diff
 - [ ] 4. Review the changed code through the audio review lenses
 - [ ] 5. Generate the digest + prior-threads file for context
 - [ ] 6. Write up your findings (P0-P3) and post them as review comments
 ```
 
+### Step 0 - Locate or shallow-clone the repo
+
+Don't clone unnecessarily. This helper checks whether you are already inside a
+Curator checkout, searches the current directory tree for one, and only then
+shallow-clones (`--depth 1`, no full history):
+
+```bash
+eval "$(scripts/ensure_repo.sh | tail -1)"   # sets CURATOR_REPO=<path>
+cd "$CURATOR_REPO"
+```
+
+If you are already in the checkout that contains this skill, you can skip this.
+
 ### Step 1 - Identify the PR
 
 `gh pr view <N> --repo NVIDIA-NeMo/Curator`. Confirm the diff touches audio
-paths (`nemo_curator/stages/audio/`, `tutorials/audio/`, `tests/stages/audio/`,
-`benchmarking/AUDIO_PROFILING.md` / `ALM_BENCHMARK.md`); if it does not, this
-audio skill is the wrong lens.
+paths (`nemo_curator/stages/audio/`, `nemo_curator/tasks/audio_task.py`,
+`tutorials/audio/`, `tests/stages/audio/`, audio benchmarks); if it does not,
+this audio skill is the wrong lens.
 
 ### Step 2 - Pull fresh GitHub data
 
-Run the pull script (writes `pr<N>_*_latest.json` plus timestamped snapshots):
-
 ```bash
 .cursor/skills/review-curator-audio-pr/scripts/pr_review_pull.sh <N>
 ```
 
-It pulls six REST endpoints (`pr view`, `reviews`, inline `comments`, issue
+Pulls six REST endpoints (`pr view`, `reviews`, inline `comments`, issue
 `comments`, `files`, `commits`) and the GraphQL review threads, which carry the
 `isResolved` / `isOutdated` flags the REST inline endpoint omits. You pull this
-so you can see prior review activity, not because you own the PR.
+to see prior review activity, not because you own the PR.
 
 ### Step 3 - Read the diff
 
 ```bash
-gh pr checkout <N>                       # or: git fetch origin pull/<N>/head:pr-<N>
-git diff origin/main...HEAD             # the change you are reviewing
+gh pr diff <N> --repo NVIDIA-NeMo/Curator        # works on a shallow clone
+# optional, to run/inspect locally: gh pr checkout <N>
 ```
 
-Always review the actual diff, never the comment text alone. Use `main` as the
-baseline for "how the audio stages already do this".
+`gh pr diff` fetches the patch straight from GitHub, so it works even on the
+shallow clone from step 0. Always review the actual diff, never the comment text
+alone. Use `main` as the baseline for "how the audio stages already do this".
 
 ### Step 4 - Review through the audio lenses
 
-Apply the lenses in [recurring-themes.md](recurring-themes.md): stage contracts,
-setup/teardown lifecycle, audio optional-dependency hygiene (vLLM, NeMo, model
-utils), secret-safe logging, waveform/tensor memory and manifest serialization,
-tarred/sharded audio I/O, sample-rate and metadata propagation,
-streaming/throughput, tutorials/docs, tests/coverage, and PR reviewability.
-
-These build on the project's always-on rules in `.cursor/rules/`
-(`processing-stage-patterns`, `composite-stage-patterns`, `task-patterns`,
-`executors`, `resources-configuration`, `pipeline-structure`,
-`modality-structure`, `coding-standards`) and the audio stage guide at
-`nemo_curator/stages/audio/README.md`. Cite these when the PR deviates.
+Apply the lenses in [knowledge-sources.md](knowledge-sources.md) section 2 -
+each lens links the audio code, README section, and `.cursor/rules` contract it
+governs: stage contracts, setup/teardown lifecycle, audio optional-dependency
+hygiene (vLLM, NeMo, model utils), secret-safe logging, waveform/tensor memory
+and manifest serialization, tarred/sharded audio I/O, sample-rate and metadata
+propagation, streaming/throughput, tutorials/docs, tests/coverage, and PR
+reviewability. Cite the linked source by name whenever the PR deviates.
 
 ### Step 5 - Generate the context files
 
@@ -114,10 +142,10 @@ already made.
 
 ### Step 6 - Write up and post your findings
 
-Record each issue you found as a finding, classified by severity, and post them
-as a PR review (inline comments for specific lines, a top-level summary for
-overall verdict). Each finding should cite the exact `path:line-range` on the
-current head and propose a concrete fix.
+Record each issue as a finding, classified by severity, and post them as a PR
+review (inline comments for specific lines, a top-level summary for the overall
+verdict). Each finding cites the exact `path:line-range` on the current head and
+proposes a concrete fix.
 
 - **P0** - merge blocker: data loss / crash on a valid audio config (e.g. a
   manifest writer that crashes on a kept waveform), a secret (HF token) leaked
@@ -146,14 +174,7 @@ A compliant audio PR must satisfy these; flag any that are missing:
 
 ## Knowledge sources
 
-Full index in [reference.md](reference.md). Quick map:
-
-| Need | Source |
-|------|--------|
-| Audio stage guide | `nemo_curator/stages/audio/README.md` |
-| Stage / pipeline contracts | `.cursor/rules/*.mdc`, `nemo_curator/stages/base.py` |
-| Audio tutorials | `tutorials/audio/` |
-| Backends / executors | `nemo_curator/backends/`, `nemo_curator/pipeline/pipeline.py` |
-| Audio perf expectations | `benchmarking/AUDIO_PROFILING.md`, `benchmarking/ALM_BENCHMARK.md` |
-| Contribution rules to check against | `CONTRIBUTING.md` |
-| Live PR data | `gh` (see scripts) |
+[knowledge-sources.md](knowledge-sources.md) is the single index: canonical docs
+(audio README, developer-guide slides, `.cursor/rules`, tutorials, benchmarks),
+the audio code map, the review lenses with per-concept code references, the
+post-#1608 PR corpus workflow, and the GitHub/`gh` data reference.
diff --git a/.cursor/skills/review-curator-audio-pr/knowledge-sources.md b/.cursor/skills/review-curator-audio-pr/knowledge-sources.md
new file mode 100644
index 0000000000..cb76080537
--- /dev/null
+++ b/.cursor/skills/review-curator-audio-pr/knowledge-sources.md
@@ -0,0 +1,309 @@
+# Audio PR review knowledge sources
+
+Read this before reviewing. It is the single index for an audio Curator review
+and has five parts:
+
+0. **Canonical docs** - the authoritative references to ground every finding.
+1. **Audio code map** - where each audio stage lives.
+2. **Review lenses** - what to check, each linked to the audio code, README
+   section, and `.cursor/rules` contract it governs.
+3. **Severity mapping** - P0-P3.
+4. **Pre-review corpus** - pull every audio PR after #1608 (open + closed) with
+   reviewer comments, so you learn the recurring feedback before you review.
+5. **GitHub data + scripts reference** - the `gh` endpoints and helper scripts.
+
+All paths are repo-relative. The skill is self-contained: this repository plus
+the GitHub CLI (`gh`). Scope: the **audio** modality.
+
+---
+
+## 0. Canonical docs (start here)
+
+| Source | What it gives you |
+|--------|-------------------|
+| `nemo_curator/stages/audio/README.md` | The audio stage **developer guide** (1000+ lines): CPU vs GPU stages, `process` / `process_batch`, `batch_size`, call chains, per-stage memory characteristics, end-to-end FLEURS & ALM traces, new-stage checklist. Read first. |
+| Audio curation developer guide (slides): <https://docs.google.com/presentation/d/15lJHyAoRTbNFDWq8UJ6czMaUFck3WYPCRaMYPfO_qew/edit> | Design intent and pipeline overviews behind the audio modality (the deck the README operationalizes). |
+| `nemo_curator/tasks/audio_task.py` | `AudioTask(Task[dict])` + `_AttrDict` task model (one manifest entry per task; dict keys exposed as attributes for `validate_input`). |
+| `nemo_curator/stages/base.py` | `ProcessingStage` contract: `inputs()`/`outputs()`, `process`/`process_batch`, `setup_on_node`/`setup`/`teardown`, `validate_input`. |
+| `CONTRIBUTING.md` | DCO sign-off, PR process, required tests, coverage threshold, copyright header. |
+| `.cursor/rules/*.mdc` | Always-on engineering contracts (table below). |
+| `tutorials/audio/README.md` + per-pipeline READMEs | Runnable reference pipelines (alm, audio_pretrain, callhome_diar, fleurs, readspeech, single_speaker_filter, tagging). |
+| `benchmarking/README.md`, `benchmarking/AUDIO_PROFILING.md`, `benchmarking/ALM_BENCHMARK.md` | How audio/ALM benchmarks are wired and the perf expectations to hold a PR to. |
+| `tests/stages/audio/` | Test layout that mirrors `nemo_curator/stages/audio/`; the coverage bar. |
+
+### `.cursor/rules/` contracts
+
+| Rule | Covers |
+|------|--------|
+| `.cursor/rules/processing-stage-patterns.mdc` | `ProcessingStage` subclassing, `inputs()`/`outputs()`, `process()`, lifecycle hooks |
+| `.cursor/rules/composite-stage-patterns.mdc` | `CompositeStage` decomposition (e.g. a reader = discovery + shard fan-out) |
+| `.cursor/rules/task-patterns.mdc` | `Task` / typed task subclasses, metadata flow |
+| `.cursor/rules/executors.mdc` | Xenna / Ray Data backends and when to use each |
+| `.cursor/rules/resources-configuration.mdc` | `Resources`, `batch_size`, GPU allocation |
+| `.cursor/rules/pipeline-structure.mdc` | Pipeline assembly |
+| `.cursor/rules/modality-structure.mdc` | Where modality code/tests/tutorials live |
+| `.cursor/rules/coding-standards.mdc` | Ruff, copyright header, loguru, pytest, Python versions |
+
+---
+
+## 1. Audio code map (diff ground truth)
+
+`nemo_curator/stages/audio/` is organized as:
+
+| Area | Path(s) |
+|------|---------|
+| Task model | `nemo_curator/tasks/audio_task.py` |
+| Base contract | `nemo_curator/stages/base.py` |
+| Common CPU stages (duration, value filter) | `nemo_curator/stages/audio/common.py` |
+| I/O (convert to DocumentBatch, segment extraction) | `nemo_curator/stages/audio/io/convert.py`, `nemo_curator/stages/audio/io/extract_segments.py` |
+| Preprocessing | `nemo_curator/stages/audio/preprocessing/mono_conversion.py`, `nemo_curator/stages/audio/preprocessing/concatenation.py` |
+| Inference - ASR | `nemo_curator/stages/audio/inference/asr/asr_nemo.py` |
+| Inference - VAD | `nemo_curator/stages/audio/inference/vad/whisperx_vad.py` |
+| Inference - speaker diarization | `nemo_curator/stages/audio/inference/speaker_diarization/sortformer.py`, `nemo_curator/stages/audio/inference/speaker_diarization/pyannote.py` |
+| Filtering (band, SigMOS, UTMOS) | `nemo_curator/stages/audio/filtering/band.py`, `nemo_curator/stages/audio/filtering/sigmos.py`, `nemo_curator/stages/audio/filtering/utmos.py` |
+| Segmentation | `nemo_curator/stages/audio/segmentation/vad_segmentation.py`, `nemo_curator/stages/audio/segmentation/speaker_separation.py` |
+| Tagging (inference + text) | `nemo_curator/stages/audio/tagging/inference/nemo_asr_align.py`, `nemo_curator/stages/audio/tagging/merge_alignment_diarization.py`, `nemo_curator/stages/audio/tagging/resample_audio.py`, `nemo_curator/stages/audio/tagging/split.py`, `nemo_curator/stages/audio/tagging/text/itn.py` |
+| Postprocessing | `nemo_curator/stages/audio/postprocessing/timestamp_mapper.py` |
+| Metrics / throughput | `nemo_curator/stages/audio/metrics/wer.py`, `nemo_curator/stages/audio/metrics/bandwidth.py`, `nemo_curator/stages/audio/metrics/squim.py` |
+| Datasets (FLEURS, ReadSpeech) | `nemo_curator/stages/audio/datasets/fleurs/create_initial_manifest.py`, `nemo_curator/stages/audio/datasets/readspeech/create_initial_manifest.py`, `nemo_curator/stages/audio/datasets/file_utils.py` |
+| ALM (audio-language model data) | `nemo_curator/stages/audio/alm/alm_data_builder.py`, `nemo_curator/stages/audio/alm/alm_data_overlap.py`, `nemo_curator/stages/audio/alm/pretrain/pipeline.py`, `nemo_curator/stages/audio/alm/pretrain/io.py` |
+| Composite pipelines | `nemo_curator/stages/audio/advanced_pipelines/audio_data_filter/audio_data_filter.py` |
+
+Backends (how stages run): `nemo_curator/backends/` (`base.py`, `xenna/`,
+`ray_data/`, `ray_actor_pool/`); pipeline assembly:
+`nemo_curator/pipeline/pipeline.py`; task types: `nemo_curator/tasks/`.
+
+---
+
+## 2. Review lenses (each with code + doc references)
+
+Apply each lens to the diff and cite exact `path:line` evidence. When a change
+violates a contract, link it by name. The **References** under each lens are the
+canonical sources to cite.
+
+### 2.1 Stage contracts and reuse
+
+- An audio stage subclasses `ProcessingStage[AudioTask, AudioTask]` (or the
+  appropriate in/out task types) with explicit `inputs()` / `outputs()`, a
+  meaningful `name`, and appropriate `resources` / `batch_size` (set as class
+  attributes / dataclass fields, never the read-only `_name` / `_resources` /
+  `_batch_size` properties). Field order follows the base convention: `name`
+  first, then params, then `resources`, then `batch_size` last.
+- GPU/model work (ASR, VAD, diarization, tagging inference) belongs in
+  `process_batch()` with a real `batch_size`; `__init__` stays lightweight (it
+  runs on the driver). Batch-only stages raise `NotImplementedError` from
+  `process()` (the convention used by dedup stages).
+- Reuse existing audio readers, manifest writers, and shared model abstractions
+  instead of bespoke orchestration. Model a tarred/sharded reader as a
+  `CompositeStage` = a discovery stage + a shard-reader fan-out.
+- If `inputs()` does not declare a required key (e.g. `audio_filepath_key`,
+  `duration_key`), runtime validation won't check it, and a direct
+  `task.data[key]` access then raises an opaque `KeyError` instead of a clear
+  validation error. Declare required keys.
+- Fan-out shard readers need deterministic child IDs and must propagate
+  `_metadata` / `_stage_perf`; declare `ray_stage_spec()` with
+  `IS_FANOUT_STAGE: True`. First-stage shard discovery from an empty input
+  should constrain placement (e.g. one worker per node) so discovery isn't
+  duplicated across workers.
+- **References:** `nemo_curator/stages/base.py`; `nemo_curator/tasks/audio_task.py`; `nemo_curator/stages/audio/common.py`; `nemo_curator/stages/audio/io/convert.py`; `nemo_curator/stages/audio/datasets/fleurs/create_initial_manifest.py`; rules `processing-stage-patterns.mdc`, `composite-stage-patterns.mdc`, `task-patterns.mdc`; README sections "Writing a CPU stage" / "Writing a GPU stage".
+
+### 2.2 Setup / teardown lifecycle and audio dependencies
+
+- `setup_on_node()` = once-per-node prep (model download / shared cache / output
+  truncation). `setup()` = per-worker init (model load, GPU allocation).
+  `teardown()` frees GPU memory. Don't conflate them; prefer `setup_on_node` for
+  anything that must run exactly once (avoids `_setup_done`-style guards).
+- Import heavy/optional audio deps (NeMo, vLLM, whisperx, pyannote, soundfile,
+  model-specific utils) lazily inside `setup()` or behind a guarded import,
+  never unconditionally at module top level. A top-level `import <optional_pkg>`
+  breaks `import nemo_curator.stages.audio...` on any install without that extra
+  (including CPU-only / Mac/ARM).
+- Declare the full audio runtime dependency set in the appropriate
+  `pyproject.toml` optional extra (e.g. a CUDA extra); add an import smoke test.
+  Don't rely on an external Docker image to supply deps a reusable stage needs.
+- Avoid hard version pins (`pkg==x.y.z`) and global overrides for libraries
+  shared across modalities (`transformers`, `huggingface-hub`, `accelerate`);
+  prefer compatible ranges.
+- Expose model knobs like `cache_dir` and `trust_remote_code` as parameters
+  rather than hardcoding them.
+- **References:** `nemo_curator/stages/audio/inference/asr/asr_nemo.py` (download-on-node / load-on-worker split, `cache_dir`); `nemo_curator/stages/audio/inference/vad/whisperx_vad.py`; `nemo_curator/stages/audio/inference/speaker_diarization/sortformer.py`, `pyannote.py`; `nemo_curator/stages/audio/filtering/{utmos,sigmos,band}.py`; `pyproject.toml`; rules `resources-configuration.mdc`; README "Stage lifecycle".
+
+### 2.3 Secret-safe logging
+
+- Never log a full resolved config; audio tutorials commonly accept an HF token
+  / cloud credentials. Redact secret-like keys before logging (token, password,
+  secret, credentials, `*_token`/`*_secret`), or log only a non-secret subset,
+  and add a test asserting the secret value never appears in logged output.
+- **References:** `tutorials/audio/*/` (token/credential inputs); any stage that
+  logs configuration; rules `coding-standards.mdc` (loguru).
+
+### 2.4 Waveform/tensor memory and manifest serialization
+
+- Drop waveform arrays / audio tensors from `task.data` once a stage no longer
+  needs them. A leaked numpy `ndarray` / torch tensor that reaches a JSON
+  manifest writer raises `TypeError: Object of type ndarray is not JSON
+  serializable` and can fail an entire shard/batch, or cause OOM at scale. Strip
+  by waveform key plus a duck-typed guard (`.shape`/`.dtype`), and wrap
+  `json.dumps` so a stray non-serializable value reports the offending key.
+- ALM stages that replace the whole dict must `task.data.clear()` +
+  `task.data.update(result)` to preserve the `_AttrDict` type. Remember
+  `_AttrDict.__setattr__` routes writes into the dict, so never stamp private
+  fields onto a task's data dict.
+- Guard audio I/O edge cases: `tarfile.extractfile()` returns `None` for
+  non-regular members even after `isfile()` - check before `.read()`. Handle
+  resampling / channel-count / dtype mismatches explicitly.
+- **References:** `nemo_curator/stages/audio/io/convert.py`; `nemo_curator/stages/audio/io/extract_segments.py`; `nemo_curator/stages/audio/alm/alm_data_builder.py`; `nemo_curator/stages/audio/preprocessing/{mono_conversion,concatenation}.py`; `nemo_curator/tasks/audio_task.py`; README "Memory characteristics".
+
+### 2.5 Tarred / sharded audio I/O
+
+- Keep infrastructure assumptions (specific object-store URLs, auth env vars,
+  cluster-local paths) out of the generic reader/writer. Make remote access
+  pluggable: accept local paths, ordinary cloud URIs, or a user-provided
+  opener/resolver, and isolate environment-specific support behind a small,
+  documented, optional component. For piped/streamed downloads use failure-aware
+  flags + retries so truncated streams fail loudly.
+- Cache the fsspec filesystem object in `setup_on_node` rather than re-creating
+  it per `process()` call (a new HTTP connection per manifest entry is costly on
+  S3/GCS at scale).
+- Manifest/shard writers that loop per task inside `process_batch()` reopen the
+  output repeatedly; group by shard and open each output once per batch. Preserve
+  `.done`/checkpoint markers so runs are resumable.
+- **References:** `nemo_curator/stages/audio/io/extract_segments.py`; `nemo_curator/stages/audio/datasets/{fleurs,readspeech}/create_initial_manifest.py`; `nemo_curator/stages/audio/datasets/file_utils.py`; `nemo_curator/stages/audio/alm/pretrain/io.py`; rules `composite-stage-patterns.mdc`.
+
+### 2.6 Sample-rate and metadata propagation
+
+- Resampling/mono-conversion/segmentation stages must update the sample-rate and
+  duration keys they change and propagate `_metadata` / `_stage_perf`; downstream
+  stages that assume a fixed sample rate must declare it in `inputs()`.
+- **References:** `nemo_curator/stages/audio/preprocessing/mono_conversion.py`; `nemo_curator/stages/audio/tagging/resample_audio.py`; `nemo_curator/stages/audio/segmentation/vad_segmentation.py`; `nemo_curator/stages/audio/postprocessing/timestamp_mapper.py`.
+
+### 2.7 Streaming, throughput, benchmarking
+
+- Streaming execution with per-stage resource configuration is the default model
+  for scalable audio pipelines.
+- Emit stage metrics through the supported metrics path, JSON-shaped. For
+  multi-node audio runs include worker/node/GPU identity and latency
+  distributions (p50/p95/p99) plus audio-seconds-per-wall-second, not just
+  totals/averages.
+- Standalone audio benchmark scripts belong in the `benchmarking/` flow with
+  config entries and comparable parameters - not freestanding.
+- **References:** `nemo_curator/stages/audio/metrics/{wer,bandwidth,squim}.py`; `benchmarking/AUDIO_PROFILING.md`, `benchmarking/ALM_BENCHMARK.md`, `benchmarking/README.md`; rules `executors.mdc`.
+
+### 2.8 Tutorials and docs
+
+- Audio tutorials (`tutorials/audio/`) must be runnable, minimal, and use public
+  APIs only - no private APIs, no machine-local absolute paths, no committed
+  noisy notebook output, no executor mismatch between a notebook and its CLI.
+- Don't duplicate pipeline-construction logic between a tutorial entrypoint and
+  an `examples/` script; extract one shared builder both call.
+- Every config key documented for users must actually reach a stage; validate
+  config-to-stage mapping and fail (or warn) on unused keys.
+- Don't commit generated analysis/scratch docs into the source tree.
+- **References:** `tutorials/audio/README.md`; `tutorials/audio/{alm,audio_pretrain,callhome_diar,fleurs,readspeech,single_speaker_filter,tagging}/README.md`; `nemo_curator/stages/audio/README.md`.
+
+### 2.9 Tests and standards (CONTRIBUTING + coding-standards)
+
+- Source changes need accompanying tests; `tests/stages/audio/` mirrors
+  `nemo_curator/stages/audio/`. The project enforces 80% coverage under
+  `nemo_curator/`. GPU-only audio tests use `@pytest.mark.gpu`; reuse fixtures
+  under `tests/fixtures/audio/` and sample data under `tests/data/audio/`.
+- Every non-empty Python file carries the NVIDIA Apache-2.0 copyright header.
+- Ruff clean at 119-char lines; type annotations on functions; loguru for logs.
+- Every commit is DCO signed-off.
+- **References:** `CONTRIBUTING.md`; rules `coding-standards.mdc`, `modality-structure.mdc`; `tests/stages/audio/`.
+
+### 2.10 PR size and reviewability
+
+- Large multi-stage audio PRs are hard to review and defend. Prefer splitting by
+  stage / module / tutorial / benchmark. If a single PR must remain, organize it
+  as if sliced: no duplicated builder logic, clear module ownership, a reviewable
+  commit sequence.
+
+### Smaller recurring nits
+
+- Missing copyright headers on new files.
+- Top-level imports that should be lazy/guarded (very common on audio model modules).
+- Long inline prompt/config strings that should be module constants.
+- Dataclass fields validated in `__post_init__` that could just be typed.
+- Dead conditionals / redundant truthiness checks.
+- Pytests added for tutorials that don't warrant them.
+- Debug-only knobs (e.g. `max_utterances_per_shard`) presented as features - document them as debug throttles.
+- `trust_remote_code=True` hardcoded on a model load - expose as a parameter.
+
+---
+
+## 3. Severity mapping
+
+| Severity | Meaning |
+|----------|---------|
+| P0 | Merge blocker: crash/data-loss on a valid audio config, secret leaked to logs, import breaks a supported install |
+| P1 | Fix before merge: undeclared audio deps, missing required tests/coverage, env-specific I/O in reusable reader, PR too large |
+| P2 | Should fix: duplicated pipeline construction, inefficient writer batching, incomplete throughput metrics, documented-but-unwired config |
+| P3 | Nice to have: temporary model/compat workaround needing test+comment, debug knob docs |
+
+---
+
+## 4. Pre-review corpus: learn from post-#1608 audio PRs
+
+PR [#1608](https://github.com/NVIDIA-NeMo/Curator/pull/1608) (`AudioBatch ->
+AudioTask` redesign) reset the audio stage contracts. Reviewer feedback on audio
+PRs **after** #1608 is the best predictor of what to flag next. Before reviewing
+a new PR - especially the first time you review audio in this repo - build the
+consolidated corpus of that feedback:
+
+```bash
+# 1) discover audio PRs after #1608 (open + closed/merged) and pull their
+#    reviews + inline comments into .curator-pr-review/audio-corpus/
+.cursor/skills/review-curator-audio-pr/scripts/pull_audio_pr_corpus.sh --since 1608
+
+# 2) render one consolidated, reviewer-comment corpus
+.cursor/skills/review-curator-audio-pr/scripts/build_corpus.py
+```
+
+`pull_audio_pr_corpus.sh` lists every PR with number > `--since` (PR numbers are
+monotonic in time, so number > 1608 == opened after #1608), keeps the ones whose
+changed files touch audio paths, and pulls each one's reviews, inline comments,
+and issue comments. `build_corpus.py` writes
+`.curator-pr-review/audio-corpus/audio_pr_corpus_<date>.md`: one section per
+audio PR (number, title, state, author, link) with every reviewer comment
+verbatim, anchored to `path:line`, plus a recurring-themes tally.
+
+Use the corpus to (a) recognize patterns reviewers repeatedly raise (the lenses
+in section 2 came from exactly this), and (b) check whether the PR in front of
+you repeats a mistake already called out elsewhere. It is read-only context; it
+never auto-posts anything. Re-run periodically to keep it current.
+
+---
+
+## 5. GitHub data + scripts reference
+
+`scripts/ensure_repo.sh [CLONE_DIR]` - reuse an existing Curator checkout or
+shallow-clone one; prints `CURATOR_REPO=<path>` on its last line.
+
+`scripts/pr_review_pull.sh <N> [--outdir DIR] [--repo OWNER/REPO]` pulls into the
+scratch outdir:
+
+| File | Endpoint |
+|------|----------|
+| `pr<N>_gh_latest.json` | `gh pr view <N> --json ...` |
+| `pr<N>_reviews_latest.json` | `repos/:owner/:repo/pulls/<N>/reviews` |
+| `pr<N>_review_comments_latest.json` | `repos/:owner/:repo/pulls/<N>/comments` (inline) |
+| `pr<N>_issue_comments_latest.json` | `repos/:owner/:repo/issues/<N>/comments` |
+| `pr<N>_files_latest.json` | `repos/:owner/:repo/pulls/<N>/files` |
+| `pr<N>_commits_latest.json` | `repos/:owner/:repo/pulls/<N>/commits` |
+| `pr<N>_review_threads_latest.json` | GraphQL `pullRequest.reviewThreads` (isResolved/isOutdated) |
+
+The REST inline-comments endpoint does not expose resolve/outdate state; the
+GraphQL thread payload does. The builder joins them by comment `databaseId`
+(== REST `id`), falling back to a `(path, body-prefix)` match when an older
+thread dump lacks `databaseId`.
+
+`scripts/build_digest.py <N> [--outdir DIR] [--today YYYY-MM-DD] [--prev-head SHA] [--baseline-ts TS]`
+renders the working digest + the prior-open-threads context file (you add your
+own findings).
+
+`scripts/pull_audio_pr_corpus.sh [--since N] [--outdir DIR] [--repo OWNER/REPO] [--limit N]`
+and `scripts/build_corpus.py [--outdir DIR] [--today YYYY-MM-DD]` build the
+post-#1608 corpus (section 4).
+
+Default outdir: `.curator-pr-review/` (scratch; gitignored, safe to delete).
diff --git a/.cursor/skills/review-curator-audio-pr/recurring-themes.md b/.cursor/skills/review-curator-audio-pr/recurring-themes.md
deleted file mode 100644
index 38b5c826e9..0000000000
--- a/.cursor/skills/review-curator-audio-pr/recurring-themes.md
+++ /dev/null
@@ -1,142 +0,0 @@
-# Review lenses - recurring themes in Curator audio PRs
-
-Apply each lens to the diff and cite exact `path:line` evidence. These extend
-the always-on rules in `.cursor/rules/` and the audio guide at
-`nemo_curator/stages/audio/README.md`; when a change violates a rule, link it by
-name.
-
-## Stage contracts and reuse
-
-- An audio stage subclasses `ProcessingStage[InputTask, OutputTask]` with
-  explicit `inputs()` / `outputs()`, a meaningful `name`, and appropriate
-  `resources` / `batch_size` (set as class attributes / dataclass fields, never
-  as the read-only `_name` / `_resources` / `_batch_size` properties).
-- GPU/model work (ASR, VAD, diarization, tagging inference) belongs in
-  `process_batch()` with a real `batch_size`; `__init__` stays lightweight (it
-  runs on the driver).
-- Reuse existing audio readers, manifest writers, and shared model abstractions
-  instead of bespoke orchestration. Model a tarred/sharded reader as a
-  `CompositeStage` = a discovery stage plus a shard-reader fan-out.
-- If `inputs()` does not declare a required key (e.g. `waveform_key`,
-  `sample_rate_key`), runtime validation will not check it, and a direct
-  `task.data[key]` access then raises an opaque `KeyError` instead of a clear
-  validation error. Declare required keys.
-- Fan-out shard readers need deterministic child IDs and must propagate
-  `_metadata` / `_stage_perf`. First-stage shard discovery from an empty input
-  should constrain placement (e.g. one worker per node) so discovery is not
-  duplicated across every worker.
-
-## Setup / teardown lifecycle and audio dependencies
-
-- `setup_on_node()` = once-per-node prep (model download / shared cache).
-  `setup()` = per-worker init (model load, GPU allocation). `teardown()` frees
-  GPU memory. Don't conflate them.
-- Import heavy/optional audio deps (vLLM, NeMo, model-specific utils,
-  soundfile/lhotse, fasttext) lazily inside `setup()` or behind a guarded
-  import, never unconditionally at module top level, when they ship only with an
-  optional extra. A top-level `import <optional_pkg>` breaks
-  `import nemo_curator.stages.audio...` on any install without that extra
-  (including CPU-only / Mac/ARM).
-- Declare the full audio runtime dependency set in the appropriate
-  `pyproject.toml` optional extra (e.g. an audio CUDA extra); add an import
-  smoke test under that extra. Don't rely on an external Docker image to supply
-  dependencies a reusable audio stage needs.
-- Avoid hard version pins (`pkg==x.y.z`) and global overrides for libraries
-  shared across modalities (`transformers`, `huggingface-hub`, `accelerate`);
-  prefer compatible ranges so other extras don't break.
-
-## Secret-safe logging
-
-- Never log a full resolved config; audio tutorials commonly accept an HF token
-  / cloud credentials. Redact secret-like keys before logging (token, password,
-  secret, credentials, `*_token`/`*_secret`), or log only a non-secret subset,
-  and add a test asserting the secret value never appears in logged output.
-
-## Waveform/tensor memory and manifest serialization
-
-- Drop waveform arrays / audio tensors from `task.data` once a stage no longer
-  needs them. A leaked numpy `ndarray` or torch tensor that reaches a JSON
-  manifest writer raises `TypeError: Object of type ndarray is not JSON
-  serializable` and can fail the entire shard/batch, or cause OOM at scale.
-  Strip by waveform key plus a duck-typed guard (`.shape`/`.dtype`), and wrap
-  `json.dumps` so a stray non-serializable value reports the offending key.
-- Guard audio I/O edge cases: `tarfile.extractfile()` returns `None` for
-  non-regular members even after `isfile()` - check before `.read()`. Handle
-  resampling / channel-count / dtype mismatches explicitly.
-
-## Tarred / sharded audio I/O
-
-- Keep infrastructure assumptions (specific object-store URLs, auth env vars,
-  cluster-local paths) out of the generic reader/writer. Make remote access
-  pluggable: accept local paths, ordinary cloud URIs, or a user-provided
-  opener/resolver, and isolate any environment-specific support behind a small,
-  documented, optional component. For piped/streamed downloads use
-  failure-aware flags + retries so truncated streams fail loudly.
-- Manifest/shard writers that loop per task inside `process_batch()` reopen the
-  output file repeatedly; group by shard and open each output once per batch.
-  Preserve `.done`/checkpoint markers so runs are resumable.
-
-## Streaming, throughput, benchmarking
-
-- Streaming execution with per-stage resource configuration is the default
-  model for scalable audio pipelines (see `executors.mdc`).
-- Emit stage metrics through the supported metrics path (`metrics/`), JSON-
-  shaped. For multi-node audio runs include worker/node/GPU identity and latency
-  distributions (p50/p95/p99) plus audio-seconds-per-wall-second, not just
-  totals/averages.
-- Standalone audio benchmark scripts belong in the `benchmarking/` flow
-  (`AUDIO_PROFILING.md`, `ALM_BENCHMARK.md`) with config entries and comparable
-  parameters - not freestanding.
-
-## Tutorials and docs
-
-- Audio tutorials (`tutorials/audio/`) must be runnable, minimal, and use public
-  APIs only - no private APIs, no machine-local absolute paths, no committed
-  noisy notebook output, no executor mismatch between a notebook and its CLI.
-  Prefer `uv` for environment setup if the tutorial standard does.
-- Don't duplicate pipeline-construction logic between a tutorial entrypoint and
-  an `examples/` script; extract one shared builder both call.
-- Every config key documented for users must actually reach a stage; validate
-  config-to-stage mapping and fail (or warn) on unused keys.
-- Don't commit generated analysis/scratch docs into the source tree.
-
-## Tests and standards (CONTRIBUTING + coding-standards)
-
-- Source changes need accompanying tests; `tests/stages/audio/` mirrors
-  `nemo_curator/stages/audio/`. The project enforces 80% coverage under
-  `nemo_curator/`. GPU-only audio tests use `@pytest.mark.gpu`; reuse fixtures
-  under `tests/fixtures/audio/` and sample data under `tests/data/audio/`.
-- Every non-empty Python file carries the NVIDIA Apache-2.0 copyright header.
-- Ruff clean at 119-char lines; type annotations on functions; loguru for logs.
-- Every commit is DCO signed-off.
-
-## PR size and reviewability
-
-- Large multi-stage audio PRs are hard to review and defend. Prefer splitting by
-  stage / module / tutorial / benchmark (e.g. reader+writer plumbing, then
-  inference stages, then postprocessing, then tutorial/config). If a single PR
-  must remain, organize it as if sliced: no duplicated builder logic, clear
-  module ownership, a reviewable commit sequence.
-
-## Smaller recurring nits
-
-- Missing copyright headers on new files.
-- Top-level imports that should be lazy/guarded (very common on audio model
-  modules).
-- Long inline prompt/config strings that should be module constants.
-- Dataclass fields validated in `__post_init__` that could just be typed.
-- Dead conditionals / redundant truthiness checks.
-- Pytests added for tutorials that don't warrant them.
-- Debug-only knobs (e.g. `max_utterances_per_shard`) presented as features -
-  document them as debug throttles.
-- `trust_remote_code=True` hardcoded on a model load - expose as a parameter so
-  security-conscious users can opt out.
-
-## Severity mapping
-
-| Severity | Meaning |
-|----------|---------|
-| P0 | Merge blocker: crash/data-loss on a valid audio config, secret leaked to logs, import breaks a supported install |
-| P1 | Fix before merge: undeclared audio deps, missing required tests/coverage, env-specific I/O in reusable reader, PR too large |
-| P2 | Should fix: duplicated pipeline construction, inefficient writer batching, incomplete throughput metrics, documented-but-unwired config |
-| P3 | Nice to have: temporary model/compat workaround needing test+comment, debug knob docs |
diff --git a/.cursor/skills/review-curator-audio-pr/reference.md b/.cursor/skills/review-curator-audio-pr/reference.md
deleted file mode 100644
index 42aac6f78f..0000000000
--- a/.cursor/skills/review-curator-audio-pr/reference.md
+++ /dev/null
@@ -1,105 +0,0 @@
-# Reference - NeMo Curator audio PR review
-
-All paths are relative to the repository root. The skill is self-contained: it
-relies only on this repository plus the GitHub CLI (`gh`). Scope: the **audio**
-modality.
-
-## Primary audio guide
-
-- `nemo_curator/stages/audio/README.md` - audio stage contracts, `process` vs
-  `process_batch`, lazy/optional imports, metrics. Read this first for any
-  audio review.
-
-## Canonical contracts (`.cursor/rules/`)
-
-Always-on rules describing how stages and pipelines must be written. Cite them
-when a PR deviates.
-
-| Rule | Covers |
-|------|--------|
-| `processing-stage-patterns.mdc` | `ProcessingStage` subclassing, `inputs()`/`outputs()`, `process()`, lifecycle hooks |
-| `composite-stage-patterns.mdc` | `CompositeStage` decomposition (e.g. a reader = discovery + shard fan-out) |
-| `task-patterns.mdc` | `Task` / typed task subclasses, metadata flow |
-| `executors.mdc` | Xenna / Ray Data backends and when to use each |
-| `resources-configuration.mdc` | `Resources`, `batch_size`, GPU allocation |
-| `pipeline-structure.mdc` | Pipeline assembly |
-| `modality-structure.mdc` | Where modality code/tests/tutorials live |
-| `coding-standards.mdc` | Ruff, copyright header, loguru, pytest, Python versions |
-
-## Audio stage map (diff ground truth)
-
-`nemo_curator/stages/audio/` is organized as:
-
-| Area | Path |
-|------|------|
-| Base contract | `nemo_curator/stages/base.py` |
-| I/O (tarred/sharded readers, manifest writers) | `nemo_curator/stages/audio/io/` |
-| Preprocessing | `nemo_curator/stages/audio/preprocessing/` |
-| Inference: ASR / VAD / speaker diarization | `nemo_curator/stages/audio/inference/{asr,vad,speaker_diarization}/` |
-| Filtering (band, SigMOS, etc.) | `nemo_curator/stages/audio/filtering/` |
-| Segmentation (speaker separation) | `nemo_curator/stages/audio/segmentation/` |
-| Tagging (inference + text) | `nemo_curator/stages/audio/tagging/` |
-| Postprocessing | `nemo_curator/stages/audio/postprocessing/` |
-| Metrics / throughput | `nemo_curator/stages/audio/metrics/` |
-| Datasets (fleurs, readspeech) | `nemo_curator/stages/audio/datasets/` |
-| ALM (audio-language pretrain) | `nemo_curator/stages/audio/alm/` |
-| Composite pipelines | `nemo_curator/stages/audio/advanced_pipelines/` |
-
-Backends (how stages run): `nemo_curator/backends/` (`base.py`, `xenna/`,
-`ray_data/`, `ray_actor_pool/`); pipeline assembly:
-`nemo_curator/pipeline/pipeline.py`; task types: `nemo_curator/tasks/`.
-
-## Audio tutorials
-
-`tutorials/audio/` - `alm/`, `audio_pretrain/`, `callhome_diar/`, `fleurs/`,
-`readspeech/`, `single_speaker_filter/`, `tagging/`. Tutorials must be runnable,
-use public APIs only, and not duplicate pipeline-building logic that belongs in
-a shared builder.
-
-## Audio performance / benchmarking
-
-- `benchmarking/README.md` - how benchmarks are wired and run.
-- `benchmarking/AUDIO_PROFILING.md` - audio benchmark / perf expectations.
-- `benchmarking/ALM_BENCHMARK.md` - ALM pipeline benchmark.
-
-Standalone audio benchmark scripts should be integrated into this flow, not left
-freestanding.
-
-## Audio tests
-
-`tests/stages/audio/` mirrors `nemo_curator/stages/audio/` (io, inference,
-filtering, segmentation, tagging, postprocessing, preprocessing, metrics,
-datasets, alm, advanced_pipelines). Audio fixtures live under
-`tests/fixtures/audio/` and `tests/data/audio/`. GPU-only tests are marked
-`@pytest.mark.gpu`.
-
-## Contribution rules
-
-- `CONTRIBUTING.md` - DCO sign-off (`git commit -s`), PR process, test
-  requirements, coverage threshold, copyright header.
-
-## GitHub data via `gh`
-
-`scripts/pr_review_pull.sh <N>` pulls these into the scratch outdir:
-
-| File | Endpoint |
-|------|----------|
-| `pr<N>_gh_latest.json` | `gh pr view <N> --json ...` |
-| `pr<N>_reviews_latest.json` | `repos/:owner/:repo/pulls/<N>/reviews` |
-| `pr<N>_review_comments_latest.json` | `repos/:owner/:repo/pulls/<N>/comments` (inline) |
-| `pr<N>_issue_comments_latest.json` | `repos/:owner/:repo/issues/<N>/comments` |
-| `pr<N>_files_latest.json` | `repos/:owner/:repo/pulls/<N>/files` |
-| `pr<N>_commits_latest.json` | `repos/:owner/:repo/pulls/<N>/commits` |
-| `pr<N>_review_threads_latest.json` | GraphQL `pullRequest.reviewThreads` (isResolved/isOutdated) |
-
-The REST inline-comments endpoint does not expose resolve/outdate state; the
-GraphQL thread payload does. The builder joins them by comment `databaseId`
-(== REST `id`), falling back to a `(path, body-prefix)` match when an older
-thread dump lacks `databaseId`.
-
-## Scripts
-
-- `scripts/pr_review_pull.sh <N> [--outdir DIR] [--repo OWNER/REPO]` - fetch.
-- `scripts/build_digest.py <N> [--outdir DIR] [--today YYYY-MM-DD] [--prev-head SHA] [--baseline-ts TS]` - render the working digest + the prior-open-threads context file (you add your own findings).
-
-Default outdir: `.curator-pr-review/` (scratch; safe to delete or gitignore).
diff --git a/.cursor/skills/review-curator-audio-pr/scripts/README.md b/.cursor/skills/review-curator-audio-pr/scripts/README.md
index d270082407..9b96ff1a68 100644
--- a/.cursor/skills/review-curator-audio-pr/scripts/README.md
+++ b/.cursor/skills/review-curator-audio-pr/scripts/README.md
@@ -1,12 +1,22 @@
 # Scripts - review-curator-audio-pr
 
-Helpers for a **reviewer** reviewing someone else's audio Curator PR. They
-pull the PR and summarise existing review activity as context; you write the
-actual review findings.
+Helpers for a **reviewer** reviewing someone else's audio Curator PR. They pull
+the PR (and prior audio review history) and summarise it as context; you write
+the actual review findings. All require the GitHub CLI (`gh`) authenticated
+against github.com, and write to a scratch directory (default
+`.curator-pr-review/`) that is safe to delete; do not commit its contents.
 
-Both require the GitHub CLI (`gh`) authenticated against github.com. They write
-to a scratch directory (default `.curator-pr-review/`) that is safe to delete;
-do not commit its contents.
+## ensure_repo.sh
+
+```bash
+eval "$(.cursor/skills/review-curator-audio-pr/scripts/ensure_repo.sh | tail -1)"  # CURATOR_REPO=<path>
+```
+
+Avoids cloning the whole repo unnecessarily: reuses an existing checkout if you
+are already inside one or there is one under the current directory tree
+(bounded by `MAXDEPTH`, default 4); only then `git clone --depth 1` (no full
+history) from <https://github.com/NVIDIA-NeMo/Curator>. Prints
+`CURATOR_REPO=<absolute path>` on the last line.
 
 ## pr_review_pull.sh
 
@@ -36,5 +46,22 @@ still-open threads other reviewers left, so you don't duplicate them).
 
 The thread join uses comment `databaseId` when present, else a
 (path, body-prefix) fallback, so both GraphQL thread-dump shapes classify
-correctly. The scripts fetch any PR; the audio focus is applied in the review
-lenses (see ../recurring-themes.md).
+correctly.
+
+## pull_audio_pr_corpus.sh + build_corpus.py (pre-review learning)
+
+```bash
+.cursor/skills/review-curator-audio-pr/scripts/pull_audio_pr_corpus.sh --since 1608
+.cursor/skills/review-curator-audio-pr/scripts/build_corpus.py
+```
+
+`pull_audio_pr_corpus.sh` lists every PR with number > `--since` (PR numbers are
+monotonic in time, so this is "opened after that PR"), keeps the ones touching
+audio paths, and pulls each one's reviews + inline + issue comments into
+`.curator-pr-review/audio-corpus/`. `build_corpus.py` consolidates them into
+`audio_pr_corpus_<date>.md`: one section per audio PR with every reviewer comment
+verbatim (anchored to path:line) plus a recurring-themes tally. Read-only
+context; it posts nothing. See `../knowledge-sources.md` section 4.
+
+The single-PR scripts fetch any PR; the audio focus is applied by the review
+lenses in `../knowledge-sources.md`.
diff --git a/.cursor/skills/review-curator-audio-pr/scripts/build_corpus.py b/.cursor/skills/review-curator-audio-pr/scripts/build_corpus.py
new file mode 100755
index 0000000000..2df94e3e78
--- /dev/null
+++ b/.cursor/skills/review-curator-audio-pr/scripts/build_corpus.py
@@ -0,0 +1,159 @@
+#!/usr/bin/env python3
+"""Consolidate the audio PR corpus pulled by pull_audio_pr_corpus.sh.
+
+Reads the per-PR JSON in the corpus dir and writes one markdown file with every
+reviewer comment (verbatim, anchored to path:line) grouped by PR, plus a
+recurring-themes keyword tally. Read-only context for pre-review; posts nothing.
+
+Usage: build_corpus.py [--outdir DIR] [--today YYYY-MM-DD]
+"""
+from __future__ import annotations
+
+import argparse
+import datetime as dt
+import json
+import re
+from pathlib import Path
+
+BOT_LOGINS = {"greptile-apps[bot]", "copy-pr-bot[bot]", "github-actions[bot]"}
+
+# (label, regex) recurring-theme buckets, mirroring knowledge-sources.md lenses.
+THEMES = [
+    ("setup/teardown lifecycle", r"setup_on_node|\bsetup\(|teardown|_setup_done"),
+    ("optional/lazy imports", r"top[- ]level import|lazy import|import .* fails|optional (dep|extra)"),
+    ("dependency declaration/pins", r"pyproject|optional[- ]?group|==|version pin|requirement"),
+    ("stage contract inputs/outputs", r"inputs\(\)|outputs\(\)|validate_input|NotImplementedError"),
+    ("batch_size / process_batch", r"batch_size|process_batch"),
+    ("memory / serialization", r"ndarray|json\.dumps|serializ|waveform|tensor|OOM|memory"),
+    ("fsspec / cloud I/O", r"fsspec|url_to_fs|s3|gcs|http"),
+    ("secrets / logging", r"token|secret|credential|password|redact"),
+    ("tests / coverage", r"\btest|coverage|pytest|fixture"),
+    ("copyright / lint", r"copyright|header|ruff|lint"),
+    ("naming / convention", r"naming|rename|convention|AudioTask|AudioBatch"),
+    ("trust_remote_code", r"trust_remote_code"),
+]
+
+
+def load(p: Path):
+    return json.loads(p.read_text()) if p.exists() else []
+
+
+def shorten(s: str, n: int = 1200) -> str:
+    s = (s or "").strip()
+    return s if len(s) <= n else s[:n] + " […]"
+
+
+def main() -> None:
+    ap = argparse.ArgumentParser()
+    ap.add_argument("--outdir", default=".curator-pr-review/audio-corpus")
+    ap.add_argument("--today", default=dt.date.today().isoformat())
+    args = ap.parse_args()
+
+    outdir = Path(args.outdir)
+    nums_file = outdir / "_audio_pr_numbers.txt"
+    if not nums_file.exists():
+        raise SystemExit(f"no {nums_file}; run pull_audio_pr_corpus.sh first")
+    numbers = [int(x) for x in nums_file.read_text().split() if x.strip()]
+    numbers.sort(reverse=True)
+
+    date_us = args.today.replace("-", "_")
+    theme_counts = {label: 0 for label, _ in THEMES}
+    theme_rx = [(label, re.compile(rx, re.I)) for label, rx in THEMES]
+
+    out: list[str] = []
+    out.append(f"# Audio PR review corpus (post-#1608) - {args.today}\n")
+    out.append(
+        "Consolidated reviewer feedback on audio PRs opened after #1608 "
+        "(open + closed/merged). Read-only pre-review context: recognise the "
+        "patterns reviewers repeatedly raise, and check the PR in front of you "
+        "against them. Bot reviewers are marked `[bot]`.\n"
+    )
+    out.append(f"Audio PRs in corpus: **{len(numbers)}** "
+               f"({', '.join('#' + str(n) for n in numbers)})\n")
+
+    per_pr_sections: list[str] = []
+    total_comments = 0
+    for n in numbers:
+        gh = load(outdir / f"pr{n}_gh.json")
+        if isinstance(gh, list):
+            gh = gh[0] if gh else {}
+        reviews = load(outdir / f"pr{n}_reviews.json")
+        rcomments = load(outdir / f"pr{n}_review_comments.json")
+        icomments = load(outdir / f"pr{n}_issue_comments.json")
+
+        author = (gh.get("author") or {}).get("login", "?")
+        state = gh.get("state", "?")
+        title = gh.get("title", "")
+        url = gh.get("url", f"https://github.com/NVIDIA-NeMo/Curator/pull/{n}")
+
+        sec: list[str] = []
+        sec.append(f"## PR #{n} - {title}\n")
+        sec.append(f"- state: **{state}**  author: @{author}  "
+                   f"created: {gh.get('createdAt','?')}  link: {url}\n")
+
+        # Review summaries (non-empty bodies only).
+        rev_bodies = [r for r in reviews if (r.get("body") or "").strip()]
+        if rev_bodies:
+            sec.append("### Review summaries\n")
+            for r in rev_bodies:
+                login = (r.get("user") or {}).get("login", "?")
+                bot = " `[bot]`" if login in BOT_LOGINS else ""
+                sec.append(f"- **@{login}{bot}** [{r.get('state','')}] "
+                           f"{r.get('submitted_at','')}:\n")
+                sec.append(f"  > {shorten(r.get('body'))}\n")
+                for label, rx in theme_rx:
+                    if rx.search(r.get("body") or ""):
+                        theme_counts[label] += 1
+
+        # Inline comments grouped by file.
+        by_file: dict[str, list] = {}
+        for c in rcomments:
+            by_file.setdefault(c.get("path", "?"), []).append(c)
+        if by_file:
+            sec.append("### Inline review comments\n")
+            for path in sorted(by_file):
+                sec.append(f"#### `{path}`\n")
+                for c in sorted(by_file[path],
+                                key=lambda x: (x.get("line") or x.get("original_line") or 0)):
+                    login = (c.get("user") or {}).get("login", "?")
+                    bot = " `[bot]`" if login in BOT_LOGINS else ""
+                    line = c.get("line") or c.get("original_line") or "?"
+                    body = c.get("body") or ""
+                    total_comments += 1
+                    sec.append(f"- **@{login}{bot}** line {line} "
+                               f"([link]({c.get('html_url','')})):\n")
+                    sec.append(f"  > {shorten(body)}\n")
+                    for label, rx in theme_rx:
+                        if rx.search(body):
+                            theme_counts[label] += 1
+
+        # Top-level issue comments (skip pure CI noise).
+        human_ic = [c for c in icomments
+                    if (c.get("user") or {}).get("login") not in BOT_LOGINS
+                    and (c.get("body") or "").strip()]
+        if human_ic:
+            sec.append("### Discussion (top-level)\n")
+            for c in human_ic:
+                login = (c.get("user") or {}).get("login", "?")
+                sec.append(f"- **@{login}** {c.get('created_at','')}: "
+                           f"{shorten(c.get('body'), 600)}\n")
+
+        per_pr_sections.append("\n".join(sec))
+
+    # Theme tally up top.
+    out.append("## Recurring themes (comment hits across the corpus)\n")
+    out.append("| Theme | Comments mentioning it |")
+    out.append("|-------|------------------------|")
+    for label, _ in THEMES:
+        out.append(f"| {label} | {theme_counts[label]} |")
+    out.append(f"\nTotal inline review comments scanned: **{total_comments}**\n")
+    out.append("---\n")
+    out.extend(per_pr_sections)
+
+    outpath = outdir / f"audio_pr_corpus_{date_us}.md"
+    outpath.write_text("\n".join(out) + "\n", encoding="utf-8")
+    print(f"wrote {outpath}  ({outpath.stat().st_size} bytes; {len(numbers)} PRs)")
+
+
+if __name__ == "__main__":
+    main()
diff --git a/.cursor/skills/review-curator-audio-pr/scripts/ensure_repo.sh b/.cursor/skills/review-curator-audio-pr/scripts/ensure_repo.sh
new file mode 100755
index 0000000000..4b9c914c39
--- /dev/null
+++ b/.cursor/skills/review-curator-audio-pr/scripts/ensure_repo.sh
@@ -0,0 +1,57 @@
+#!/usr/bin/env bash
+# Make a local NVIDIA-NeMo/Curator checkout available WITHOUT cloning the whole
+# repo unnecessarily.
+#
+# Strategy (cheapest first):
+#   1. If we are already inside a Curator checkout, use it.
+#   2. Else search the current directory tree (bounded depth) for an existing
+#      Curator checkout and reuse it.
+#   3. Else shallow-clone (--depth 1, no full history) into ./Curator.
+#
+# Prints diagnostics to stderr and, on the LAST stdout line:
+#   CURATOR_REPO=<absolute path to the checkout>
+#
+# Usage: ensure_repo.sh [CLONE_DIR]   (default CLONE_DIR=Curator)
+#   MAXDEPTH=<n>   bound the downward search (default 4)
+#
+# Clone source: https://github.com/NVIDIA-NeMo/Curator
+set -euo pipefail
+
+REPO_URL="https://github.com/NVIDIA-NeMo/Curator.git"
+CLONE_DIR="${1:-Curator}"
+MAXDEPTH="${MAXDEPTH:-4}"
+MARKER="nemo_curator/stages/audio/README.md"
+
+is_curator() { [[ -f "$1/${MARKER}" && -d "$1/.cursor/rules" ]]; }
+abspath()    { (cd "$1" && pwd); }
+
+# 1) Already inside a Curator checkout?
+if top="$(git rev-parse --show-toplevel 2>/dev/null)" && is_curator "${top}"; then
+    echo "ensure_repo: already inside Curator checkout (${top}); not cloning." >&2
+    echo "CURATOR_REPO=${top}"; exit 0
+fi
+if is_curator "."; then
+    echo "ensure_repo: current directory is a Curator checkout; not cloning." >&2
+    echo "CURATOR_REPO=$(abspath .)"; exit 0
+fi
+
+# 2) Existing checkout somewhere under the current tree?
+while IFS= read -r marker; do
+    cand="${marker%/${MARKER}}"
+    if is_curator "${cand}"; then
+        echo "ensure_repo: reusing existing checkout at ${cand}; not cloning." >&2
+        echo "CURATOR_REPO=$(abspath "${cand}")"; exit 0
+    fi
+done < <(find . -maxdepth "${MAXDEPTH}" -path "*/${MARKER}" 2>/dev/null)
+
+# 3) Reuse a prior clone dir if present, else shallow-clone.
+if is_curator "${CLONE_DIR}"; then
+    echo "ensure_repo: reusing ${CLONE_DIR}; not cloning." >&2
+    echo "CURATOR_REPO=$(abspath "${CLONE_DIR}")"; exit 0
+fi
+
+command -v git >/dev/null || { echo "ensure_repo: git not found" >&2; exit 2; }
+echo "ensure_repo: no local checkout found; shallow-cloning ${REPO_URL} -> ${CLONE_DIR}" >&2
+git clone --depth 1 "${REPO_URL}" "${CLONE_DIR}"
+is_curator "${CLONE_DIR}" || { echo "ensure_repo: clone did not produce a valid checkout" >&2; exit 1; }
+echo "CURATOR_REPO=$(abspath "${CLONE_DIR}")"
diff --git a/.cursor/skills/review-curator-audio-pr/scripts/pull_audio_pr_corpus.sh b/.cursor/skills/review-curator-audio-pr/scripts/pull_audio_pr_corpus.sh
new file mode 100755
index 0000000000..50ccb82271
--- /dev/null
+++ b/.cursor/skills/review-curator-audio-pr/scripts/pull_audio_pr_corpus.sh
@@ -0,0 +1,84 @@
+#!/usr/bin/env bash
+# Discover audio-modality PRs opened AFTER a baseline PR (#1608 by default),
+# open or closed/merged, and pull each one's reviews + comments into a corpus
+# directory. Consolidate with build_corpus.py.
+#
+# PR numbers are monotonic in time, so "number > --since" == "opened after that
+# PR". We then keep only PRs whose changed files touch audio paths.
+#
+# Usage: pull_audio_pr_corpus.sh [--since N] [--outdir DIR] [--repo OWNER/REPO] [--limit N]
+#
+# Requires the GitHub CLI (`gh`) authenticated against github.com.
+set -euo pipefail
+
+REPO="NVIDIA-NeMo/Curator"
+SINCE=1608
+OUTDIR=".curator-pr-review/audio-corpus"
+LIMIT=600
+
+while [[ $# -gt 0 ]]; do
+    case "$1" in
+        --since)  SINCE="$2";  shift 2 ;;
+        --outdir) OUTDIR="$2"; shift 2 ;;
+        --repo)   REPO="$2";   shift 2 ;;
+        --limit)  LIMIT="$2";  shift 2 ;;
+        -h|--help)
+            echo "Usage: pull_audio_pr_corpus.sh [--since N] [--outdir DIR] [--repo OWNER/REPO] [--limit N]"; exit 0 ;;
+        *) echo "unknown arg: $1" >&2; exit 2 ;;
+    esac
+done
+
+command -v gh >/dev/null || { echo "error: gh (GitHub CLI) not found" >&2; exit 2; }
+mkdir -p "${OUTDIR}"
+
+# Audio-path test (extended regex over a PR's changed-file list).
+AUDIO_RE='^(nemo_curator/stages/audio/|nemo_curator/tasks/audio_task\.py|tutorials/audio/|tests/stages/audio/|tests/tasks/test_audio|benchmarking/.*([Aa]udio|ALM|alm))'
+
+echo "=== corpus discovery: ${REPO} PRs > #${SINCE} (state=all, limit=${LIMIT}) ===" >&2
+gh pr list --repo "${REPO}" --state all --limit "${LIMIT}" \
+    --json number,title,state,author,createdAt,updatedAt,url \
+    > "${OUTDIR}/_all_prs.json"
+
+# Candidate PR numbers (> SINCE), newest first.
+mapfile -t CANDIDATES < <(python3 - "${OUTDIR}/_all_prs.json" "${SINCE}" <<'PY'
+import json, sys
+prs = json.load(open(sys.argv[1])); since = int(sys.argv[2])
+nums = sorted((p["number"] for p in prs if p["number"] > since), reverse=True)
+print("\n".join(str(n) for n in nums))
+PY
+)
+echo "candidates after #${SINCE}: ${#CANDIDATES[@]}" >&2
+
+AUDIO_NUMS=()
+for n in "${CANDIDATES[@]}"; do
+    files_json="${OUTDIR}/pr${n}_files.json"
+    gh api --paginate "repos/${REPO}/pulls/${n}/files" --jq '[.[].filename]' \
+        > "${files_json}" 2>/dev/null || { echo "  pr${n}: files fetch failed, skip" >&2; continue; }
+    if AUDIO_RE="${AUDIO_RE}" python3 - "${files_json}" <<'PY'
+import json, os, re, sys
+files = json.load(open(sys.argv[1]))
+rx = re.compile(os.environ["AUDIO_RE"])
+sys.exit(0 if any(rx.search(f) for f in files) else 1)
+PY
+    then
+        AUDIO_NUMS+=("${n}")
+        echo "  pr${n}: AUDIO" >&2
+    else
+        rm -f "${files_json}"
+    fi
+done
+
+echo "audio PRs after #${SINCE}: ${#AUDIO_NUMS[@]}" >&2
+printf '%s\n' "${AUDIO_NUMS[@]}" > "${OUTDIR}/_audio_pr_numbers.txt"
+
+for n in "${AUDIO_NUMS[@]}"; do
+    echo "--- pulling pr${n} reviews/comments ---" >&2
+    gh pr view "${n}" --repo "${REPO}" \
+        --json number,title,state,author,createdAt,updatedAt,mergedAt,closedAt,url,body \
+        > "${OUTDIR}/pr${n}_gh.json"
+    gh api --paginate "repos/${REPO}/pulls/${n}/reviews"  > "${OUTDIR}/pr${n}_reviews.json"
+    gh api --paginate "repos/${REPO}/pulls/${n}/comments" > "${OUTDIR}/pr${n}_review_comments.json"
+    gh api --paginate "repos/${REPO}/issues/${n}/comments" > "${OUTDIR}/pr${n}_issue_comments.json"
+done
+
+echo "AUDIO_PR_CORPUS_PULL_DONE  outdir=${OUTDIR}  audio_prs=${#AUDIO_NUMS[@]}" >&2

From b24b6472f8de332a5e09b70a5b235543cf7cde4d Mon Sep 17 00:00:00 2001
From: "aaftaabv@gmail.com" <aaftaabv@gmail.com>
Date: Fri, 5 Jun 2026 15:13:55 +0530
Subject: [PATCH 05/10] Link published fern audio docs and exact dev-guide
 slide in knowledge sources

Audit gap: the knowledge guide cited READMEs, .cursor/rules, and the slides deck but not the published audio documentation set under fern/versions/v26.04/pages/curate-audio. Add a fern doc map (concept -> doc -> code), wire fern refs into the memory/throughput/tutorial lenses, include the readspeech notebook, and pin the slides URL to the exact slide anchor.

Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>
---
 .../knowledge-sources.md                      | 36 ++++++++++++++++---
 1 file changed, 32 insertions(+), 4 deletions(-)

diff --git a/.cursor/skills/review-curator-audio-pr/knowledge-sources.md b/.cursor/skills/review-curator-audio-pr/knowledge-sources.md
index cb76080537..93690f8587 100644
--- a/.cursor/skills/review-curator-audio-pr/knowledge-sources.md
+++ b/.cursor/skills/review-curator-audio-pr/knowledge-sources.md
@@ -22,7 +22,7 @@ the GitHub CLI (`gh`). Scope: the **audio** modality.
 | Source | What it gives you |
 |--------|-------------------|
 | `nemo_curator/stages/audio/README.md` | The audio stage **developer guide** (1000+ lines): CPU vs GPU stages, `process` / `process_batch`, `batch_size`, call chains, per-stage memory characteristics, end-to-end FLEURS & ALM traces, new-stage checklist. Read first. |
-| Audio curation developer guide (slides): <https://docs.google.com/presentation/d/15lJHyAoRTbNFDWq8UJ6czMaUFck3WYPCRaMYPfO_qew/edit> | Design intent and pipeline overviews behind the audio modality (the deck the README operationalizes). |
+| Audio curation developer guide (slides): <https://docs.google.com/presentation/d/15lJHyAoRTbNFDWq8UJ6czMaUFck3WYPCRaMYPfO_qew/edit?slide=id.g3be823d0e3d_0_0#slide=id.g3be823d0e3d_0_0> | Design intent and pipeline overviews behind the audio modality (the deck the README operationalizes). |
 | `nemo_curator/tasks/audio_task.py` | `AudioTask(Task[dict])` + `_AttrDict` task model (one manifest entry per task; dict keys exposed as attributes for `validate_input`). |
 | `nemo_curator/stages/base.py` | `ProcessingStage` contract: `inputs()`/`outputs()`, `process`/`process_batch`, `setup_on_node`/`setup`/`teardown`, `validate_input`. |
 | `CONTRIBUTING.md` | DCO sign-off, PR process, required tests, coverage threshold, copyright header. |
@@ -44,6 +44,34 @@ the GitHub CLI (`gh`). Scope: the **audio** modality.
 | `.cursor/rules/modality-structure.mdc` | Where modality code/tests/tutorials live |
 | `.cursor/rules/coding-standards.mdc` | Ruff, copyright header, loguru, pytest, Python versions |
 
+### Published audio documentation (fern docs site)
+
+The user-facing audio docs live under `fern/versions/<ver>/pages/` (latest
+versioned snapshot: **v26.04**; `fern/versions/main/` holds in-progress docs).
+These are the canonical "what the docs promise" reference - cite them when a PR's
+behavior, config, or API drifts from documented behavior. Paths below are
+relative to `fern/versions/v26.04/pages/`.
+
+| Audio concept / stage | Fern doc(s) | Code it documents |
+|------------------------|-------------|-------------------|
+| Overview / get started | `get-started/audio.mdx`, `curate-audio/index.mdx` | - |
+| Concepts (task model, pipelines, metrics) | `about/concepts/audio/{index,audio-task,curation-pipeline,asr-pipeline,alm-pipeline,manifests-ingest,quality-metrics,text-integration}.mdx` | `nemo_curator/tasks/audio_task.py` |
+| Load data / manifests | `curate-audio/load-data/{index,fleurs-dataset,custom-manifests,local-files}.mdx` | `nemo_curator/stages/audio/datasets/`, `.../io/` |
+| ASR inference | `curate-audio/process-data/asr-inference/{index,nemo-models}.mdx` | `nemo_curator/stages/audio/inference/asr/asr_nemo.py` |
+| Audio analysis (duration, format) | `curate-audio/process-data/audio-analysis/{index,duration-calculation,format-validation}.mdx` | `nemo_curator/stages/audio/common.py` |
+| Quality filtering (band, VAD, SigMOS, UTMOS, speaker separation, preprocessing, data-filter) | `curate-audio/process-data/quality-filtering/{index,band-filter,vad,sigmos,utmos,speaker-separation,preprocessing,audio-data-filter-stage}.mdx` | `.../filtering/`, `.../segmentation/`, `.../preprocessing/`, `.../advanced_pipelines/` |
+| Quality assessment (WER, duration) | `curate-audio/process-data/quality-assessment/{index,wer-filtering,duration-filtering}.mdx` | `nemo_curator/stages/audio/metrics/wer.py`, `.../common.py` |
+| ALM (data builder, overlap filtering) | `curate-audio/process-data/alm/{index,data-builder,overlap-filtering}.mdx` | `nemo_curator/stages/audio/alm/` |
+| Text integration / tagging | `curate-audio/process-data/text-integration/index.mdx` | `nemo_curator/stages/audio/tagging/` |
+| Save / export | `curate-audio/save-export.mdx` | `nemo_curator/stages/audio/io/convert.py` |
+| Tutorials | `curate-audio/tutorials/{index,beginner,readspeech,alm}.mdx` | `tutorials/audio/` |
+| API reference | `api-reference/tasks/audio-task.mdx`, `api-reference/processing-stage.mdx` | `nemo_curator/tasks/audio_task.py`, `nemo_curator/stages/base.py` |
+| Infrastructure (memory, per-stage runtime, backends, GPU, resumable, monitoring) | `reference/infrastructure/{memory-management,per-stage-runtime,execution-backends,gpu-processing,resumable-processing,monitoring}.mdx` | `nemo_curator/backends/` |
+| Release notes / migration (`AudioBatch -> AudioTask`, #1608) | `about/release-notes/{index,migration-guide,migration-faq}.mdx` | see PR #1608 |
+
+When a PR changes audio behavior, check whether the matching fern page above
+needs updating in the same PR - docs drift is a common review finding.
+
 ---
 
 ## 1. Audio code map (diff ground truth)
@@ -152,7 +180,7 @@ canonical sources to cite.
 - Guard audio I/O edge cases: `tarfile.extractfile()` returns `None` for
   non-regular members even after `isfile()` - check before `.read()`. Handle
   resampling / channel-count / dtype mismatches explicitly.
-- **References:** `nemo_curator/stages/audio/io/convert.py`; `nemo_curator/stages/audio/io/extract_segments.py`; `nemo_curator/stages/audio/alm/alm_data_builder.py`; `nemo_curator/stages/audio/preprocessing/{mono_conversion,concatenation}.py`; `nemo_curator/tasks/audio_task.py`; README "Memory characteristics".
+- **References:** `nemo_curator/stages/audio/io/convert.py`; `nemo_curator/stages/audio/io/extract_segments.py`; `nemo_curator/stages/audio/alm/alm_data_builder.py`; `nemo_curator/stages/audio/preprocessing/{mono_conversion,concatenation}.py`; `nemo_curator/tasks/audio_task.py`; README "Memory characteristics"; fern `reference/infrastructure/{memory-management,per-stage-runtime}.mdx`.
 
 ### 2.5 Tarred / sharded audio I/O
 
@@ -187,7 +215,7 @@ canonical sources to cite.
   totals/averages.
 - Standalone audio benchmark scripts belong in the `benchmarking/` flow with
   config entries and comparable parameters - not freestanding.
-- **References:** `nemo_curator/stages/audio/metrics/{wer,bandwidth,squim}.py`; `benchmarking/AUDIO_PROFILING.md`, `benchmarking/ALM_BENCHMARK.md`, `benchmarking/README.md`; rules `executors.mdc`.
+- **References:** `nemo_curator/stages/audio/metrics/{wer,bandwidth,squim}.py`; `benchmarking/AUDIO_PROFILING.md`, `benchmarking/ALM_BENCHMARK.md`, `benchmarking/README.md`; rules `executors.mdc`; fern `reference/infrastructure/{execution-backends,per-stage-runtime,monitoring,gpu-processing}.mdx`.
 
 ### 2.8 Tutorials and docs
 
@@ -199,7 +227,7 @@ canonical sources to cite.
 - Every config key documented for users must actually reach a stage; validate
   config-to-stage mapping and fail (or warn) on unused keys.
 - Don't commit generated analysis/scratch docs into the source tree.
-- **References:** `tutorials/audio/README.md`; `tutorials/audio/{alm,audio_pretrain,callhome_diar,fleurs,readspeech,single_speaker_filter,tagging}/README.md`; `nemo_curator/stages/audio/README.md`.
+- **References:** `tutorials/audio/README.md`; `tutorials/audio/{alm,audio_pretrain,callhome_diar,fleurs,readspeech,single_speaker_filter,tagging}/README.md`; `tutorials/audio/readspeech/readspeech_tutorial.ipynb`; `nemo_curator/stages/audio/README.md`; fern `curate-audio/tutorials/{index,beginner,readspeech,alm}.mdx`.
 
 ### 2.9 Tests and standards (CONTRIBUTING + coding-standards)
 

From 3858d0eaa9e19768cb734371227c3b69a6463dc8 Mon Sep 17 00:00:00 2001
From: "aaftaabv@gmail.com" <aaftaabv@gmail.com>
Date: Fri, 5 Jun 2026 15:21:44 +0530
Subject: [PATCH 06/10] Lead the review with a detailed PR overview before any
 findings

build_digest.py now emits a What this PR does (overview) section first: author description, an areas-touched table, and a plain-language summary placeholder, ahead of PR state, existing reviews/comments, and findings. SKILL.md and templates.md require presenting this overview before any review comments (new Step 6).

Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>
---
 .../skills/review-curator-audio-pr/SKILL.md   | 37 +++++++++-----
 .../scripts/build_digest.py                   | 49 +++++++++++++++++++
 .../review-curator-audio-pr/templates.md      | 29 +++++++++--
 3 files changed, 99 insertions(+), 16 deletions(-)

diff --git a/.cursor/skills/review-curator-audio-pr/SKILL.md b/.cursor/skills/review-curator-audio-pr/SKILL.md
index 30078000b0..f308b2ef30 100644
--- a/.cursor/skills/review-curator-audio-pr/SKILL.md
+++ b/.cursor/skills/review-curator-audio-pr/SKILL.md
@@ -47,16 +47,21 @@ with *"build the post-#1608 audio review corpus first, then review PR <N>"*
 
 ## What you produce
 
-Your review = a set of **findings** (P0-P3), each tied to a `path:line` on the
-PR's current head, that you post as PR review comments. The skill writes two
-helper files into a scratch directory (default `.curator-pr-review/`, override
-with `--outdir`) to support that:
-
-1. `curator_pr<N>_fresh_review_<YYYY_MM_DD>.md` - **your working digest**: PR
+You produce two things, **in this order**: (1) a **detailed overview of what the
+PR does and why** - so the reader understands the change before any critique -
+then (2) your **findings** (P0-P3), each tied to a `path:line` on the PR's
+current head, that you post as PR review comments. Never lead with review
+comments; always explain the PR first.
+
+The skill writes two helper files into a scratch directory (default
+`.curator-pr-review/`, override with `--outdir`) to support that:
+
+1. `curator_pr<N>_fresh_review_<YYYY_MM_DD>.md` - **your working digest**: leads
+   with a **"What this PR does (overview)"** section (the author's description,
+   an areas-touched table, and a plain-language summary you write), then PR
    state, commits, changed-file table, and every existing review/comment grouped
-   by file with OPEN / OUTDATED / RESOLVED status, plus a placeholder for your
-   findings. Read it to understand the PR and to see what other reviewers already
-   raised so you don't repeat them; record your findings at the bottom.
+   by file with OPEN / OUTDATED / RESOLVED status, then a placeholder for your
+   findings. Fill in the overview first, then record your findings at the bottom.
 2. `curator_pr<N>_github_comment_queue_<YYYY_MM_DD>.md` - **prior open threads**:
    a condensed list of the threads other reviewers left that are still
    unresolved on the current head. Scan it before writing your own comments.
@@ -74,7 +79,8 @@ review comments. The review lenses and every doc/code reference live in
 - [ ] 3. Read the diff
 - [ ] 4. Review the changed code through the audio review lenses
 - [ ] 5. Generate the digest + prior-threads file for context
-- [ ] 6. Write up your findings (P0-P3) and post them as review comments
+- [ ] 6. Write the detailed PR overview (present this first)
+- [ ] 7. Write up your findings (P0-P3) and post them after the overview
 ```
 
 ### Step 0 - Locate or shallow-clone the repo
@@ -140,7 +146,16 @@ status: **OPEN** (still unresolved), **OUTDATED** (pre-dates the current head),
 **RESOLVED**. Use the OPEN list to avoid re-raising points another reviewer
 already made.
 
-### Step 6 - Write up and post your findings
+### Step 6 - Summarize the PR in detail (present this first)
+
+Before any critique, write the **"What this PR does (overview)"** section of the
+digest so a reader understands the change end to end: the problem it solves; the
+main audio stages/files it adds or modifies (use the "Areas touched" table the
+builder generates); key design decisions; new dependencies, config, or APIs; and
+the blast radius (what could regress). Lead your review with this overview - the
+author and other reviewers read it before any findings.
+
+### Step 7 - Write up and post your findings
 
 Record each issue as a finding, classified by severity, and post them as a PR
 review (inline comments for specific lines, a top-level summary for the overall
diff --git a/.cursor/skills/review-curator-audio-pr/scripts/build_digest.py b/.cursor/skills/review-curator-audio-pr/scripts/build_digest.py
index 8c129103e1..923323d102 100755
--- a/.cursor/skills/review-curator-audio-pr/scripts/build_digest.py
+++ b/.cursor/skills/review-curator-audio-pr/scripts/build_digest.py
@@ -91,6 +91,24 @@ def p(parts, *lines):
     parts.extend(lines)
 
 
+def area_of(path):
+    """Group a changed file into a coarse audio area for the overview table."""
+    parts = path.split("/")
+    if path.startswith("nemo_curator/stages/audio/") and len(parts) >= 4:
+        return f"stages/audio/{parts[3]}"
+    if path.startswith("nemo_curator/tasks/"):
+        return "tasks"
+    if path.startswith("tutorials/audio/") and len(parts) >= 3:
+        return f"tutorials/audio/{parts[2]}"
+    if path.startswith("tests/"):
+        return "tests"
+    if path.startswith("benchmarking/"):
+        return "benchmarking"
+    if path.startswith("fern/"):
+        return "docs (fern)"
+    return parts[0] if parts else path
+
+
 def build(pr, outdir, today, baseline_ts, prev_head):
     def latest(kind):
         return outdir / f"pr{pr}_{kind}_latest.json"
@@ -130,6 +148,37 @@ def base(kind):
         p(d, f"Previous reviewed head: `{prev_head}`", "")
     p(d, "")
 
+    # ----- PR OVERVIEW (what the PR is about; present BEFORE any review) -----
+    author = (gh.get("author") or {}).get("login", "?")
+    p(d, "## What this PR does (overview)", "")
+    p(d, f"**{gh.get('title','')}** - by @{author}  "
+         f"({gh.get('state')}, +{gh.get('additions')}/-{gh.get('deletions')} "
+         f"across {gh.get('changedFiles')} files, {len(commits)} commits)", "")
+
+    p(d, "### Author's description", "")
+    pr_body = (gh.get("body") or "").strip()
+    p(d, pr_body if pr_body else "_(no PR description provided)_", "")
+
+    p(d, "### Areas touched", "")
+    areas = {}
+    for f in files:
+        e = areas.setdefault(area_of(f["filename"]), {"n": 0, "add": 0, "dele": 0})
+        e["n"] += 1
+        e["add"] += f.get("additions", 0)
+        e["dele"] += f.get("deletions", 0)
+    p(d, "| Area | files | +/- |", "|---|---|---|")
+    for a in sorted(areas):
+        e = areas[a]
+        p(d, f"| {a} | {e['n']} | +{e['add']} -{e['dele']} |")
+    p(d, "")
+
+    p(d, "### Plain-language summary (reviewer writes this BEFORE any findings)", "")
+    p(d, "_In your own words, explain in detail what this PR changes and why: the "
+         "problem it solves, the main audio stages/files it adds or modifies, key "
+         "design decisions, new dependencies/config/APIs, and the blast radius "
+         "(what could regress). Present this overview first; only then move to the "
+         "findings below._", "")
+
     p(d, "## PR state at review time", "")
     p(d, "| Field | Value |", "|---|---|")
     p(d, f"| state | {gh.get('state')} |")
diff --git a/.cursor/skills/review-curator-audio-pr/templates.md b/.cursor/skills/review-curator-audio-pr/templates.md
index f789a09aad..3e0d1b6669 100644
--- a/.cursor/skills/review-curator-audio-pr/templates.md
+++ b/.cursor/skills/review-curator-audio-pr/templates.md
@@ -2,14 +2,17 @@
 
 You are the reviewer. `scripts/build_digest.py` generates files A and B as
 context; you author the findings (section C) and post them as your review.
-Replace `<...>` placeholders. Keep the structure stable across review passes so
-a re-review diffs cleanly and only surfaces what changed.
+The digest **leads with a PR overview** that you must complete and present
+before any review comments. Replace `<...>` placeholders. Keep the structure
+stable across review passes so a re-review diffs cleanly and only surfaces what
+changed.
 
 ## A. Working digest (generated, then you append findings)
 
-File: `curator_pr<N>_fresh_review_<YYYY_MM_DD>.md`. The generated part gives you
-PR state, the diff, and all existing review activity. Add your own **Findings**
-section at the end - that is your review.
+File: `curator_pr<N>_fresh_review_<YYYY_MM_DD>.md`. It **leads with a PR
+overview** (what the change does and why), then PR state, the diff, and all
+existing review activity. Complete the overview's plain-language summary first
+and present it before any findings; add your **Findings** section at the end.
 
 ```markdown
 # Curator PR <N> Fresh Review - <YYYY-MM-DD>
@@ -20,6 +23,22 @@ Current PR head reviewed: `<headRefOid>`
 
 Base recorded by GitHub metadata: `<baseRefOid>`
 
+## What this PR does (overview)
+
+**<title>** - by @<author>  (<state>, +<a>/-<d> across <n> files, <k> commits)
+
+### Author's description
+<verbatim PR body>
+
+### Areas touched
+| Area | files | +/- |
+|---|---|---|
+| stages/audio/<area> | <n> | +<a> -<d> |
+
+### Plain-language summary (write this BEFORE any findings)
+<In your own words: what the PR changes and why, the main stages/files, key
+design decisions, new deps/config/APIs, and blast radius.>
+
 ## PR state at review time
 
 | Field | Value |

From 209332fe93bff9bc021c1d2bee4f86a7a61546c6 Mon Sep 17 00:00:00 2001
From: "aaftaabv@gmail.com" <aaftaabv@gmail.com>
Date: Fri, 5 Jun 2026 15:24:16 +0530
Subject: [PATCH 07/10] Order workflow so the PR overview is written after
 reading knowledge sources

Make reading knowledge-sources.md an explicit Step 4 (canonical docs + code map + lenses) that precedes the detailed PR overview (Step 6); lens application moves into the findings step (Step 7). Overview placeholders in build_digest.py and templates.md note the summary is written after grounding in the knowledge sources.

Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>
---
 .../skills/review-curator-audio-pr/SKILL.md   | 74 ++++++++++++-------
 .../scripts/build_digest.py                   |  3 +-
 .../review-curator-audio-pr/templates.md      |  5 +-
 3 files changed, 52 insertions(+), 30 deletions(-)

diff --git a/.cursor/skills/review-curator-audio-pr/SKILL.md b/.cursor/skills/review-curator-audio-pr/SKILL.md
index f308b2ef30..46430b5fd2 100644
--- a/.cursor/skills/review-curator-audio-pr/SKILL.md
+++ b/.cursor/skills/review-curator-audio-pr/SKILL.md
@@ -34,10 +34,11 @@ you already have and only shallow-clones
 A human reviewer invokes this skill with a prompt like:
 
 > **"Review audio Curator PR <N> with the review-curator-audio-pr skill. Find or
-> shallow-clone the repo, pull the PR and prior review activity, apply the audio
-> review lenses, and give me a review digest plus my own findings (P0-P3) with
-> exact `path:line` evidence and concrete fixes, ready to post as PR comments.
-> Skip anything other reviewers already raised."**
+> shallow-clone the repo, pull the PR and prior review activity, read the
+> knowledge sources, then give me a detailed overview of what the PR does
+> followed by my own findings (P0-P3) with exact `path:line` evidence and
+> concrete fixes, ready to post as PR comments. Skip anything other reviewers
+> already raised."**
 
 Variants: *"re-review PR <N> and show only what changed since the last review"*,
 *"what should I flag on the diarization PR <N>?"*, *"triage open audio PRs and
@@ -48,10 +49,12 @@ with *"build the post-#1608 audio review corpus first, then review PR <N>"*
 ## What you produce
 
 You produce two things, **in this order**: (1) a **detailed overview of what the
-PR does and why** - so the reader understands the change before any critique -
-then (2) your **findings** (P0-P3), each tied to a `path:line` on the PR's
-current head, that you post as PR review comments. Never lead with review
-comments; always explain the PR first.
+PR does and why** - written only after you have read the knowledge sources (the
+canonical docs and review lenses), so the reader understands the change before
+any critique - then (2) your **findings** (P0-P3), each tied to a `path:line` on
+the PR's current head, that you post as PR review comments. Never lead with
+review comments; ground yourself in the knowledge sources, then explain the PR,
+then give findings.
 
 The skill writes two helper files into a scratch directory (default
 `.curator-pr-review/`, override with `--outdir`) to support that:
@@ -77,10 +80,10 @@ review comments. The review lenses and every doc/code reference live in
 - [ ] 1. Identify the PR and confirm it touches audio paths
 - [ ] 2. Pull fresh GitHub data (gh)
 - [ ] 3. Read the diff
-- [ ] 4. Review the changed code through the audio review lenses
+- [ ] 4. Read the knowledge sources (canonical docs + review lenses) to ground yourself
 - [ ] 5. Generate the digest + prior-threads file for context
-- [ ] 6. Write the detailed PR overview (present this first)
-- [ ] 7. Write up your findings (P0-P3) and post them after the overview
+- [ ] 6. Write the detailed PR overview - only after steps 3-4 - present this first
+- [ ] 7. Review through the lenses, write up findings (P0-P3), and post them after the overview
 ```
 
 ### Step 0 - Locate or shallow-clone the repo
@@ -125,15 +128,22 @@ gh pr diff <N> --repo NVIDIA-NeMo/Curator        # works on a shallow clone
 shallow clone from step 0. Always review the actual diff, never the comment text
 alone. Use `main` as the baseline for "how the audio stages already do this".
 
-### Step 4 - Review through the audio lenses
+### Step 4 - Read the knowledge sources (ground yourself first)
 
-Apply the lenses in [knowledge-sources.md](knowledge-sources.md) section 2 -
-each lens links the audio code, README section, and `.cursor/rules` contract it
-governs: stage contracts, setup/teardown lifecycle, audio optional-dependency
-hygiene (vLLM, NeMo, model utils), secret-safe logging, waveform/tensor memory
-and manifest serialization, tarred/sharded audio I/O, sample-rate and metadata
-propagation, streaming/throughput, tutorials/docs, tests/coverage, and PR
-reviewability. Cite the linked source by name whenever the PR deviates.
+Before you explain or judge anything, read
+[knowledge-sources.md](knowledge-sources.md) so your overview and findings are
+grounded in the canonical material, not guesswork:
+
+- **section 0** - canonical docs: the audio stage developer guide
+  (`nemo_curator/stages/audio/README.md`), the developer-guide slides, the
+  published fern audio docs, `.cursor/rules/*.mdc`, and `CONTRIBUTING.md`.
+- **section 1** - the audio code map, so you know where each changed file sits.
+- **section 2** - the review lenses (you apply these in step 7).
+
+Open the specific sources the diff touches (e.g. the fern page and the code for
+the stage being changed). First time reviewing audio in this repo, optionally
+build the post-#1608 reviewer-comment corpus (section 4) to see what reviewers
+repeatedly raise. Only after this grounding do you write the overview (step 6).
 
 ### Step 5 - Generate the context files
 
@@ -148,19 +158,29 @@ already made.
 
 ### Step 6 - Summarize the PR in detail (present this first)
 
-Before any critique, write the **"What this PR does (overview)"** section of the
-digest so a reader understands the change end to end: the problem it solves; the
+Now that you have read the diff (step 3) and the knowledge sources (step 4),
+write the **"What this PR does (overview)"** section of the digest so a reader
+understands the change end to end: the problem it solves; the
 main audio stages/files it adds or modifies (use the "Areas touched" table the
 builder generates); key design decisions; new dependencies, config, or APIs; and
 the blast radius (what could regress). Lead your review with this overview - the
 author and other reviewers read it before any findings.
 
-### Step 7 - Write up and post your findings
-
-Record each issue as a finding, classified by severity, and post them as a PR
-review (inline comments for specific lines, a top-level summary for the overall
-verdict). Each finding cites the exact `path:line-range` on the current head and
-proposes a concrete fix.
+### Step 7 - Review through the lenses, then write up and post your findings
+
+Apply the lenses in [knowledge-sources.md](knowledge-sources.md) section 2 to the
+diff - each lens links the audio code, README section, and `.cursor/rules`
+contract it governs: stage contracts, setup/teardown lifecycle, audio
+optional-dependency hygiene (vLLM, NeMo, model utils), secret-safe logging,
+waveform/tensor memory and manifest serialization, tarred/sharded audio I/O,
+sample-rate and metadata propagation, streaming/throughput, tutorials/docs,
+tests/coverage, and PR reviewability. Cite the linked source by name whenever
+the PR deviates.
+
+Record each issue as a finding, classified by severity, and post them **after
+the overview** as a PR review (inline comments for specific lines, a top-level
+summary for the overall verdict). Each finding cites the exact `path:line-range`
+on the current head and proposes a concrete fix.
 
 - **P0** - merge blocker: data loss / crash on a valid audio config (e.g. a
   manifest writer that crashes on a kept waveform), a secret (HF token) leaked
diff --git a/.cursor/skills/review-curator-audio-pr/scripts/build_digest.py b/.cursor/skills/review-curator-audio-pr/scripts/build_digest.py
index 923323d102..64ccba1fe0 100755
--- a/.cursor/skills/review-curator-audio-pr/scripts/build_digest.py
+++ b/.cursor/skills/review-curator-audio-pr/scripts/build_digest.py
@@ -173,7 +173,8 @@ def base(kind):
     p(d, "")
 
     p(d, "### Plain-language summary (reviewer writes this BEFORE any findings)", "")
-    p(d, "_In your own words, explain in detail what this PR changes and why: the "
+    p(d, "_After reading the knowledge sources (knowledge-sources.md), explain in "
+         "detail, in your own words, what this PR changes and why: the "
          "problem it solves, the main audio stages/files it adds or modifies, key "
          "design decisions, new dependencies/config/APIs, and the blast radius "
          "(what could regress). Present this overview first; only then move to the "
diff --git a/.cursor/skills/review-curator-audio-pr/templates.md b/.cursor/skills/review-curator-audio-pr/templates.md
index 3e0d1b6669..a9cc9548f9 100644
--- a/.cursor/skills/review-curator-audio-pr/templates.md
+++ b/.cursor/skills/review-curator-audio-pr/templates.md
@@ -36,8 +36,9 @@ Base recorded by GitHub metadata: `<baseRefOid>`
 | stages/audio/<area> | <n> | +<a> -<d> |
 
 ### Plain-language summary (write this BEFORE any findings)
-<In your own words: what the PR changes and why, the main stages/files, key
-design decisions, new deps/config/APIs, and blast radius.>
+<In your own words, after reading the knowledge sources: what the PR changes and
+why, the main stages/files, key design decisions, new deps/config/APIs, and
+blast radius.>
 
 ## PR state at review time
 

From f8ae3bd3b43c03e390ececdb7dfccd71d1794a80 Mon Sep 17 00:00:00 2001
From: "aaftaabv@gmail.com" <aaftaabv@gmail.com>
Date: Tue, 9 Jun 2026 17:57:28 +0530
Subject: [PATCH 08/10] review-curator-audio-pr: audio-only enforcement,
 incremental corpus pulls, doc cosmetics

Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>
---
 .../skills/review-curator-audio-pr/SKILL.md   | 18 ++++---
 .../knowledge-sources.md                      |  9 ++--
 .../review-curator-audio-pr/scripts/README.md | 15 ++++--
 .../scripts/build_digest.py                   | 16 ++++++
 .../scripts/pr_review_pull.sh                 | 21 +++++++-
 .../scripts/pull_audio_pr_corpus.sh           | 50 ++++++++++++++++---
 .../review-curator-audio-pr/templates.md      | 16 ++++--
 7 files changed, 117 insertions(+), 28 deletions(-)

diff --git a/.cursor/skills/review-curator-audio-pr/SKILL.md b/.cursor/skills/review-curator-audio-pr/SKILL.md
index 46430b5fd2..241a61a29e 100644
--- a/.cursor/skills/review-curator-audio-pr/SKILL.md
+++ b/.cursor/skills/review-curator-audio-pr/SKILL.md
@@ -19,10 +19,11 @@ Curator PR.** Given a PR number, it helps you understand the change, see what
 other reviewers already said, and write your own review. It does not help a PR
 author reply to reviewers - it helps you *produce* the review.
 
-Scope: the **audio** modality - code under `nemo_curator/stages/audio/`,
-`tutorials/audio/`, audio tests under `tests/stages/audio/`, and audio
-benchmarks. For non-audio PRs the stage-contract lenses still apply, but the
-audio-specific guidance below will not.
+Scope: the **audio** modality only - code under `nemo_curator/stages/audio/`,
+`nemo_curator/tasks/audio_task.py`, `tutorials/audio/`, audio tests under
+`tests/stages/audio/`, and audio benchmarks. This skill is **audio-only**: if a
+PR touches no audio path, the pull step refuses to run (step 2) - review it
+with a modality-appropriate tool instead.
 
 Requirements: the GitHub CLI (`gh`) authenticated against `github.com`, and
 `git`. You do **not** need to pre-clone the repo - step 0 reuses any checkout
@@ -101,10 +102,11 @@ If you are already in the checkout that contains this skill, you can skip this.
 
 ### Step 1 - Identify the PR
 
-`gh pr view <N> --repo NVIDIA-NeMo/Curator`. Confirm the diff touches audio
-paths (`nemo_curator/stages/audio/`, `nemo_curator/tasks/audio_task.py`,
-`tutorials/audio/`, `tests/stages/audio/`, audio benchmarks); if it does not,
-this audio skill is the wrong lens.
+`gh pr view <N> --repo NVIDIA-NeMo/Curator`. This skill is **audio-only**: the
+pull step (step 2) aborts if the PR touches no audio path
+(`nemo_curator/stages/audio/`, `nemo_curator/tasks/audio_task.py`,
+`tutorials/audio/`, `tests/stages/audio/`, audio benchmarks). If it aborts, the
+PR is out of scope for this skill, so stop and do not review it here.
 
 ### Step 2 - Pull fresh GitHub data
 
diff --git a/.cursor/skills/review-curator-audio-pr/knowledge-sources.md b/.cursor/skills/review-curator-audio-pr/knowledge-sources.md
index 93690f8587..e2ef795727 100644
--- a/.cursor/skills/review-curator-audio-pr/knowledge-sources.md
+++ b/.cursor/skills/review-curator-audio-pr/knowledge-sources.md
@@ -291,7 +291,9 @@ consolidated corpus of that feedback:
 `pull_audio_pr_corpus.sh` lists every PR with number > `--since` (PR numbers are
 monotonic in time, so number > 1608 == opened after #1608), keeps the ones whose
 changed files touch audio paths, and pulls each one's reviews, inline comments,
-and issue comments. `build_corpus.py` writes
+and issue comments. It is incremental - reruns skip PRs already on disk and only
+fetch new ones (use `--refresh` to re-pull, e.g. to refresh open PRs).
+`build_corpus.py` writes
 `.curator-pr-review/audio-corpus/audio_pr_corpus_<date>.md`: one section per
 audio PR (number, title, state, author, link) with every reviewer comment
 verbatim, anchored to `path:line`, plus a recurring-themes tally.
@@ -330,8 +332,9 @@ thread dump lacks `databaseId`.
 renders the working digest + the prior-open-threads context file (you add your
 own findings).
 
-`scripts/pull_audio_pr_corpus.sh [--since N] [--outdir DIR] [--repo OWNER/REPO] [--limit N]`
-and `scripts/build_corpus.py [--outdir DIR] [--today YYYY-MM-DD]` build the
+`scripts/pull_audio_pr_corpus.sh [--since N] [--outdir DIR] [--repo OWNER/REPO] [--limit N] [--refresh]`
+(incremental - skips PRs already on disk; `--refresh` re-pulls) and
+`scripts/build_corpus.py [--outdir DIR] [--today YYYY-MM-DD]` build the
 post-#1608 corpus (section 4).
 
 Default outdir: `.curator-pr-review/` (scratch; gitignored, safe to delete).
diff --git a/.cursor/skills/review-curator-audio-pr/scripts/README.md b/.cursor/skills/review-curator-audio-pr/scripts/README.md
index 9b96ff1a68..e210b0c062 100644
--- a/.cursor/skills/review-curator-audio-pr/scripts/README.md
+++ b/.cursor/skills/review-curator-audio-pr/scripts/README.md
@@ -27,6 +27,8 @@ history) from <https://github.com/NVIDIA-NeMo/Curator>. Prints
 Pulls six REST endpoints (`pr view`, `reviews`, inline `comments`, issue
 `comments`, `files`, `commits`) plus the GraphQL review threads into
 `pr<N>_*_latest.json` (and timestamped snapshots for delta analysis).
+**Audio-only:** it pulls metadata + files first and aborts (exit 3) if the PR
+touches no audio path, so the skill never reviews a non-audio PR.
 
 ## build_digest.py
 
@@ -46,7 +48,7 @@ still-open threads other reviewers left, so you don't duplicate them).
 
 The thread join uses comment `databaseId` when present, else a
 (path, body-prefix) fallback, so both GraphQL thread-dump shapes classify
-correctly.
+correctly. It also re-checks the audio-only guard and aborts on a non-audio PR.
 
 ## pull_audio_pr_corpus.sh + build_corpus.py (pre-review learning)
 
@@ -58,10 +60,15 @@ correctly.
 `pull_audio_pr_corpus.sh` lists every PR with number > `--since` (PR numbers are
 monotonic in time, so this is "opened after that PR"), keeps the ones touching
 audio paths, and pulls each one's reviews + inline + issue comments into
-`.curator-pr-review/audio-corpus/`. `build_corpus.py` consolidates them into
+`.curator-pr-review/audio-corpus/`. It is **incremental**: PRs already pulled are
+skipped (and non-audio candidates are remembered in `_non_audio_prs.txt`), so a
+rerun only fetches PRs new since last time. Pass `--refresh` to re-pull
+everything (e.g. to pick up new comments on still-open PRs).
+`build_corpus.py` consolidates them into
 `audio_pr_corpus_<date>.md`: one section per audio PR with every reviewer comment
 verbatim (anchored to path:line) plus a recurring-themes tally. Read-only
 context; it posts nothing. See `../knowledge-sources.md` section 4.
 
-The single-PR scripts fetch any PR; the audio focus is applied by the review
-lenses in `../knowledge-sources.md`.
+The single-PR scripts are **audio-only**: both `pr_review_pull.sh` and
+`build_digest.py` abort on a PR that touches no audio path. The audio review
+lenses live in `../knowledge-sources.md`.
diff --git a/.cursor/skills/review-curator-audio-pr/scripts/build_digest.py b/.cursor/skills/review-curator-audio-pr/scripts/build_digest.py
index 64ccba1fe0..7b920acd18 100755
--- a/.cursor/skills/review-curator-audio-pr/scripts/build_digest.py
+++ b/.cursor/skills/review-curator-audio-pr/scripts/build_digest.py
@@ -19,10 +19,18 @@
 import argparse
 import datetime as dt
 import json
+import re
 from pathlib import Path
 
 BODY_KEY_LEN = 120
 
+# This skill is audio-only; a PR must touch at least one of these paths.
+AUDIO_RE = re.compile(
+    r"^(nemo_curator/stages/audio/|nemo_curator/tasks/audio_task\.py|"
+    r"tutorials/audio/|tests/stages/audio/|tests/tasks/test_audio|"
+    r"benchmarking/.*([Aa]udio|ALM|alm))"
+)
+
 
 def load(path: Path) -> object:
     return json.loads(path.read_text())
@@ -119,6 +127,14 @@ def latest(kind):
     issue_comments = load(latest("issue_comments"))
     files = load(latest("files"))
     commits = load(latest("commits"))
+
+    if files and not any(AUDIO_RE.search(f.get("filename", "")) for f in files):
+        raise SystemExit(
+            f"PR {pr} touches no audio path; review-curator-audio-pr is audio-only. "
+            "Aborting (audio paths: nemo_curator/stages/audio/, "
+            "nemo_curator/tasks/audio_task.py, tutorials/audio/, "
+            "tests/stages/audio/, audio benchmarks)."
+        )
     threads_path = latest("review_threads")
     idx = build_thread_index(load(threads_path)) if threads_path.exists() else build_thread_index({})
 
diff --git a/.cursor/skills/review-curator-audio-pr/scripts/pr_review_pull.sh b/.cursor/skills/review-curator-audio-pr/scripts/pr_review_pull.sh
index f14f649ed3..9092688f89 100755
--- a/.cursor/skills/review-curator-audio-pr/scripts/pr_review_pull.sh
+++ b/.cursor/skills/review-curator-audio-pr/scripts/pr_review_pull.sh
@@ -48,14 +48,31 @@ GH_FIELDS="number,title,state,isDraft,mergeable,mergeStateStatus,headRefName,hea
 
 pull_endpoint "pr ${PR} metadata" "${OUTDIR}/pr${PR}_gh_${TS}.json" \
     gh pr view "${PR}" --repo "${REPO}" --json "${GH_FIELDS}"
+pull_endpoint "pulls/${PR}/files" "${OUTDIR}/pr${PR}_files_${TS}.json" \
+    gh api --paginate "repos/${REPO}/pulls/${PR}/files"
+
+# This skill is audio-only: abort before pulling anything else if the PR's
+# changed files touch no audio path.
+AUDIO_RE='^(nemo_curator/stages/audio/|nemo_curator/tasks/audio_task\.py|tutorials/audio/|tests/stages/audio/|tests/tasks/test_audio|benchmarking/.*([Aa]udio|ALM|alm))'
+if ! AUDIO_RE="${AUDIO_RE}" python3 - "${OUTDIR}/pr${PR}_files_${TS}.json" <<'PY'
+import json, os, re, sys
+files = [f.get("filename", "") for f in json.load(open(sys.argv[1]))]
+rx = re.compile(os.environ["AUDIO_RE"])
+sys.exit(0 if any(rx.search(f) for f in files) else 1)
+PY
+then
+    echo "error: PR ${PR} touches no audio path; review-curator-audio-pr is audio-only." >&2
+    echo "       audio paths: nemo_curator/stages/audio/, nemo_curator/tasks/audio_task.py," >&2
+    echo "       tutorials/audio/, tests/stages/audio/, audio benchmarks. Aborting." >&2
+    exit 3
+fi
+
 pull_endpoint "pulls/${PR}/reviews" "${OUTDIR}/pr${PR}_reviews_${TS}.json" \
     gh api --paginate "repos/${REPO}/pulls/${PR}/reviews"
 pull_endpoint "pulls/${PR}/comments (inline)" "${OUTDIR}/pr${PR}_review_comments_${TS}.json" \
     gh api --paginate "repos/${REPO}/pulls/${PR}/comments"
 pull_endpoint "issues/${PR}/comments (top-level)" "${OUTDIR}/pr${PR}_issue_comments_${TS}.json" \
     gh api --paginate "repos/${REPO}/issues/${PR}/comments"
-pull_endpoint "pulls/${PR}/files" "${OUTDIR}/pr${PR}_files_${TS}.json" \
-    gh api --paginate "repos/${REPO}/pulls/${PR}/files"
 pull_endpoint "pulls/${PR}/commits" "${OUTDIR}/pr${PR}_commits_${TS}.json" \
     gh api --paginate "repos/${REPO}/pulls/${PR}/commits"
 
diff --git a/.cursor/skills/review-curator-audio-pr/scripts/pull_audio_pr_corpus.sh b/.cursor/skills/review-curator-audio-pr/scripts/pull_audio_pr_corpus.sh
index 50ccb82271..a88cdb8659 100755
--- a/.cursor/skills/review-curator-audio-pr/scripts/pull_audio_pr_corpus.sh
+++ b/.cursor/skills/review-curator-audio-pr/scripts/pull_audio_pr_corpus.sh
@@ -6,7 +6,11 @@
 # PR numbers are monotonic in time, so "number > --since" == "opened after that
 # PR". We then keep only PRs whose changed files touch audio paths.
 #
-# Usage: pull_audio_pr_corpus.sh [--since N] [--outdir DIR] [--repo OWNER/REPO] [--limit N]
+# Incremental by default: PRs already pulled into OUTDIR are skipped, so reruns
+# only fetch PRs that are new since last time. Pass --refresh to re-pull
+# everything (e.g. to pick up new comments on open PRs).
+#
+# Usage: pull_audio_pr_corpus.sh [--since N] [--outdir DIR] [--repo OWNER/REPO] [--limit N] [--refresh]
 #
 # Requires the GitHub CLI (`gh`) authenticated against github.com.
 set -euo pipefail
@@ -15,15 +19,17 @@ REPO="NVIDIA-NeMo/Curator"
 SINCE=1608
 OUTDIR=".curator-pr-review/audio-corpus"
 LIMIT=600
+REFRESH=0
 
 while [[ $# -gt 0 ]]; do
     case "$1" in
-        --since)  SINCE="$2";  shift 2 ;;
-        --outdir) OUTDIR="$2"; shift 2 ;;
-        --repo)   REPO="$2";   shift 2 ;;
-        --limit)  LIMIT="$2";  shift 2 ;;
+        --since)   SINCE="$2";  shift 2 ;;
+        --outdir)  OUTDIR="$2"; shift 2 ;;
+        --repo)    REPO="$2";   shift 2 ;;
+        --limit)   LIMIT="$2";  shift 2 ;;
+        --refresh) REFRESH=1;   shift ;;
         -h|--help)
-            echo "Usage: pull_audio_pr_corpus.sh [--since N] [--outdir DIR] [--repo OWNER/REPO] [--limit N]"; exit 0 ;;
+            echo "Usage: pull_audio_pr_corpus.sh [--since N] [--outdir DIR] [--repo OWNER/REPO] [--limit N] [--refresh]"; exit 0 ;;
         *) echo "unknown arg: $1" >&2; exit 2 ;;
     esac
 done
@@ -31,6 +37,13 @@ done
 command -v gh >/dev/null || { echo "error: gh (GitHub CLI) not found" >&2; exit 2; }
 mkdir -p "${OUTDIR}"
 
+# Cache of candidate PRs already classified as non-audio, so reruns don't
+# re-fetch their file lists. Audio PRs are remembered by their pr<N>_gh.json.
+NONAUDIO_CACHE="${OUTDIR}/_non_audio_prs.txt"
+touch "${NONAUDIO_CACHE}"
+declare -A IS_NONAUDIO=()
+while IFS= read -r x; do [[ -n "${x}" ]] && IS_NONAUDIO["${x}"]=1; done < "${NONAUDIO_CACHE}"
+
 # Audio-path test (extended regex over a PR's changed-file list).
 AUDIO_RE='^(nemo_curator/stages/audio/|nemo_curator/tasks/audio_task\.py|tutorials/audio/|tests/stages/audio/|tests/tasks/test_audio|benchmarking/.*([Aa]udio|ALM|alm))'
 
@@ -51,6 +64,15 @@ echo "candidates after #${SINCE}: ${#CANDIDATES[@]}" >&2
 
 AUDIO_NUMS=()
 for n in "${CANDIDATES[@]}"; do
+    # Incremental: reuse prior classification unless --refresh.
+    if [[ ${REFRESH} -eq 0 ]]; then
+        if [[ -f "${OUTDIR}/pr${n}_gh.json" ]]; then
+            AUDIO_NUMS+=("${n}"); echo "  pr${n}: AUDIO (on disk, skip)" >&2; continue
+        fi
+        if [[ -n "${IS_NONAUDIO[$n]:-}" ]]; then
+            echo "  pr${n}: non-audio (cached, skip)" >&2; continue
+        fi
+    fi
     files_json="${OUTDIR}/pr${n}_files.json"
     gh api --paginate "repos/${REPO}/pulls/${n}/files" --jq '[.[].filename]' \
         > "${files_json}" 2>/dev/null || { echo "  pr${n}: files fetch failed, skip" >&2; continue; }
@@ -65,13 +87,26 @@ PY
         echo "  pr${n}: AUDIO" >&2
     else
         rm -f "${files_json}"
+        if [[ -z "${IS_NONAUDIO[$n]:-}" ]]; then
+            echo "${n}" >> "${NONAUDIO_CACHE}"; IS_NONAUDIO["${n}"]=1
+        fi
     fi
 done
 
 echo "audio PRs after #${SINCE}: ${#AUDIO_NUMS[@]}" >&2
 printf '%s\n' "${AUDIO_NUMS[@]}" > "${OUTDIR}/_audio_pr_numbers.txt"
 
+pulled=0
+skipped=0
 for n in "${AUDIO_NUMS[@]}"; do
+    # Incremental: skip PRs whose data is already on disk unless --refresh.
+    if [[ ${REFRESH} -eq 0 && -f "${OUTDIR}/pr${n}_gh.json" \
+          && -f "${OUTDIR}/pr${n}_reviews.json" \
+          && -f "${OUTDIR}/pr${n}_review_comments.json" \
+          && -f "${OUTDIR}/pr${n}_issue_comments.json" ]]; then
+        echo "--- pr${n}: already on disk, skip (use --refresh to re-pull) ---" >&2
+        skipped=$((skipped + 1)); continue
+    fi
     echo "--- pulling pr${n} reviews/comments ---" >&2
     gh pr view "${n}" --repo "${REPO}" \
         --json number,title,state,author,createdAt,updatedAt,mergedAt,closedAt,url,body \
@@ -79,6 +114,7 @@ for n in "${AUDIO_NUMS[@]}"; do
     gh api --paginate "repos/${REPO}/pulls/${n}/reviews"  > "${OUTDIR}/pr${n}_reviews.json"
     gh api --paginate "repos/${REPO}/pulls/${n}/comments" > "${OUTDIR}/pr${n}_review_comments.json"
     gh api --paginate "repos/${REPO}/issues/${n}/comments" > "${OUTDIR}/pr${n}_issue_comments.json"
+    pulled=$((pulled + 1))
 done
 
-echo "AUDIO_PR_CORPUS_PULL_DONE  outdir=${OUTDIR}  audio_prs=${#AUDIO_NUMS[@]}" >&2
+echo "AUDIO_PR_CORPUS_PULL_DONE  outdir=${OUTDIR}  audio_prs=${#AUDIO_NUMS[@]}  pulled=${pulled}  skipped=${skipped}  refresh=${REFRESH}" >&2
diff --git a/.cursor/skills/review-curator-audio-pr/templates.md b/.cursor/skills/review-curator-audio-pr/templates.md
index a9cc9548f9..16af214cb0 100644
--- a/.cursor/skills/review-curator-audio-pr/templates.md
+++ b/.cursor/skills/review-curator-audio-pr/templates.md
@@ -14,7 +14,7 @@ overview** (what the change does and why), then PR state, the diff, and all
 existing review activity. Complete the overview's plain-language summary first
 and present it before any findings; add your **Findings** section at the end.
 
-```markdown
+````markdown
 # Curator PR <N> Fresh Review - <YYYY-MM-DD>
 
 Review target: https://github.com/NVIDIA-NeMo/Curator/pull/<N>
@@ -94,7 +94,7 @@ Total: <n> comments across <m> threads.
 
 ## Verdict
 <APPROVE / COMMENT / REQUEST CHANGES, and the blockers if any>
-```
+````
 
 ## B. Prior open threads (generated context)
 
@@ -104,7 +104,7 @@ adding your own. This is **context, not your output** - you are not replying to
 these as the author; you read them to avoid duplicating and to check whether the
 author addressed them.
 
-```markdown
+````markdown
 # Curator PR <N> Open Review Threads - <YYYY-MM-DD>
 
 Current PR head: `<headRefOid>`
@@ -121,7 +121,7 @@ Link: <html_url>
 
 ## Stale (outdated/resolved) threads
 - [<OUTDATED/RESOLVED>] `<path>:<line>` @<login> (<html_url>): <one-line body>
-```
+````
 
 ## C. Writing your review comments
 
@@ -145,6 +145,14 @@ Example review comment:
 > Please also add a test under `tests/stages/audio/io/` covering
 > `keep_waveform=True`.
 
+When you have an exact fix, attach a GitHub `suggestion` block so the author can
+apply it in one click (anchor the comment to the line(s) it replaces):
+
+```suggestion
+        record = {k: v for k, v in record.items() if k != "waveform"}
+        line = json.dumps(record)
+```
+
 Then post the set as one PR review: inline comments for line-specific findings,
 a top-level summary carrying the overall verdict (APPROVE / COMMENT / REQUEST
 CHANGES).

From 5a930a7b8a4c6b74eed8089fa354a18f49c5e363 Mon Sep 17 00:00:00 2001
From: "aaftaabv@gmail.com" <aaftaabv@gmail.com>
Date: Tue, 9 Jun 2026 18:17:46 +0530
Subject: [PATCH 09/10] Require post-#1608 audio corpus build before every PR
 review.

Make corpus pull + build_corpus mandatory in the skill workflow so reviewers
cannot skip cross-PR pattern grounding when reviewing audio Curator PRs.

Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>
---
 .../skills/review-curator-audio-pr/SKILL.md   | 69 ++++++++++++-------
 .../knowledge-sources.md                      | 21 +++---
 .../review-curator-audio-pr/scripts/README.md |  8 ++-
 3 files changed, 63 insertions(+), 35 deletions(-)

diff --git a/.cursor/skills/review-curator-audio-pr/SKILL.md b/.cursor/skills/review-curator-audio-pr/SKILL.md
index 241a61a29e..df15ee36fb 100644
--- a/.cursor/skills/review-curator-audio-pr/SKILL.md
+++ b/.cursor/skills/review-curator-audio-pr/SKILL.md
@@ -3,10 +3,11 @@ name: review-curator-audio-pr
 description: >-
   Review someone else's NVIDIA-NeMo/Curator audio-modality pull request. This is
   a reviewer's tool: given a PR number, it locates (or shallow-clones) the repo,
-  pulls the PR, diffs the changed audio code, applies Curator's audio stage
-  contracts and contribution standards, shows what other reviewers have already
-  raised (so you don't duplicate), and helps you produce a structured set of
-  review findings (P0-P3) to post as review comments. Use when you are assigned
+  pulls the PR, builds the post-#1608 audio review corpus, diffs the changed
+  audio code, applies Curator's audio stage contracts and contribution standards,
+  shows what other reviewers have already raised (so you don't duplicate), and
+  helps you produce a structured set of review findings (P0-P3) to post as
+  review comments. Use when you are assigned
   or pick up an audio Curator PR and need to review it (e.g. "review audio PR
   1967", "do a review of this Curator audio PR", "what should I flag on PR
   1898"). It is NOT for a PR author responding to a review.
@@ -35,17 +36,18 @@ you already have and only shallow-clones
 A human reviewer invokes this skill with a prompt like:
 
 > **"Review audio Curator PR <N> with the review-curator-audio-pr skill. Find or
-> shallow-clone the repo, pull the PR and prior review activity, read the
-> knowledge sources, then give me a detailed overview of what the PR does
+> shallow-clone the repo, pull the PR and prior review activity, build the
+> post-#1608 audio review corpus, read the knowledge sources, then give me a
+> detailed overview of what the PR does
 > followed by my own findings (P0-P3) with exact `path:line` evidence and
 > concrete fixes, ready to post as PR comments. Skip anything other reviewers
 > already raised."**
 
 Variants: *"re-review PR <N> and show only what changed since the last review"*,
 *"what should I flag on the diarization PR <N>?"*, *"triage open audio PRs and
-review the smallest one first"*. First-time in a repo, you can prime context
-with *"build the post-#1608 audio review corpus first, then review PR <N>"*
-(see [knowledge-sources.md](knowledge-sources.md) section 4).
+review the smallest one first"*. Every review run must build (or refresh) the
+post-#1608 audio review corpus before findings (see
+[knowledge-sources.md](knowledge-sources.md) section 4).
 
 ## What you produce
 
@@ -80,11 +82,12 @@ review comments. The review lenses and every doc/code reference live in
 - [ ] 0. Locate or shallow-clone the Curator repo
 - [ ] 1. Identify the PR and confirm it touches audio paths
 - [ ] 2. Pull fresh GitHub data (gh)
-- [ ] 3. Read the diff
-- [ ] 4. Read the knowledge sources (canonical docs + review lenses) to ground yourself
-- [ ] 5. Generate the digest + prior-threads file for context
-- [ ] 6. Write the detailed PR overview - only after steps 3-4 - present this first
-- [ ] 7. Review through the lenses, write up findings (P0-P3), and post them after the overview
+- [ ] 3. Build the post-#1608 audio review corpus (required)
+- [ ] 4. Read the diff
+- [ ] 5. Read the knowledge sources (canonical docs + review lenses + corpus) to ground yourself
+- [ ] 6. Generate the digest + prior-threads file for context
+- [ ] 7. Write the detailed PR overview - only after steps 4-5 - present this first
+- [ ] 8. Review through the lenses, write up findings (P0-P3), and post them after the overview
 ```
 
 ### Step 0 - Locate or shallow-clone the repo
@@ -119,7 +122,24 @@ Pulls six REST endpoints (`pr view`, `reviews`, inline `comments`, issue
 `isResolved` / `isOutdated` flags the REST inline endpoint omits. You pull this
 to see prior review activity, not because you own the PR.
 
-### Step 3 - Read the diff
+### Step 3 - Build the post-#1608 audio review corpus (required)
+
+Before you explain or judge anything, build the consolidated reviewer-comment
+corpus so you can spot cross-PR repeated mistakes. Do **not** skip this step,
+even if a prior corpus file exists on disk - rerun the pull (incremental by
+default) and render a fresh consolidated file for today's review:
+
+```bash
+.cursor/skills/review-curator-audio-pr/scripts/pull_audio_pr_corpus.sh --since 1608
+.cursor/skills/review-curator-audio-pr/scripts/build_corpus.py --today <YYYY-MM-DD>
+```
+
+Confirm `.curator-pr-review/audio-corpus/audio_pr_corpus_<date>.md` exists and
+read it before writing findings. If the pull scripts fail, stop and fix the
+failure - do not proceed with a review that skipped the corpus. See
+[knowledge-sources.md](knowledge-sources.md) section 4 for details.
+
+### Step 4 - Read the diff
 
 ```bash
 gh pr diff <N> --repo NVIDIA-NeMo/Curator        # works on a shallow clone
@@ -130,7 +150,7 @@ gh pr diff <N> --repo NVIDIA-NeMo/Curator        # works on a shallow clone
 shallow clone from step 0. Always review the actual diff, never the comment text
 alone. Use `main` as the baseline for "how the audio stages already do this".
 
-### Step 4 - Read the knowledge sources (ground yourself first)
+### Step 5 - Read the knowledge sources (ground yourself first)
 
 Before you explain or judge anything, read
 [knowledge-sources.md](knowledge-sources.md) so your overview and findings are
@@ -140,14 +160,15 @@ grounded in the canonical material, not guesswork:
   (`nemo_curator/stages/audio/README.md`), the developer-guide slides, the
   published fern audio docs, `.cursor/rules/*.mdc`, and `CONTRIBUTING.md`.
 - **section 1** - the audio code map, so you know where each changed file sits.
-- **section 2** - the review lenses (you apply these in step 7).
+- **section 2** - the review lenses (you apply these in step 8).
+- **section 4 + corpus file** - the post-#1608 reviewer-comment corpus from step 3.
 
 Open the specific sources the diff touches (e.g. the fern page and the code for
-the stage being changed). First time reviewing audio in this repo, optionally
-build the post-#1608 reviewer-comment corpus (section 4) to see what reviewers
-repeatedly raise. Only after this grounding do you write the overview (step 6).
+the stage being changed). Check the corpus for recurring themes and whether this
+PR repeats feedback already raised on other audio PRs. Only after this grounding
+do you write the overview (step 7).
 
-### Step 5 - Generate the context files
+### Step 6 - Generate the context files
 
 ```bash
 .cursor/skills/review-curator-audio-pr/scripts/build_digest.py <N> --today <YYYY-MM-DD>
@@ -158,9 +179,9 @@ status: **OPEN** (still unresolved), **OUTDATED** (pre-dates the current head),
 **RESOLVED**. Use the OPEN list to avoid re-raising points another reviewer
 already made.
 
-### Step 6 - Summarize the PR in detail (present this first)
+### Step 7 - Summarize the PR in detail (present this first)
 
-Now that you have read the diff (step 3) and the knowledge sources (step 4),
+Now that you have read the diff (step 4) and the knowledge sources (step 5),
 write the **"What this PR does (overview)"** section of the digest so a reader
 understands the change end to end: the problem it solves; the
 main audio stages/files it adds or modifies (use the "Areas touched" table the
@@ -168,7 +189,7 @@ builder generates); key design decisions; new dependencies, config, or APIs; and
 the blast radius (what could regress). Lead your review with this overview - the
 author and other reviewers read it before any findings.
 
-### Step 7 - Review through the lenses, then write up and post your findings
+### Step 8 - Review through the lenses, then write up and post your findings
 
 Apply the lenses in [knowledge-sources.md](knowledge-sources.md) section 2 to the
 diff - each lens links the audio code, README section, and `.cursor/rules`
diff --git a/.cursor/skills/review-curator-audio-pr/knowledge-sources.md b/.cursor/skills/review-curator-audio-pr/knowledge-sources.md
index e2ef795727..5624b0d189 100644
--- a/.cursor/skills/review-curator-audio-pr/knowledge-sources.md
+++ b/.cursor/skills/review-curator-audio-pr/knowledge-sources.md
@@ -8,8 +8,9 @@ and has five parts:
 2. **Review lenses** - what to check, each linked to the audio code, README
    section, and `.cursor/rules` contract it governs.
 3. **Severity mapping** - P0-P3.
-4. **Pre-review corpus** - pull every audio PR after #1608 (open + closed) with
-   reviewer comments, so you learn the recurring feedback before you review.
+4. **Pre-review corpus (required)** - pull every audio PR after #1608 (open +
+   closed) with reviewer comments and read the consolidated file before you
+   write findings.
 5. **GitHub data + scripts reference** - the `gh` endpoints and helper scripts.
 
 All paths are repo-relative. The skill is self-contained: this repository plus
@@ -271,13 +272,14 @@ canonical sources to cite.
 
 ---
 
-## 4. Pre-review corpus: learn from post-#1608 audio PRs
+## 4. Pre-review corpus (required): learn from post-#1608 audio PRs
 
 PR [#1608](https://github.com/NVIDIA-NeMo/Curator/pull/1608) (`AudioBatch ->
 AudioTask` redesign) reset the audio stage contracts. Reviewer feedback on audio
-PRs **after** #1608 is the best predictor of what to flag next. Before reviewing
-a new PR - especially the first time you review audio in this repo - build the
-consolidated corpus of that feedback:
+PRs **after** #1608 is the best predictor of what to flag next. **Before every
+audio PR review**, build (or refresh) the consolidated corpus of that feedback.
+Do not skip this step, even when a prior `audio_pr_corpus_*.md` already exists
+on disk - rerun the pull (incremental by default) and render today's file:
 
 ```bash
 # 1) discover audio PRs after #1608 (open + closed/merged) and pull their
@@ -301,7 +303,8 @@ verbatim, anchored to `path:line`, plus a recurring-themes tally.
 Use the corpus to (a) recognize patterns reviewers repeatedly raise (the lenses
 in section 2 came from exactly this), and (b) check whether the PR in front of
 you repeats a mistake already called out elsewhere. It is read-only context; it
-never auto-posts anything. Re-run periodically to keep it current.
+never auto-posts anything. If the corpus scripts fail or the consolidated file
+is missing, stop the review and fix the failure before writing findings.
 
 ---
 
@@ -334,7 +337,7 @@ own findings).
 
 `scripts/pull_audio_pr_corpus.sh [--since N] [--outdir DIR] [--repo OWNER/REPO] [--limit N] [--refresh]`
 (incremental - skips PRs already on disk; `--refresh` re-pulls) and
-`scripts/build_corpus.py [--outdir DIR] [--today YYYY-MM-DD]` build the
-post-#1608 corpus (section 4).
+`scripts/build_corpus.py [--outdir DIR] [--today YYYY-MM-DD]` **must** run before
+every review to build the post-#1608 corpus (section 4; SKILL.md step 3).
 
 Default outdir: `.curator-pr-review/` (scratch; gitignored, safe to delete).
diff --git a/.cursor/skills/review-curator-audio-pr/scripts/README.md b/.cursor/skills/review-curator-audio-pr/scripts/README.md
index e210b0c062..dd40c15ef0 100644
--- a/.cursor/skills/review-curator-audio-pr/scripts/README.md
+++ b/.cursor/skills/review-curator-audio-pr/scripts/README.md
@@ -50,11 +50,15 @@ The thread join uses comment `databaseId` when present, else a
 (path, body-prefix) fallback, so both GraphQL thread-dump shapes classify
 correctly. It also re-checks the audio-only guard and aborts on a non-audio PR.
 
-## pull_audio_pr_corpus.sh + build_corpus.py (pre-review learning)
+## pull_audio_pr_corpus.sh + build_corpus.py (required pre-review corpus)
+
+Run these **before every audio PR review** (SKILL.md step 3). Do not skip even
+when `.curator-pr-review/audio-corpus/` already has data - the pull is
+incremental and `build_corpus.py` must render today's consolidated file.
 
 ```bash
 .cursor/skills/review-curator-audio-pr/scripts/pull_audio_pr_corpus.sh --since 1608
-.cursor/skills/review-curator-audio-pr/scripts/build_corpus.py
+.cursor/skills/review-curator-audio-pr/scripts/build_corpus.py --today <YYYY-MM-DD>
 ```
 
 `pull_audio_pr_corpus.sh` lists every PR with number > `--since` (PR numbers are

From cfd4994495cfb608a66e181d82ca27f4a63ff3c0 Mon Sep 17 00:00:00 2001
From: "aaftaabv@gmail.com" <aaftaabv@gmail.com>
Date: Tue, 9 Jun 2026 22:02:39 +0530
Subject: [PATCH 10/10] review-curator-audio-pr: document how to provide a
 gh-authenticated environment

Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>
---
 .../skills/review-curator-audio-pr/SKILL.md   | 25 ++++++++++++++++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/.cursor/skills/review-curator-audio-pr/SKILL.md b/.cursor/skills/review-curator-audio-pr/SKILL.md
index df15ee36fb..4adb8ff4c1 100644
--- a/.cursor/skills/review-curator-audio-pr/SKILL.md
+++ b/.cursor/skills/review-curator-audio-pr/SKILL.md
@@ -26,9 +26,23 @@ Scope: the **audio** modality only - code under `nemo_curator/stages/audio/`,
 PR touches no audio path, the pull step refuses to run (step 2) - review it
 with a modality-appropriate tool instead.
 
-Requirements: the GitHub CLI (`gh`) authenticated against `github.com`, and
-`git`. You do **not** need to pre-clone the repo - step 0 reuses any checkout
-you already have and only shallow-clones
+Requirements: an environment with `git` and the GitHub CLI (`gh`) **already
+authenticated** against `github.com`. Everything the skill does goes through
+`gh`/`git`, so a working `gh` is the only credential you need. Verify with `gh
+auth status`; if it is not green, give it an authenticated environment one of
+these ways before invoking the skill:
+
+- **Interactive:** run `gh auth login` (choose GitHub.com, HTTPS) once; the token
+  is cached for future runs.
+- **Token (CI / headless / sandbox):** export `GH_TOKEN` (or `GITHUB_TOKEN`) with
+  a PAT that has `repo` scope - e.g. `export GH_TOKEN=ghp_...` - then `gh auth
+  status` should pass. This is the way to hand a remote/agent runner a
+  `gh`-configured environment.
+- **Existing config:** if you already have a `~/.config/gh/hosts.yml` (or a host
+  with `gh` set up), run the skill from that shell/user so it inherits the auth.
+
+You do **not** need to pre-clone the repo - step 0 reuses any checkout you
+already have and only shallow-clones
 [NVIDIA-NeMo/Curator](https://github.com/NVIDIA-NeMo/Curator) if none is found.
 
 ## The reviewer prompt
@@ -49,6 +63,11 @@ review the smallest one first"*. Every review run must build (or refresh) the
 post-#1608 audio review corpus before findings (see
 [knowledge-sources.md](knowledge-sources.md) section 4).
 
+Run the prompt in an environment where `gh auth status` passes. For a
+remote/headless agent, say so explicitly, e.g. *"`gh` is authenticated via
+`GH_TOKEN` in this environment - review audio Curator PR <N> ..."* (see
+Requirements above for how to provide a `gh`-configured environment).
+
 ## What you produce
 
 You produce two things, **in this order**: (1) a **detailed overview of what the