diff --git a/.cursor/skills/review-curator-audio-pr/SKILL.md b/.cursor/skills/review-curator-audio-pr/SKILL.md
new file mode 100644
index 0000000000..4adb8ff4c1
--- /dev/null
+++ b/.cursor/skills/review-curator-audio-pr/SKILL.md
@@ -0,0 +1,257 @@
+---
+name: review-curator-audio-pr
+description: >-
+  Review someone else's NVIDIA-NeMo/Curator audio-modality pull request. This is
+  a reviewer's tool: given a PR number, it locates (or shallow-clones) the repo,
+  pulls the PR, builds the post-#1608 audio review corpus, diffs the changed
+  audio code, applies Curator's audio stage contracts and contribution standards,
+  shows what other reviewers have already raised (so you don't duplicate), and
+  helps you produce a structured set of review findings (P0-P3) to post as
+  review comments. Use when you are assigned
+  or pick up an audio Curator PR and need to review it (e.g. "review audio PR
+  1967", "do a review of this Curator audio PR", "what should I flag on PR
+  1898"). It is NOT for a PR author responding to a review.
+---
+
+# Review NeMo Curator audio PRs
+
+**Use this when you are the reviewer and need to review someone else's audio
+Curator PR.** Given a PR number, it helps you understand the change, see what
+other reviewers already said, and write your own review. It does not help a PR
+author reply to reviewers - it helps you *produce* the review.
+
+Scope: the **audio** modality only - code under `nemo_curator/stages/audio/`,
+`nemo_curator/tasks/audio_task.py`, `tutorials/audio/`, audio tests under
+`tests/stages/audio/`, and audio benchmarks. This skill is **audio-only**: if a
+PR touches no audio path, the pull step refuses to run (step 2) - review it
+with a modality-appropriate tool instead.
+
+Requirements: an environment with `git` and the GitHub CLI (`gh`) **already
+authenticated** against `github.com`. Everything the skill does goes through
+`gh`/`git`, so a working `gh` is the only credential you need. Verify with `gh
+auth status`; if it is not green, give it an authenticated environment one of
+these ways before invoking the skill:
+
+- **Interactive:** run `gh auth login` (choose GitHub.com, HTTPS) once; the token
+  is cached for future runs.
+- **Token (CI / headless / sandbox):** export `GH_TOKEN` (or `GITHUB_TOKEN`) with
+  a PAT that has `repo` scope - e.g. `export GH_TOKEN=ghp_...` - then `gh auth
+  status` should pass. This is the way to hand a remote/agent runner a
+  `gh`-configured environment.
+- **Existing config:** if you already have a `~/.config/gh/hosts.yml` (or a host
+  with `gh` set up), run the skill from that shell/user so it inherits the auth.
+
+You do **not** need to pre-clone the repo - step 0 reuses any checkout you
+already have and only shallow-clones
+[NVIDIA-NeMo/Curator](https://github.com/NVIDIA-NeMo/Curator) if none is found.
+
+## The reviewer prompt
+
+A human reviewer invokes this skill with a prompt like:
+
+> **"Review audio Curator PR <N> with the review-curator-audio-pr skill. Find or
+> shallow-clone the repo, pull the PR and prior review activity, build the
+> post-#1608 audio review corpus, read the knowledge sources, then give me a
+> detailed overview of what the PR does
+> followed by my own findings (P0-P3) with exact `path:line` evidence and
+> concrete fixes, ready to post as PR comments. Skip anything other reviewers
+> already raised."**
+
+Variants: *"re-review PR <N> and show only what changed since the last review"*,
+*"what should I flag on the diarization PR <N>?"*, *"triage open audio PRs and
+review the smallest one first"*. Every review run must build (or refresh) the
+post-#1608 audio review corpus before findings (see
+[knowledge-sources.md](knowledge-sources.md) section 4).
+
+Run the prompt in an environment where `gh auth status` passes. For a
+remote/headless agent, say so explicitly, e.g. *"`gh` is authenticated via
+`GH_TOKEN` in this environment - review audio Curator PR <N> ..."* (see
+Requirements above for how to provide a `gh`-configured environment).
+
+## What you produce
+
+You produce two things, **in this order**: (1) a **detailed overview of what the
+PR does and why** - written only after you have read the knowledge sources (the
+canonical docs and review lenses), so the reader understands the change before
+any critique - then (2) your **findings** (P0-P3), each tied to a `path:line` on
+the PR's current head, that you post as PR review comments. Never lead with
+review comments; ground yourself in the knowledge sources, then explain the PR,
+then give findings.
+
+The skill writes two helper files into a scratch directory (default
+`.curator-pr-review/`, override with `--outdir`) to support that:
+
+1. `curator_pr<N>_fresh_review_<YYYY_MM_DD>.md` - **your working digest**: leads
+   with a **"What this PR does (overview)"** section (the author's description,
+   an areas-touched table, and a plain-language summary you write), then PR
+   state, commits, changed-file table, and every existing review/comment grouped
+   by file with OPEN / OUTDATED / RESOLVED status, then a placeholder for your
+   findings. Fill in the overview first, then record your findings at the bottom.
+2. `curator_pr<N>_github_comment_queue_<YYYY_MM_DD>.md` - **prior open threads**:
+   a condensed list of the threads other reviewers left that are still
+   unresolved on the current head. Scan it before writing your own comments.
+
+See [templates.md](templates.md) for the exact layout and how to phrase your
+review comments. The review lenses and every doc/code reference live in
+[knowledge-sources.md](knowledge-sources.md).
+
+## Workflow
+
+```
+- [ ] 0. Locate or shallow-clone the Curator repo
+- [ ] 1. Identify the PR and confirm it touches audio paths
+- [ ] 2. Pull fresh GitHub data (gh)
+- [ ] 3. Build the post-#1608 audio review corpus (required)
+- [ ] 4. Read the diff
+- [ ] 5. Read the knowledge sources (canonical docs + review lenses + corpus) to ground yourself
+- [ ] 6. Generate the digest + prior-threads file for context
+- [ ] 7. Write the detailed PR overview - only after steps 4-5 - present this first
+- [ ] 8. Review through the lenses, write up findings (P0-P3), and post them after the overview
+```
+
+### Step 0 - Locate or shallow-clone the repo
+
+Don't clone unnecessarily. This helper checks whether you are already inside a
+Curator checkout, searches the current directory tree for one, and only then
+shallow-clones (`--depth 1`, no full history):
+
+```bash
+eval "$(scripts/ensure_repo.sh | tail -1)"   # sets CURATOR_REPO=<path>
+cd "$CURATOR_REPO"
+```
+
+If you are already in the checkout that contains this skill, you can skip this.
+
+### Step 1 - Identify the PR
+
+`gh pr view <N> --repo NVIDIA-NeMo/Curator`. This skill is **audio-only**: the
+pull step (step 2) aborts if the PR touches no audio path
+(`nemo_curator/stages/audio/`, `nemo_curator/tasks/audio_task.py`,
+`tutorials/audio/`, `tests/stages/audio/`, audio benchmarks). If it aborts, the
+PR is out of scope for this skill, so stop and do not review it here.
+
+### Step 2 - Pull fresh GitHub data
+
+```bash
+.cursor/skills/review-curator-audio-pr/scripts/pr_review_pull.sh <N>
+```
+
+Pulls six REST endpoints (`pr view`, `reviews`, inline `comments`, issue
+`comments`, `files`, `commits`) and the GraphQL review threads, which carry the
+`isResolved` / `isOutdated` flags the REST inline endpoint omits. You pull this
+to see prior review activity, not because you own the PR.
+
+### Step 3 - Build the post-#1608 audio review corpus (required)
+
+Before you explain or judge anything, build the consolidated reviewer-comment
+corpus so you can spot cross-PR repeated mistakes. Do **not** skip this step,
+even if a prior corpus file exists on disk - rerun the pull (incremental by
+default) and render a fresh consolidated file for today's review:
+
+```bash
+.cursor/skills/review-curator-audio-pr/scripts/pull_audio_pr_corpus.sh --since 1608
+.cursor/skills/review-curator-audio-pr/scripts/build_corpus.py --today <YYYY-MM-DD>
+```
+
+Confirm `.curator-pr-review/audio-corpus/audio_pr_corpus_<date>.md` exists and
+read it before writing findings. If the pull scripts fail, stop and fix the
+failure - do not proceed with a review that skipped the corpus. See
+[knowledge-sources.md](knowledge-sources.md) section 4 for details.
+
+### Step 4 - Read the diff
+
+```bash
+gh pr diff <N> --repo NVIDIA-NeMo/Curator        # works on a shallow clone
+# optional, to run/inspect locally: gh pr checkout <N>
+```
+
+`gh pr diff` fetches the patch straight from GitHub, so it works even on the
+shallow clone from step 0. Always review the actual diff, never the comment text
+alone. Use `main` as the baseline for "how the audio stages already do this".
+
+### Step 5 - Read the knowledge sources (ground yourself first)
+
+Before you explain or judge anything, read
+[knowledge-sources.md](knowledge-sources.md) so your overview and findings are
+grounded in the canonical material, not guesswork:
+
+- **section 0** - canonical docs: the audio stage developer guide
+  (`nemo_curator/stages/audio/README.md`), the developer-guide slides, the
+  published fern audio docs, `.cursor/rules/*.mdc`, and `CONTRIBUTING.md`.
+- **section 1** - the audio code map, so you know where each changed file sits.
+- **section 2** - the review lenses (you apply these in step 8).
+- **section 4 + corpus file** - the post-#1608 reviewer-comment corpus from step 3.
+
+Open the specific sources the diff touches (e.g. the fern page and the code for
+the stage being changed). Check the corpus for recurring themes and whether this
+PR repeats feedback already raised on other audio PRs. Only after this grounding
+do you write the overview (step 7).
+
+### Step 6 - Generate the context files
+
+```bash
+.cursor/skills/review-curator-audio-pr/scripts/build_digest.py <N> --today <YYYY-MM-DD>
+```
+
+The builder joins the pulled JSON and classifies each existing comment by thread
+status: **OPEN** (still unresolved), **OUTDATED** (pre-dates the current head),
+**RESOLVED**. Use the OPEN list to avoid re-raising points another reviewer
+already made.
+
+### Step 7 - Summarize the PR in detail (present this first)
+
+Now that you have read the diff (step 4) and the knowledge sources (step 5),
+write the **"What this PR does (overview)"** section of the digest so a reader
+understands the change end to end: the problem it solves; the
+main audio stages/files it adds or modifies (use the "Areas touched" table the
+builder generates); key design decisions; new dependencies, config, or APIs; and
+the blast radius (what could regress). Lead your review with this overview - the
+author and other reviewers read it before any findings.
+
+### Step 8 - Review through the lenses, then write up and post your findings
+
+Apply the lenses in [knowledge-sources.md](knowledge-sources.md) section 2 to the
+diff - each lens links the audio code, README section, and `.cursor/rules`
+contract it governs: stage contracts, setup/teardown lifecycle, audio
+optional-dependency hygiene (vLLM, NeMo, model utils), secret-safe logging,
+waveform/tensor memory and manifest serialization, tarred/sharded audio I/O,
+sample-rate and metadata propagation, streaming/throughput, tutorials/docs,
+tests/coverage, and PR reviewability. Cite the linked source by name whenever
+the PR deviates.
+
+Record each issue as a finding, classified by severity, and post them **after
+the overview** as a PR review (inline comments for specific lines, a top-level
+summary for the overall verdict). Each finding cites the exact `path:line-range`
+on the current head and proposes a concrete fix.
+
+- **P0** - merge blocker: data loss / crash on a valid audio config (e.g. a
+  manifest writer that crashes on a kept waveform), a secret (HF token) leaked
+  into logs, or an import that breaks a supported install.
+- **P1** - should fix before merge: undeclared audio runtime dependencies,
+  missing required tests/coverage, environment-specific I/O in a reusable
+  reader, a PR too large to review safely.
+- **P2** - worth fixing: duplicated pipeline construction, inefficient per-task
+  writer batching, incomplete throughput metrics, documented-but-unwired config.
+- **P3** - nice to have: temporary model/compat workarounds needing a test +
+  comment, debug-only knobs (e.g. per-shard utterance caps) to document.
+
+See [templates.md](templates.md) section C for how to phrase review comments.
+
+## What to check the PR against (CONTRIBUTING.md and rules)
+
+A compliant audio PR must satisfy these; flag any that are missing:
+
+- **DCO sign-off** on every commit (`git commit -s`).
+- **Tests** accompany source changes; GPU-only audio tests use
+  `@pytest.mark.gpu`; the project enforces 80% coverage under `nemo_curator/`,
+  and `tests/stages/audio/` mirrors `nemo_curator/stages/audio/`.
+- **NVIDIA Apache-2.0 copyright header** on every non-empty Python file.
+- **Ruff** lint/format, 119-char lines; **loguru** for logging.
+- Supported Python: 3.10-3.12.
+
+## Knowledge sources
+
+[knowledge-sources.md](knowledge-sources.md) is the single index: canonical docs
+(audio README, developer-guide slides, `.cursor/rules`, tutorials, benchmarks),
+the audio code map, the review lenses with per-concept code references, the
+post-#1608 PR corpus workflow, and the GitHub/`gh` data reference.
diff --git a/.cursor/skills/review-curator-audio-pr/knowledge-sources.md b/.cursor/skills/review-curator-audio-pr/knowledge-sources.md
new file mode 100644
index 0000000000..5624b0d189
--- /dev/null
+++ b/.cursor/skills/review-curator-audio-pr/knowledge-sources.md
@@ -0,0 +1,343 @@
+# Audio PR review knowledge sources
+
+Read this before reviewing. It is the single index for an audio Curator review
+and has five parts:
+
+0. **Canonical docs** - the authoritative references to ground every finding.
+1. **Audio code map** - where each audio stage lives.
+2. **Review lenses** - what to check, each linked to the audio code, README
+   section, and `.cursor/rules` contract it governs.
+3. **Severity mapping** - P0-P3.
+4. **Pre-review corpus (required)** - pull every audio PR after #1608 (open +
+   closed) with reviewer comments and read the consolidated file before you
+   write findings.
+5. **GitHub data + scripts reference** - the `gh` endpoints and helper scripts.
+
+All paths are repo-relative. The skill is self-contained: this repository plus
+the GitHub CLI (`gh`). Scope: the **audio** modality.
+
+---
+
+## 0. Canonical docs (start here)
+
+| Source | What it gives you |
+|--------|-------------------|
+| `nemo_curator/stages/audio/README.md` | The audio stage **developer guide** (1000+ lines): CPU vs GPU stages, `process` / `process_batch`, `batch_size`, call chains, per-stage memory characteristics, end-to-end FLEURS & ALM traces, new-stage checklist. Read first. |
+| Audio curation developer guide (slides): <https://docs.google.com/presentation/d/15lJHyAoRTbNFDWq8UJ6czMaUFck3WYPCRaMYPfO_qew/edit?slide=id.g3be823d0e3d_0_0#slide=id.g3be823d0e3d_0_0> | Design intent and pipeline overviews behind the audio modality (the deck the README operationalizes). |
+| `nemo_curator/tasks/audio_task.py` | `AudioTask(Task[dict])` + `_AttrDict` task model (one manifest entry per task; dict keys exposed as attributes for `validate_input`). |
+| `nemo_curator/stages/base.py` | `ProcessingStage` contract: `inputs()`/`outputs()`, `process`/`process_batch`, `setup_on_node`/`setup`/`teardown`, `validate_input`. |
+| `CONTRIBUTING.md` | DCO sign-off, PR process, required tests, coverage threshold, copyright header. |
+| `.cursor/rules/*.mdc` | Always-on engineering contracts (table below). |
+| `tutorials/audio/README.md` + per-pipeline READMEs | Runnable reference pipelines (alm, audio_pretrain, callhome_diar, fleurs, readspeech, single_speaker_filter, tagging). |
+| `benchmarking/README.md`, `benchmarking/AUDIO_PROFILING.md`, `benchmarking/ALM_BENCHMARK.md` | How audio/ALM benchmarks are wired and the perf expectations to hold a PR to. |
+| `tests/stages/audio/` | Test layout that mirrors `nemo_curator/stages/audio/`; the coverage bar. |
+
+### `.cursor/rules/` contracts
+
+| Rule | Covers |
+|------|--------|
+| `.cursor/rules/processing-stage-patterns.mdc` | `ProcessingStage` subclassing, `inputs()`/`outputs()`, `process()`, lifecycle hooks |
+| `.cursor/rules/composite-stage-patterns.mdc` | `CompositeStage` decomposition (e.g. a reader = discovery + shard fan-out) |
+| `.cursor/rules/task-patterns.mdc` | `Task` / typed task subclasses, metadata flow |
+| `.cursor/rules/executors.mdc` | Xenna / Ray Data backends and when to use each |
+| `.cursor/rules/resources-configuration.mdc` | `Resources`, `batch_size`, GPU allocation |
+| `.cursor/rules/pipeline-structure.mdc` | Pipeline assembly |
+| `.cursor/rules/modality-structure.mdc` | Where modality code/tests/tutorials live |
+| `.cursor/rules/coding-standards.mdc` | Ruff, copyright header, loguru, pytest, Python versions |
+
+### Published audio documentation (fern docs site)
+
+The user-facing audio docs live under `fern/versions/<ver>/pages/` (latest
+versioned snapshot: **v26.04**; `fern/versions/main/` holds in-progress docs).
+These are the canonical "what the docs promise" reference - cite them when a PR's
+behavior, config, or API drifts from documented behavior. Paths below are
+relative to `fern/versions/v26.04/pages/`.
+
+| Audio concept / stage | Fern doc(s) | Code it documents |
+|------------------------|-------------|-------------------|
+| Overview / get started | `get-started/audio.mdx`, `curate-audio/index.mdx` | - |
+| Concepts (task model, pipelines, metrics) | `about/concepts/audio/{index,audio-task,curation-pipeline,asr-pipeline,alm-pipeline,manifests-ingest,quality-metrics,text-integration}.mdx` | `nemo_curator/tasks/audio_task.py` |
+| Load data / manifests | `curate-audio/load-data/{index,fleurs-dataset,custom-manifests,local-files}.mdx` | `nemo_curator/stages/audio/datasets/`, `.../io/` |
+| ASR inference | `curate-audio/process-data/asr-inference/{index,nemo-models}.mdx` | `nemo_curator/stages/audio/inference/asr/asr_nemo.py` |
+| Audio analysis (duration, format) | `curate-audio/process-data/audio-analysis/{index,duration-calculation,format-validation}.mdx` | `nemo_curator/stages/audio/common.py` |
+| Quality filtering (band, VAD, SigMOS, UTMOS, speaker separation, preprocessing, data-filter) | `curate-audio/process-data/quality-filtering/{index,band-filter,vad,sigmos,utmos,speaker-separation,preprocessing,audio-data-filter-stage}.mdx` | `.../filtering/`, `.../segmentation/`, `.../preprocessing/`, `.../advanced_pipelines/` |
+| Quality assessment (WER, duration) | `curate-audio/process-data/quality-assessment/{index,wer-filtering,duration-filtering}.mdx` | `nemo_curator/stages/audio/metrics/wer.py`, `.../common.py` |
+| ALM (data builder, overlap filtering) | `curate-audio/process-data/alm/{index,data-builder,overlap-filtering}.mdx` | `nemo_curator/stages/audio/alm/` |
+| Text integration / tagging | `curate-audio/process-data/text-integration/index.mdx` | `nemo_curator/stages/audio/tagging/` |
+| Save / export | `curate-audio/save-export.mdx` | `nemo_curator/stages/audio/io/convert.py` |
+| Tutorials | `curate-audio/tutorials/{index,beginner,readspeech,alm}.mdx` | `tutorials/audio/` |
+| API reference | `api-reference/tasks/audio-task.mdx`, `api-reference/processing-stage.mdx` | `nemo_curator/tasks/audio_task.py`, `nemo_curator/stages/base.py` |
+| Infrastructure (memory, per-stage runtime, backends, GPU, resumable, monitoring) | `reference/infrastructure/{memory-management,per-stage-runtime,execution-backends,gpu-processing,resumable-processing,monitoring}.mdx` | `nemo_curator/backends/` |
+| Release notes / migration (`AudioBatch -> AudioTask`, #1608) | `about/release-notes/{index,migration-guide,migration-faq}.mdx` | see PR #1608 |
+
+When a PR changes audio behavior, check whether the matching fern page above
+needs updating in the same PR - docs drift is a common review finding.
+
+---
+
+## 1. Audio code map (diff ground truth)
+
+`nemo_curator/stages/audio/` is organized as:
+
+| Area | Path(s) |
+|------|---------|
+| Task model | `nemo_curator/tasks/audio_task.py` |
+| Base contract | `nemo_curator/stages/base.py` |
+| Common CPU stages (duration, value filter) | `nemo_curator/stages/audio/common.py` |
+| I/O (convert to DocumentBatch, segment extraction) | `nemo_curator/stages/audio/io/convert.py`, `nemo_curator/stages/audio/io/extract_segments.py` |
+| Preprocessing | `nemo_curator/stages/audio/preprocessing/mono_conversion.py`, `nemo_curator/stages/audio/preprocessing/concatenation.py` |
+| Inference - ASR | `nemo_curator/stages/audio/inference/asr/asr_nemo.py` |
+| Inference - VAD | `nemo_curator/stages/audio/inference/vad/whisperx_vad.py` |
+| Inference - speaker diarization | `nemo_curator/stages/audio/inference/speaker_diarization/sortformer.py`, `nemo_curator/stages/audio/inference/speaker_diarization/pyannote.py` |
+| Filtering (band, SigMOS, UTMOS) | `nemo_curator/stages/audio/filtering/band.py`, `nemo_curator/stages/audio/filtering/sigmos.py`, `nemo_curator/stages/audio/filtering/utmos.py` |
+| Segmentation | `nemo_curator/stages/audio/segmentation/vad_segmentation.py`, `nemo_curator/stages/audio/segmentation/speaker_separation.py` |
+| Tagging (inference + text) | `nemo_curator/stages/audio/tagging/inference/nemo_asr_align.py`, `nemo_curator/stages/audio/tagging/merge_alignment_diarization.py`, `nemo_curator/stages/audio/tagging/resample_audio.py`, `nemo_curator/stages/audio/tagging/split.py`, `nemo_curator/stages/audio/tagging/text/itn.py` |
+| Postprocessing | `nemo_curator/stages/audio/postprocessing/timestamp_mapper.py` |
+| Metrics / throughput | `nemo_curator/stages/audio/metrics/wer.py`, `nemo_curator/stages/audio/metrics/bandwidth.py`, `nemo_curator/stages/audio/metrics/squim.py` |
+| Datasets (FLEURS, ReadSpeech) | `nemo_curator/stages/audio/datasets/fleurs/create_initial_manifest.py`, `nemo_curator/stages/audio/datasets/readspeech/create_initial_manifest.py`, `nemo_curator/stages/audio/datasets/file_utils.py` |
+| ALM (audio-language model data) | `nemo_curator/stages/audio/alm/alm_data_builder.py`, `nemo_curator/stages/audio/alm/alm_data_overlap.py`, `nemo_curator/stages/audio/alm/pretrain/pipeline.py`, `nemo_curator/stages/audio/alm/pretrain/io.py` |
+| Composite pipelines | `nemo_curator/stages/audio/advanced_pipelines/audio_data_filter/audio_data_filter.py` |
+
+Backends (how stages run): `nemo_curator/backends/` (`base.py`, `xenna/`,
+`ray_data/`, `ray_actor_pool/`); pipeline assembly:
+`nemo_curator/pipeline/pipeline.py`; task types: `nemo_curator/tasks/`.
+
+---
+
+## 2. Review lenses (each with code + doc references)
+
+Apply each lens to the diff and cite exact `path:line` evidence. When a change
+violates a contract, link it by name. The **References** under each lens are the
+canonical sources to cite.
+
+### 2.1 Stage contracts and reuse
+
+- An audio stage subclasses `ProcessingStage[AudioTask, AudioTask]` (or the
+  appropriate in/out task types) with explicit `inputs()` / `outputs()`, a
+  meaningful `name`, and appropriate `resources` / `batch_size` (set as class
+  attributes / dataclass fields, never the read-only `_name` / `_resources` /
+  `_batch_size` properties). Field order follows the base convention: `name`
+  first, then params, then `resources`, then `batch_size` last.
+- GPU/model work (ASR, VAD, diarization, tagging inference) belongs in
+  `process_batch()` with a real `batch_size`; `__init__` stays lightweight (it
+  runs on the driver). Batch-only stages raise `NotImplementedError` from
+  `process()` (the convention used by dedup stages).
+- Reuse existing audio readers, manifest writers, and shared model abstractions
+  instead of bespoke orchestration. Model a tarred/sharded reader as a
+  `CompositeStage` = a discovery stage + a shard-reader fan-out.
+- If `inputs()` does not declare a required key (e.g. `audio_filepath_key`,
+  `duration_key`), runtime validation won't check it, and a direct
+  `task.data[key]` access then raises an opaque `KeyError` instead of a clear
+  validation error. Declare required keys.
+- Fan-out shard readers need deterministic child IDs and must propagate
+  `_metadata` / `_stage_perf`; declare `ray_stage_spec()` with
+  `IS_FANOUT_STAGE: True`. First-stage shard discovery from an empty input
+  should constrain placement (e.g. one worker per node) so discovery isn't
+  duplicated across workers.
+- **References:** `nemo_curator/stages/base.py`; `nemo_curator/tasks/audio_task.py`; `nemo_curator/stages/audio/common.py`; `nemo_curator/stages/audio/io/convert.py`; `nemo_curator/stages/audio/datasets/fleurs/create_initial_manifest.py`; rules `processing-stage-patterns.mdc`, `composite-stage-patterns.mdc`, `task-patterns.mdc`; README sections "Writing a CPU stage" / "Writing a GPU stage".
+
+### 2.2 Setup / teardown lifecycle and audio dependencies
+
+- `setup_on_node()` = once-per-node prep (model download / shared cache / output
+  truncation). `setup()` = per-worker init (model load, GPU allocation).
+  `teardown()` frees GPU memory. Don't conflate them; prefer `setup_on_node` for
+  anything that must run exactly once (avoids `_setup_done`-style guards).
+- Import heavy/optional audio deps (NeMo, vLLM, whisperx, pyannote, soundfile,
+  model-specific utils) lazily inside `setup()` or behind a guarded import,
+  never unconditionally at module top level. A top-level `import <optional_pkg>`
+  breaks `import nemo_curator.stages.audio...` on any install without that extra
+  (including CPU-only / Mac/ARM).
+- Declare the full audio runtime dependency set in the appropriate
+  `pyproject.toml` optional extra (e.g. a CUDA extra); add an import smoke test.
+  Don't rely on an external Docker image to supply deps a reusable stage needs.
+- Avoid hard version pins (`pkg==x.y.z`) and global overrides for libraries
+  shared across modalities (`transformers`, `huggingface-hub`, `accelerate`);
+  prefer compatible ranges.
+- Expose model knobs like `cache_dir` and `trust_remote_code` as parameters
+  rather than hardcoding them.
+- **References:** `nemo_curator/stages/audio/inference/asr/asr_nemo.py` (download-on-node / load-on-worker split, `cache_dir`); `nemo_curator/stages/audio/inference/vad/whisperx_vad.py`; `nemo_curator/stages/audio/inference/speaker_diarization/sortformer.py`, `pyannote.py`; `nemo_curator/stages/audio/filtering/{utmos,sigmos,band}.py`; `pyproject.toml`; rules `resources-configuration.mdc`; README "Stage lifecycle".
+
+### 2.3 Secret-safe logging
+
+- Never log a full resolved config; audio tutorials commonly accept an HF token
+  / cloud credentials. Redact secret-like keys before logging (token, password,
+  secret, credentials, `*_token`/`*_secret`), or log only a non-secret subset,
+  and add a test asserting the secret value never appears in logged output.
+- **References:** `tutorials/audio/*/` (token/credential inputs); any stage that
+  logs configuration; rules `coding-standards.mdc` (loguru).
+
+### 2.4 Waveform/tensor memory and manifest serialization
+
+- Drop waveform arrays / audio tensors from `task.data` once a stage no longer
+  needs them. A leaked numpy `ndarray` / torch tensor that reaches a JSON
+  manifest writer raises `TypeError: Object of type ndarray is not JSON
+  serializable` and can fail an entire shard/batch, or cause OOM at scale. Strip
+  by waveform key plus a duck-typed guard (`.shape`/`.dtype`), and wrap
+  `json.dumps` so a stray non-serializable value reports the offending key.
+- ALM stages that replace the whole dict must `task.data.clear()` +
+  `task.data.update(result)` to preserve the `_AttrDict` type. Remember
+  `_AttrDict.__setattr__` routes writes into the dict, so never stamp private
+  fields onto a task's data dict.
+- Guard audio I/O edge cases: `tarfile.extractfile()` returns `None` for
+  non-regular members even after `isfile()` - check before `.read()`. Handle
+  resampling / channel-count / dtype mismatches explicitly.
+- **References:** `nemo_curator/stages/audio/io/convert.py`; `nemo_curator/stages/audio/io/extract_segments.py`; `nemo_curator/stages/audio/alm/alm_data_builder.py`; `nemo_curator/stages/audio/preprocessing/{mono_conversion,concatenation}.py`; `nemo_curator/tasks/audio_task.py`; README "Memory characteristics"; fern `reference/infrastructure/{memory-management,per-stage-runtime}.mdx`.
+
+### 2.5 Tarred / sharded audio I/O
+
+- Keep infrastructure assumptions (specific object-store URLs, auth env vars,
+  cluster-local paths) out of the generic reader/writer. Make remote access
+  pluggable: accept local paths, ordinary cloud URIs, or a user-provided
+  opener/resolver, and isolate environment-specific support behind a small,
+  documented, optional component. For piped/streamed downloads use failure-aware
+  flags + retries so truncated streams fail loudly.
+- Cache the fsspec filesystem object in `setup_on_node` rather than re-creating
+  it per `process()` call (a new HTTP connection per manifest entry is costly on
+  S3/GCS at scale).
+- Manifest/shard writers that loop per task inside `process_batch()` reopen the
+  output repeatedly; group by shard and open each output once per batch. Preserve
+  `.done`/checkpoint markers so runs are resumable.
+- **References:** `nemo_curator/stages/audio/io/extract_segments.py`; `nemo_curator/stages/audio/datasets/{fleurs,readspeech}/create_initial_manifest.py`; `nemo_curator/stages/audio/datasets/file_utils.py`; `nemo_curator/stages/audio/alm/pretrain/io.py`; rules `composite-stage-patterns.mdc`.
+
+### 2.6 Sample-rate and metadata propagation
+
+- Resampling/mono-conversion/segmentation stages must update the sample-rate and
+  duration keys they change and propagate `_metadata` / `_stage_perf`; downstream
+  stages that assume a fixed sample rate must declare it in `inputs()`.
+- **References:** `nemo_curator/stages/audio/preprocessing/mono_conversion.py`; `nemo_curator/stages/audio/tagging/resample_audio.py`; `nemo_curator/stages/audio/segmentation/vad_segmentation.py`; `nemo_curator/stages/audio/postprocessing/timestamp_mapper.py`.
+
+### 2.7 Streaming, throughput, benchmarking
+
+- Streaming execution with per-stage resource configuration is the default model
+  for scalable audio pipelines.
+- Emit stage metrics through the supported metrics path, JSON-shaped. For
+  multi-node audio runs include worker/node/GPU identity and latency
+  distributions (p50/p95/p99) plus audio-seconds-per-wall-second, not just
+  totals/averages.
+- Standalone audio benchmark scripts belong in the `benchmarking/` flow with
+  config entries and comparable parameters - not freestanding.
+- **References:** `nemo_curator/stages/audio/metrics/{wer,bandwidth,squim}.py`; `benchmarking/AUDIO_PROFILING.md`, `benchmarking/ALM_BENCHMARK.md`, `benchmarking/README.md`; rules `executors.mdc`; fern `reference/infrastructure/{execution-backends,per-stage-runtime,monitoring,gpu-processing}.mdx`.
+
+### 2.8 Tutorials and docs
+
+- Audio tutorials (`tutorials/audio/`) must be runnable, minimal, and use public
+  APIs only - no private APIs, no machine-local absolute paths, no committed
+  noisy notebook output, no executor mismatch between a notebook and its CLI.
+- Don't duplicate pipeline-construction logic between a tutorial entrypoint and
+  an `examples/` script; extract one shared builder both call.
+- Every config key documented for users must actually reach a stage; validate
+  config-to-stage mapping and fail (or warn) on unused keys.
+- Don't commit generated analysis/scratch docs into the source tree.
+- **References:** `tutorials/audio/README.md`; `tutorials/audio/{alm,audio_pretrain,callhome_diar,fleurs,readspeech,single_speaker_filter,tagging}/README.md`; `tutorials/audio/readspeech/readspeech_tutorial.ipynb`; `nemo_curator/stages/audio/README.md`; fern `curate-audio/tutorials/{index,beginner,readspeech,alm}.mdx`.
+
+### 2.9 Tests and standards (CONTRIBUTING + coding-standards)
+
+- Source changes need accompanying tests; `tests/stages/audio/` mirrors
+  `nemo_curator/stages/audio/`. The project enforces 80% coverage under
+  `nemo_curator/`. GPU-only audio tests use `@pytest.mark.gpu`; reuse fixtures
+  under `tests/fixtures/audio/` and sample data under `tests/data/audio/`.
+- Every non-empty Python file carries the NVIDIA Apache-2.0 copyright header.
+- Ruff clean at 119-char lines; type annotations on functions; loguru for logs.
+- Every commit is DCO signed-off.
+- **References:** `CONTRIBUTING.md`; rules `coding-standards.mdc`, `modality-structure.mdc`; `tests/stages/audio/`.
+
+### 2.10 PR size and reviewability
+
+- Large multi-stage audio PRs are hard to review and defend. Prefer splitting by
+  stage / module / tutorial / benchmark. If a single PR must remain, organize it
+  as if sliced: no duplicated builder logic, clear module ownership, a reviewable
+  commit sequence.
+
+### Smaller recurring nits
+
+- Missing copyright headers on new files.
+- Top-level imports that should be lazy/guarded (very common on audio model modules).
+- Long inline prompt/config strings that should be module constants.
+- Dataclass fields validated in `__post_init__` that could just be typed.
+- Dead conditionals / redundant truthiness checks.
+- Pytests added for tutorials that don't warrant them.
+- Debug-only knobs (e.g. `max_utterances_per_shard`) presented as features - document them as debug throttles.
+- `trust_remote_code=True` hardcoded on a model load - expose as a parameter.
+
+---
+
+## 3. Severity mapping
+
+| Severity | Meaning |
+|----------|---------|
+| P0 | Merge blocker: crash/data-loss on a valid audio config, secret leaked to logs, import breaks a supported install |
+| P1 | Fix before merge: undeclared audio deps, missing required tests/coverage, env-specific I/O in reusable reader, PR too large |
+| P2 | Should fix: duplicated pipeline construction, inefficient writer batching, incomplete throughput metrics, documented-but-unwired config |
+| P3 | Nice to have: temporary model/compat workaround needing test+comment, debug knob docs |
+
+---
+
+## 4. Pre-review corpus (required): learn from post-#1608 audio PRs
+
+PR [#1608](https://github.com/NVIDIA-NeMo/Curator/pull/1608) (`AudioBatch ->
+AudioTask` redesign) reset the audio stage contracts. Reviewer feedback on audio
+PRs **after** #1608 is the best predictor of what to flag next. **Before every
+audio PR review**, build (or refresh) the consolidated corpus of that feedback.
+Do not skip this step, even when a prior `audio_pr_corpus_*.md` already exists
+on disk - rerun the pull (incremental by default) and render today's file:
+
+```bash
+# 1) discover audio PRs after #1608 (open + closed/merged) and pull their
+#    reviews + inline comments into .curator-pr-review/audio-corpus/
+.cursor/skills/review-curator-audio-pr/scripts/pull_audio_pr_corpus.sh --since 1608
+
+# 2) render one consolidated, reviewer-comment corpus
+.cursor/skills/review-curator-audio-pr/scripts/build_corpus.py
+```
+
+`pull_audio_pr_corpus.sh` lists every PR with number > `--since` (PR numbers are
+monotonic in time, so number > 1608 == opened after #1608), keeps the ones whose
+changed files touch audio paths, and pulls each one's reviews, inline comments,
+and issue comments. It is incremental - reruns skip PRs already on disk and only
+fetch new ones (use `--refresh` to re-pull, e.g. to refresh open PRs).
+`build_corpus.py` writes
+`.curator-pr-review/audio-corpus/audio_pr_corpus_<date>.md`: one section per
+audio PR (number, title, state, author, link) with every reviewer comment
+verbatim, anchored to `path:line`, plus a recurring-themes tally.
+
+Use the corpus to (a) recognize patterns reviewers repeatedly raise (the lenses
+in section 2 came from exactly this), and (b) check whether the PR in front of
+you repeats a mistake already called out elsewhere. It is read-only context; it
+never auto-posts anything. If the corpus scripts fail or the consolidated file
+is missing, stop the review and fix the failure before writing findings.
+
+---
+
+## 5. GitHub data + scripts reference
+
+`scripts/ensure_repo.sh [CLONE_DIR]` - reuse an existing Curator checkout or
+shallow-clone one; prints `CURATOR_REPO=<path>` on its last line.
+
+`scripts/pr_review_pull.sh <N> [--outdir DIR] [--repo OWNER/REPO]` pulls into the
+scratch outdir:
+
+| File | Endpoint |
+|------|----------|
+| `pr<N>_gh_latest.json` | `gh pr view <N> --json ...` |
+| `pr<N>_reviews_latest.json` | `repos/:owner/:repo/pulls/<N>/reviews` |
+| `pr<N>_review_comments_latest.json` | `repos/:owner/:repo/pulls/<N>/comments` (inline) |
+| `pr<N>_issue_comments_latest.json` | `repos/:owner/:repo/issues/<N>/comments` |
+| `pr<N>_files_latest.json` | `repos/:owner/:repo/pulls/<N>/files` |
+| `pr<N>_commits_latest.json` | `repos/:owner/:repo/pulls/<N>/commits` |
+| `pr<N>_review_threads_latest.json` | GraphQL `pullRequest.reviewThreads` (isResolved/isOutdated) |
+
+The REST inline-comments endpoint does not expose resolve/outdate state; the
+GraphQL thread payload does. The builder joins them by comment `databaseId`
+(== REST `id`), falling back to a `(path, body-prefix)` match when an older
+thread dump lacks `databaseId`.
+
+`scripts/build_digest.py <N> [--outdir DIR] [--today YYYY-MM-DD] [--prev-head SHA] [--baseline-ts TS]`
+renders the working digest + the prior-open-threads context file (you add your
+own findings).
+
+`scripts/pull_audio_pr_corpus.sh [--since N] [--outdir DIR] [--repo OWNER/REPO] [--limit N] [--refresh]`
+(incremental - skips PRs already on disk; `--refresh` re-pulls) and
+`scripts/build_corpus.py [--outdir DIR] [--today YYYY-MM-DD]` **must** run before
+every review to build the post-#1608 corpus (section 4; SKILL.md step 3).
+
+Default outdir: `.curator-pr-review/` (scratch; gitignored, safe to delete).
diff --git a/.cursor/skills/review-curator-audio-pr/scripts/README.md b/.cursor/skills/review-curator-audio-pr/scripts/README.md
new file mode 100644
index 0000000000..dd40c15ef0
--- /dev/null
+++ b/.cursor/skills/review-curator-audio-pr/scripts/README.md
@@ -0,0 +1,78 @@
+# Scripts - review-curator-audio-pr
+
+Helpers for a **reviewer** reviewing someone else's audio Curator PR. They pull
+the PR (and prior audio review history) and summarise it as context; you write
+the actual review findings. All require the GitHub CLI (`gh`) authenticated
+against github.com, and write to a scratch directory (default
+`.curator-pr-review/`) that is safe to delete; do not commit its contents.
+
+## ensure_repo.sh
+
+```bash
+eval "$(.cursor/skills/review-curator-audio-pr/scripts/ensure_repo.sh | tail -1)"  # CURATOR_REPO=<path>
+```
+
+Avoids cloning the whole repo unnecessarily: reuses an existing checkout if you
+are already inside one or there is one under the current directory tree
+(bounded by `MAXDEPTH`, default 4); only then `git clone --depth 1` (no full
+history) from <https://github.com/NVIDIA-NeMo/Curator>. Prints
+`CURATOR_REPO=<absolute path>` on the last line.
+
+## pr_review_pull.sh
+
+```bash
+.cursor/skills/review-curator-audio-pr/scripts/pr_review_pull.sh <PR_NUMBER> [--outdir DIR] [--repo OWNER/REPO]
+```
+
+Pulls six REST endpoints (`pr view`, `reviews`, inline `comments`, issue
+`comments`, `files`, `commits`) plus the GraphQL review threads into
+`pr<N>_*_latest.json` (and timestamped snapshots for delta analysis).
+**Audio-only:** it pulls metadata + files first and aborts (exit 3) if the PR
+touches no audio path, so the skill never reviews a non-audio PR.
+
+## build_digest.py
+
+```bash
+.cursor/skills/review-curator-audio-pr/scripts/build_digest.py <PR_NUMBER> [--outdir DIR] [--today YYYY-MM-DD] [--prev-head SHA] [--baseline-ts TS]
+```
+
+Joins the `pr<N>_*_latest.json` files and writes two context files:
+`curator_pr<N>_fresh_review_<date>.md` (working digest: PR state, diff, and
+existing reviews/comments with OPEN/OUTDATED/RESOLVED status, plus a placeholder
+for your findings) and `curator_pr<N>_github_comment_queue_<date>.md` (the
+still-open threads other reviewers left, so you don't duplicate them).
+
+- `--prev-head` sets the SHA in the "commits since prior reviewed head" header.
+- `--baseline-ts <TS>` (a prior pull's timestamp suffix) marks
+  comments/reviews/commits NEW vs already-seen.
+
+The thread join uses comment `databaseId` when present, else a
+(path, body-prefix) fallback, so both GraphQL thread-dump shapes classify
+correctly. It also re-checks the audio-only guard and aborts on a non-audio PR.
+
+## pull_audio_pr_corpus.sh + build_corpus.py (required pre-review corpus)
+
+Run these **before every audio PR review** (SKILL.md step 3). Do not skip even
+when `.curator-pr-review/audio-corpus/` already has data - the pull is
+incremental and `build_corpus.py` must render today's consolidated file.
+
+```bash
+.cursor/skills/review-curator-audio-pr/scripts/pull_audio_pr_corpus.sh --since 1608
+.cursor/skills/review-curator-audio-pr/scripts/build_corpus.py --today <YYYY-MM-DD>
+```
+
+`pull_audio_pr_corpus.sh` lists every PR with number > `--since` (PR numbers are
+monotonic in time, so this is "opened after that PR"), keeps the ones touching
+audio paths, and pulls each one's reviews + inline + issue comments into
+`.curator-pr-review/audio-corpus/`. It is **incremental**: PRs already pulled are
+skipped (and non-audio candidates are remembered in `_non_audio_prs.txt`), so a
+rerun only fetches PRs new since last time. Pass `--refresh` to re-pull
+everything (e.g. to pick up new comments on still-open PRs).
+`build_corpus.py` consolidates them into
+`audio_pr_corpus_<date>.md`: one section per audio PR with every reviewer comment
+verbatim (anchored to path:line) plus a recurring-themes tally. Read-only
+context; it posts nothing. See `../knowledge-sources.md` section 4.
+
+The single-PR scripts are **audio-only**: both `pr_review_pull.sh` and
+`build_digest.py` abort on a PR that touches no audio path. The audio review
+lenses live in `../knowledge-sources.md`.
diff --git a/.cursor/skills/review-curator-audio-pr/scripts/build_corpus.py b/.cursor/skills/review-curator-audio-pr/scripts/build_corpus.py
new file mode 100755
index 0000000000..2df94e3e78
--- /dev/null
+++ b/.cursor/skills/review-curator-audio-pr/scripts/build_corpus.py
@@ -0,0 +1,159 @@
+#!/usr/bin/env python3
+"""Consolidate the audio PR corpus pulled by pull_audio_pr_corpus.sh.
+
+Reads the per-PR JSON in the corpus dir and writes one markdown file with every
+reviewer comment (verbatim, anchored to path:line) grouped by PR, plus a
+recurring-themes keyword tally. Read-only context for pre-review; posts nothing.
+
+Usage: build_corpus.py [--outdir DIR] [--today YYYY-MM-DD]
+"""
+from __future__ import annotations
+
+import argparse
+import datetime as dt
+import json
+import re
+from pathlib import Path
+
+BOT_LOGINS = {"greptile-apps[bot]", "copy-pr-bot[bot]", "github-actions[bot]"}
+
+# (label, regex) recurring-theme buckets, mirroring knowledge-sources.md lenses.
+THEMES = [
+    ("setup/teardown lifecycle", r"setup_on_node|\bsetup\(|teardown|_setup_done"),
+    ("optional/lazy imports", r"top[- ]level import|lazy import|import .* fails|optional (dep|extra)"),
+    ("dependency declaration/pins", r"pyproject|optional[- ]?group|==|version pin|requirement"),
+    ("stage contract inputs/outputs", r"inputs\(\)|outputs\(\)|validate_input|NotImplementedError"),
+    ("batch_size / process_batch", r"batch_size|process_batch"),
+    ("memory / serialization", r"ndarray|json\.dumps|serializ|waveform|tensor|OOM|memory"),
+    ("fsspec / cloud I/O", r"fsspec|url_to_fs|s3|gcs|http"),
+    ("secrets / logging", r"token|secret|credential|password|redact"),
+    ("tests / coverage", r"\btest|coverage|pytest|fixture"),
+    ("copyright / lint", r"copyright|header|ruff|lint"),
+    ("naming / convention", r"naming|rename|convention|AudioTask|AudioBatch"),
+    ("trust_remote_code", r"trust_remote_code"),
+]
+
+
+def load(p: Path):
+    return json.loads(p.read_text()) if p.exists() else []
+
+
+def shorten(s: str, n: int = 1200) -> str:
+    s = (s or "").strip()
+    return s if len(s) <= n else s[:n] + " […]"
+
+
+def main() -> None:
+    ap = argparse.ArgumentParser()
+    ap.add_argument("--outdir", default=".curator-pr-review/audio-corpus")
+    ap.add_argument("--today", default=dt.date.today().isoformat())
+    args = ap.parse_args()
+
+    outdir = Path(args.outdir)
+    nums_file = outdir / "_audio_pr_numbers.txt"
+    if not nums_file.exists():
+        raise SystemExit(f"no {nums_file}; run pull_audio_pr_corpus.sh first")
+    numbers = [int(x) for x in nums_file.read_text().split() if x.strip()]
+    numbers.sort(reverse=True)
+
+    date_us = args.today.replace("-", "_")
+    theme_counts = {label: 0 for label, _ in THEMES}
+    theme_rx = [(label, re.compile(rx, re.I)) for label, rx in THEMES]
+
+    out: list[str] = []
+    out.append(f"# Audio PR review corpus (post-#1608) - {args.today}\n")
+    out.append(
+        "Consolidated reviewer feedback on audio PRs opened after #1608 "
+        "(open + closed/merged). Read-only pre-review context: recognise the "
+        "patterns reviewers repeatedly raise, and check the PR in front of you "
+        "against them. Bot reviewers are marked `[bot]`.\n"
+    )
+    out.append(f"Audio PRs in corpus: **{len(numbers)}** "
+               f"({', '.join('#' + str(n) for n in numbers)})\n")
+
+    per_pr_sections: list[str] = []
+    total_comments = 0
+    for n in numbers:
+        gh = load(outdir / f"pr{n}_gh.json")
+        if isinstance(gh, list):
+            gh = gh[0] if gh else {}
+        reviews = load(outdir / f"pr{n}_reviews.json")
+        rcomments = load(outdir / f"pr{n}_review_comments.json")
+        icomments = load(outdir / f"pr{n}_issue_comments.json")
+
+        author = (gh.get("author") or {}).get("login", "?")
+        state = gh.get("state", "?")
+        title = gh.get("title", "")
+        url = gh.get("url", f"https://github.com/NVIDIA-NeMo/Curator/pull/{n}")
+
+        sec: list[str] = []
+        sec.append(f"## PR #{n} - {title}\n")
+        sec.append(f"- state: **{state}**  author: @{author}  "
+                   f"created: {gh.get('createdAt','?')}  link: {url}\n")
+
+        # Review summaries (non-empty bodies only).
+        rev_bodies = [r for r in reviews if (r.get("body") or "").strip()]
+        if rev_bodies:
+            sec.append("### Review summaries\n")
+            for r in rev_bodies:
+                login = (r.get("user") or {}).get("login", "?")
+                bot = " `[bot]`" if login in BOT_LOGINS else ""
+                sec.append(f"- **@{login}{bot}** [{r.get('state','')}] "
+                           f"{r.get('submitted_at','')}:\n")
+                sec.append(f"  > {shorten(r.get('body'))}\n")
+                for label, rx in theme_rx:
+                    if rx.search(r.get("body") or ""):
+                        theme_counts[label] += 1
+
+        # Inline comments grouped by file.
+        by_file: dict[str, list] = {}
+        for c in rcomments:
+            by_file.setdefault(c.get("path", "?"), []).append(c)
+        if by_file:
+            sec.append("### Inline review comments\n")
+            for path in sorted(by_file):
+                sec.append(f"#### `{path}`\n")
+                for c in sorted(by_file[path],
+                                key=lambda x: (x.get("line") or x.get("original_line") or 0)):
+                    login = (c.get("user") or {}).get("login", "?")
+                    bot = " `[bot]`" if login in BOT_LOGINS else ""
+                    line = c.get("line") or c.get("original_line") or "?"
+                    body = c.get("body") or ""
+                    total_comments += 1
+                    sec.append(f"- **@{login}{bot}** line {line} "
+                               f"([link]({c.get('html_url','')})):\n")
+                    sec.append(f"  > {shorten(body)}\n")
+                    for label, rx in theme_rx:
+                        if rx.search(body):
+                            theme_counts[label] += 1
+
+        # Top-level issue comments (skip pure CI noise).
+        human_ic = [c for c in icomments
+                    if (c.get("user") or {}).get("login") not in BOT_LOGINS
+                    and (c.get("body") or "").strip()]
+        if human_ic:
+            sec.append("### Discussion (top-level)\n")
+            for c in human_ic:
+                login = (c.get("user") or {}).get("login", "?")
+                sec.append(f"- **@{login}** {c.get('created_at','')}: "
+                           f"{shorten(c.get('body'), 600)}\n")
+
+        per_pr_sections.append("\n".join(sec))
+
+    # Theme tally up top.
+    out.append("## Recurring themes (comment hits across the corpus)\n")
+    out.append("| Theme | Comments mentioning it |")
+    out.append("|-------|------------------------|")
+    for label, _ in THEMES:
+        out.append(f"| {label} | {theme_counts[label]} |")
+    out.append(f"\nTotal inline review comments scanned: **{total_comments}**\n")
+    out.append("---\n")
+    out.extend(per_pr_sections)
+
+    outpath = outdir / f"audio_pr_corpus_{date_us}.md"
+    outpath.write_text("\n".join(out) + "\n", encoding="utf-8")
+    print(f"wrote {outpath}  ({outpath.stat().st_size} bytes; {len(numbers)} PRs)")
+
+
+if __name__ == "__main__":
+    main()
diff --git a/.cursor/skills/review-curator-audio-pr/scripts/build_digest.py b/.cursor/skills/review-curator-audio-pr/scripts/build_digest.py
new file mode 100755
index 0000000000..7b920acd18
--- /dev/null
+++ b/.cursor/skills/review-curator-audio-pr/scripts/build_digest.py
@@ -0,0 +1,353 @@
+#!/usr/bin/env python3
+"""Build a Curator PR fresh-review digest + open-comment queue.
+
+Usage: build_digest.py <PR_NUMBER> [--outdir DIR] [--today YYYY-MM-DD]
+                        [--prev-head SHA] [--baseline-ts TS]
+
+Reads the pr<N>_*_latest.json files written by pr_review_pull.sh from --outdir
+(default .curator-pr-review) and writes into the same directory:
+  curator_pr<N>_fresh_review_<date>.md          full digest
+  curator_pr<N>_github_comment_queue_<date>.md  open-thread paste queue
+
+The GraphQL thread payload is joined to REST inline comments by `databaseId`
+when present, otherwise by a (path, body-prefix) match, so both thread-dump
+shapes classify correctly. Pass --baseline-ts <TS> (a prior pull's timestamp
+suffix) to mark comments / reviews / commits NEW vs already-seen.
+"""
+from __future__ import annotations
+
+import argparse
+import datetime as dt
+import json
+import re
+from pathlib import Path
+
+BODY_KEY_LEN = 120
+
+# This skill is audio-only; a PR must touch at least one of these paths.
+AUDIO_RE = re.compile(
+    r"^(nemo_curator/stages/audio/|nemo_curator/tasks/audio_task\.py|"
+    r"tutorials/audio/|tests/stages/audio/|tests/tasks/test_audio|"
+    r"benchmarking/.*([Aa]udio|ALM|alm))"
+)
+
+
+def load(path: Path) -> object:
+    return json.loads(path.read_text())
+
+
+def load_baseline_ids(baseline, key):
+    if baseline is None or not baseline.exists():
+        return set()
+    return {entry[key] for entry in load(baseline) if key in entry}
+
+
+def build_thread_index(threads_payload):
+    """Return {by_dbid, by_pathbody, nthreads}, robust to both GraphQL shapes."""
+    idx = {"by_dbid": {}, "by_pathbody": {}, "nthreads": 0}
+    try:
+        nodes = threads_payload["data"]["repository"]["pullRequest"]["reviewThreads"]["nodes"]
+    except (KeyError, TypeError):
+        return idx
+    idx["nthreads"] = len(nodes)
+    for thread in nodes:
+        cnodes = thread["comments"]["nodes"]
+        path = thread.get("path")
+        for pos, c in enumerate(cnodes):
+            meta = {
+                "thread_id": thread.get("id"),
+                "is_resolved": thread.get("isResolved", False),
+                "is_outdated": thread.get("isOutdated", False),
+                "thread_comment_count": len(cnodes),
+                "is_first_in_thread": pos == 0,
+            }
+            db = c.get("databaseId")
+            if db is not None:
+                idx["by_dbid"][int(db)] = meta
+            body = (c.get("body") or "").strip()
+            if body:
+                idx["by_pathbody"].setdefault((path, body[:BODY_KEY_LEN]), meta)
+    return idx
+
+
+def thread_meta(idx, c):
+    m = idx["by_dbid"].get(c["id"])
+    if m is not None:
+        return m
+    body = (c.get("body") or "").strip()
+    return idx["by_pathbody"].get((c.get("path"), body[:BODY_KEY_LEN]))
+
+
+def shorten(body, n=600):
+    body = (body or "").strip()
+    if len(body) <= n:
+        return body
+    return body[:n].rstrip() + "\n[...truncated...]"
+
+
+def status_of(meta):
+    if meta is None:
+        return "ORPHAN"
+    if meta["is_resolved"]:
+        return "RESOLVED"
+    if meta["is_outdated"]:
+        return "OUTDATED"
+    return "OPEN"
+
+
+def p(parts, *lines):
+    parts.extend(lines)
+
+
+def area_of(path):
+    """Group a changed file into a coarse audio area for the overview table."""
+    parts = path.split("/")
+    if path.startswith("nemo_curator/stages/audio/") and len(parts) >= 4:
+        return f"stages/audio/{parts[3]}"
+    if path.startswith("nemo_curator/tasks/"):
+        return "tasks"
+    if path.startswith("tutorials/audio/") and len(parts) >= 3:
+        return f"tutorials/audio/{parts[2]}"
+    if path.startswith("tests/"):
+        return "tests"
+    if path.startswith("benchmarking/"):
+        return "benchmarking"
+    if path.startswith("fern/"):
+        return "docs (fern)"
+    return parts[0] if parts else path
+
+
+def build(pr, outdir, today, baseline_ts, prev_head):
+    def latest(kind):
+        return outdir / f"pr{pr}_{kind}_latest.json"
+
+    gh = load(latest("gh"))
+    reviews = load(latest("reviews"))
+    review_comments = load(latest("review_comments"))
+    issue_comments = load(latest("issue_comments"))
+    files = load(latest("files"))
+    commits = load(latest("commits"))
+
+    if files and not any(AUDIO_RE.search(f.get("filename", "")) for f in files):
+        raise SystemExit(
+            f"PR {pr} touches no audio path; review-curator-audio-pr is audio-only. "
+            "Aborting (audio paths: nemo_curator/stages/audio/, "
+            "nemo_curator/tasks/audio_task.py, tutorials/audio/, "
+            "tests/stages/audio/, audio benchmarks)."
+        )
+    threads_path = latest("review_threads")
+    idx = build_thread_index(load(threads_path)) if threads_path.exists() else build_thread_index({})
+
+    def base(kind):
+        return outdir / f"pr{pr}_{kind}_{baseline_ts}.json" if baseline_ts else None
+
+    base_comment_ids = load_baseline_ids(base("review_comments"), "id")
+    base_review_ids = load_baseline_ids(base("reviews"), "id")
+    base_issue_ids = load_baseline_ids(base("issue_comments"), "id")
+    base_commit_shas = set()
+    bc = base("commits")
+    if bc and bc.exists():
+        base_commit_shas = {c["sha"] for c in load(bc) if "sha" in c}
+
+    head_oid = gh.get("headRefOid", "")
+    base_oid = gh.get("baseRefOid", "")
+    prev8 = (prev_head or "")[:8] or "(none)"
+    date_us = today.replace("-", "_")
+
+    # ----- DIGEST -----
+    d = []
+    p(d, f"# Curator PR {pr} Fresh Review - {today}", "")
+    p(d, f"Review target: https://github.com/NVIDIA-NeMo/Curator/pull/{pr}", "")
+    p(d, f"Current PR head reviewed: `{head_oid}`", "")
+    p(d, f"Base recorded by GitHub metadata: `{base_oid}`", "")
+    if prev_head:
+        p(d, f"Previous reviewed head: `{prev_head}`", "")
+    p(d, "")
+
+    # ----- PR OVERVIEW (what the PR is about; present BEFORE any review) -----
+    author = (gh.get("author") or {}).get("login", "?")
+    p(d, "## What this PR does (overview)", "")
+    p(d, f"**{gh.get('title','')}** - by @{author}  "
+         f"({gh.get('state')}, +{gh.get('additions')}/-{gh.get('deletions')} "
+         f"across {gh.get('changedFiles')} files, {len(commits)} commits)", "")
+
+    p(d, "### Author's description", "")
+    pr_body = (gh.get("body") or "").strip()
+    p(d, pr_body if pr_body else "_(no PR description provided)_", "")
+
+    p(d, "### Areas touched", "")
+    areas = {}
+    for f in files:
+        e = areas.setdefault(area_of(f["filename"]), {"n": 0, "add": 0, "dele": 0})
+        e["n"] += 1
+        e["add"] += f.get("additions", 0)
+        e["dele"] += f.get("deletions", 0)
+    p(d, "| Area | files | +/- |", "|---|---|---|")
+    for a in sorted(areas):
+        e = areas[a]
+        p(d, f"| {a} | {e['n']} | +{e['add']} -{e['dele']} |")
+    p(d, "")
+
+    p(d, "### Plain-language summary (reviewer writes this BEFORE any findings)", "")
+    p(d, "_After reading the knowledge sources (knowledge-sources.md), explain in "
+         "detail, in your own words, what this PR changes and why: the "
+         "problem it solves, the main audio stages/files it adds or modifies, key "
+         "design decisions, new dependencies/config/APIs, and the blast radius "
+         "(what could regress). Present this overview first; only then move to the "
+         "findings below._", "")
+
+    p(d, "## PR state at review time", "")
+    p(d, "| Field | Value |", "|---|---|")
+    p(d, f"| state | {gh.get('state')} |")
+    p(d, f"| isDraft | {gh.get('isDraft')} |")
+    p(d, f"| reviewDecision | {gh.get('reviewDecision') or '(none)'} |")
+    p(d, f"| mergeable / mergeStateStatus | {gh.get('mergeable')} / {gh.get('mergeStateStatus')} |")
+    p(d, f"| changedFiles | {gh.get('changedFiles')} |")
+    p(d, f"| additions / deletions | +{gh.get('additions')} / -{gh.get('deletions')} |")
+    p(d, f"| commits | {len(commits)} |")
+    p(d, f"| inline review comments total | {len(review_comments)} |")
+    p(d, f"| reviews submitted | {len(reviews)} |")
+    p(d, f"| top-level issue comments | {len(issue_comments)} |")
+    p(d, f"| updatedAt | {gh.get('updatedAt')} |", "")
+
+    new_commits = [c for c in commits if c["sha"] not in base_commit_shas] if base_commit_shas else commits
+    p(d, f"## Commits since prior reviewed head `{prev8}` ({len(new_commits)} new)", "")
+    p(d, "```")
+    for c in new_commits:
+        sha = c["sha"][:8]
+        msg = (c["commit"]["message"] or "").splitlines()[0]
+        cdate = (c["commit"]["author"] or {}).get("date", "")
+        p(d, f"{sha}  {msg}  ({cdate[:10]})")
+    p(d, "```", "")
+
+    p(d, "Files changed in current PR head (`git diff <base>..<head>`):", "")
+    p(d, "| File | +/- | status |", "|---|---|---|")
+    for f in sorted(files, key=lambda x: x["filename"]):
+        p(d, f"| `{f['filename']}` | +{f['additions']} -{f['deletions']} | {f['status']} |")
+    p(d, "")
+
+    p(d, "## Existing reviews (by other reviewers)", "")
+    review_by_id = {r["id"]: r for r in reviews}
+    for r in sorted(reviews, key=lambda x: x.get("submitted_at") or ""):
+        marker = " (NEW)" if base_review_ids and r["id"] not in base_review_ids else ""
+        commit = (r.get("commit_id") or "")[:8] or "n/a"
+        p(d, f"### #{r['id']} by @{r['user']['login']}  state={r['state']}  "
+             f"commit={commit}  submitted={r.get('submitted_at')}{marker}")
+        body = (r.get("body") or "").strip()
+        if body:
+            p(d, "", shorten(body, 1500))
+        p(d, "")
+
+    p(d, "## Existing inline comments (by other reviewers)", "")
+    p(d, f"Total: {len(review_comments)} comments across {idx['nthreads'] or '?'} threads.", "")
+    by_status = {"OPEN": [], "OUTDATED": [], "RESOLVED": [], "ORPHAN": []}
+    for c in review_comments:
+        by_status[status_of(thread_meta(idx, c))].append(c)
+    for s in ("OPEN", "OUTDATED", "RESOLVED", "ORPHAN"):
+        p(d, f"- {s}: {len(by_status[s])}")
+    p(d, "")
+
+    by_path = {}
+    for c in review_comments:
+        by_path.setdefault(c["path"], []).append(c)
+    for path in sorted(by_path):
+        p(d, f"### `{path}`", "")
+        for c in sorted(by_path[path], key=lambda x: (x.get("line") or x.get("original_line") or 0, x["created_at"])):
+            meta = thread_meta(idx, c)
+            status = status_of(meta)
+            new = " NEW" if base_comment_ids and c["id"] not in base_comment_ids else ""
+            line = c.get("line") or c.get("original_line")
+            review = review_by_id.get(c.get("pull_request_review_id"), {})
+            commit = (c.get("commit_id") or "")[:8]
+            in_reply = c.get("in_reply_to_id")
+            reply_marker = f"  reply->{in_reply}" if in_reply else ""
+            tcount = meta["thread_comment_count"] if meta else 1
+            tpos = " (root)" if meta and meta["is_first_in_thread"] else (" (reply)" if meta else "")
+            p(d, f"- **#{c['id']}** @{c['user']['login']} {c['created_at']}  "
+                 f"line={line} commit={commit} status=**{status}**{new}  "
+                 f"thread_comments={tcount}{tpos} review_state={review.get('state', '?')}{reply_marker}")
+            p(d, f"  url: {c['html_url']}")
+            body = shorten(c.get("body") or "", 700).replace("\n", "\n  > ")
+            p(d, f"  > {body}", "")
+    p(d, "")
+
+    p(d, "## Existing issue comments (by other reviewers)", "")
+    for c in sorted(issue_comments, key=lambda x: x["created_at"]):
+        new = " NEW" if base_issue_ids and c["id"] not in base_issue_ids else ""
+        p(d, f"### #{c['id']} by @{c['user']['login']}  {c['created_at']}{new}", "")
+        p(d, shorten(c.get("body") or "", 1500), "")
+
+    p(d, "## My findings (your review)", "")
+    p(d, "_Add your findings here as you review. Classify each P0-P3, cite "
+         "path:line on the current head, and propose a concrete fix. See "
+         "templates.md section C._", "")
+    p(d, "## Verdict", "")
+    p(d, "_APPROVE / COMMENT / REQUEST CHANGES + blockers._", "")
+
+    digest_path = outdir / f"curator_pr{pr}_fresh_review_{date_us}.md"
+    digest_path.write_text("\n".join(d) + "\n", encoding="utf-8")
+    print(f"wrote {digest_path}  ({digest_path.stat().st_size} bytes)")
+
+    # ----- COMMENT QUEUE (open root threads only) -----
+    q = []
+    p(q, f"# Curator PR {pr} Open Review Threads - {today}", "")
+    p(q, f"Review target: https://github.com/NVIDIA-NeMo/Curator/pull/{pr}", "")
+    p(q, f"Current PR head: `{head_oid}`", "")
+    p(q, "Threads other reviewers left that are still unresolved on the current "
+         "head. Scan these before adding your own comments so you do not duplicate, "
+         "and check whether the author addressed them. Stale (outdated/resolved) "
+         "threads are listed at the bottom for context only.", "")
+
+    def is_root(c):
+        return (thread_meta(idx, c) or {}).get("is_first_in_thread", True)
+
+    open_root = [c for c in review_comments if status_of(thread_meta(idx, c)) == "OPEN" and is_root(c)]
+    open_root.sort(key=lambda c: (c["path"], c.get("line") or c.get("original_line") or 0, c["created_at"]))
+    for i, c in enumerate(open_root, 1):
+        meta = thread_meta(idx, c) or {}
+        line = c.get("line") or c.get("original_line")
+        new = " NEW" if base_comment_ids and c["id"] not in base_comment_ids else ""
+        p(q, f"## Thread {i} - {c['path']}:{line} by @{c['user']['login']}{new}", "")
+        p(q, f"Thread: {c['html_url']}", "")
+        if meta.get("thread_comment_count", 1) > 1:
+            p(q, f"Thread has {meta['thread_comment_count']} comments - see the digest for the full thread.", "")
+        p(q, f"File: `{c['path']}`", "")
+        p(q, f"Line: `{line}`", "")
+        p(q, "Reviewer text:", "")
+        p(q, "```text", (c.get("body") or "").strip(), "```", "")
+
+    p(q, "## Stale (outdated/resolved) comments", "")
+    p(q, "These threads pre-date the current head and are outdated/resolved. Listed "
+         "for traceability only - no action needed unless a reviewer reopens.", "")
+    stale = [c for c in review_comments
+             if status_of(thread_meta(idx, c)) in ("OUTDATED", "RESOLVED") and is_root(c)]
+    stale.sort(key=lambda c: (status_of(thread_meta(idx, c)), c["path"], c.get("original_line") or 0))
+    for c in stale:
+        line = c.get("line") or c.get("original_line")
+        body = shorten((c.get("body") or "").strip(), 200).replace("\n", " ")
+        p(q, f"- [{status_of(thread_meta(idx, c))}] `{c['path']}:{line}` "
+             f"@{c['user']['login']} ({c['html_url']}): {body}")
+    p(q, "")
+
+    queue_path = outdir / f"curator_pr{pr}_github_comment_queue_{date_us}.md"
+    queue_path.write_text("\n".join(q) + "\n", encoding="utf-8")
+    print(f"wrote {queue_path}  ({queue_path.stat().st_size} bytes)")
+
+
+def main():
+    ap = argparse.ArgumentParser()
+    ap.add_argument("pr", type=int, help="PR number")
+    ap.add_argument("--outdir", default=".curator-pr-review",
+                    help="directory holding pr<N>_*_latest.json (default .curator-pr-review)")
+    ap.add_argument("--today", default=None, help="ISO date for filenames; default today UTC")
+    ap.add_argument("--baseline-ts", default=None,
+                    help="timestamp suffix of a prior pull to mark NEW vs already-seen")
+    ap.add_argument("--prev-head", default=None, help="prior reviewed head SHA for the commit delta")
+    args = ap.parse_args()
+    today = args.today or dt.datetime.utcnow().date().isoformat()
+    build(args.pr, Path(args.outdir), today, args.baseline_ts, args.prev_head)
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
diff --git a/.cursor/skills/review-curator-audio-pr/scripts/ensure_repo.sh b/.cursor/skills/review-curator-audio-pr/scripts/ensure_repo.sh
new file mode 100755
index 0000000000..4b9c914c39
--- /dev/null
+++ b/.cursor/skills/review-curator-audio-pr/scripts/ensure_repo.sh
@@ -0,0 +1,57 @@
+#!/usr/bin/env bash
+# Make a local NVIDIA-NeMo/Curator checkout available WITHOUT cloning the whole
+# repo unnecessarily.
+#
+# Strategy (cheapest first):
+#   1. If we are already inside a Curator checkout, use it.
+#   2. Else search the current directory tree (bounded depth) for an existing
+#      Curator checkout and reuse it.
+#   3. Else shallow-clone (--depth 1, no full history) into ./Curator.
+#
+# Prints diagnostics to stderr and, on the LAST stdout line:
+#   CURATOR_REPO=<absolute path to the checkout>
+#
+# Usage: ensure_repo.sh [CLONE_DIR]   (default CLONE_DIR=Curator)
+#   MAXDEPTH=<n>   bound the downward search (default 4)
+#
+# Clone source: https://github.com/NVIDIA-NeMo/Curator
+set -euo pipefail
+
+REPO_URL="https://github.com/NVIDIA-NeMo/Curator.git"
+CLONE_DIR="${1:-Curator}"
+MAXDEPTH="${MAXDEPTH:-4}"
+MARKER="nemo_curator/stages/audio/README.md"
+
+is_curator() { [[ -f "$1/${MARKER}" && -d "$1/.cursor/rules" ]]; }
+abspath()    { (cd "$1" && pwd); }
+
+# 1) Already inside a Curator checkout?
+if top="$(git rev-parse --show-toplevel 2>/dev/null)" && is_curator "${top}"; then
+    echo "ensure_repo: already inside Curator checkout (${top}); not cloning." >&2
+    echo "CURATOR_REPO=${top}"; exit 0
+fi
+if is_curator "."; then
+    echo "ensure_repo: current directory is a Curator checkout; not cloning." >&2
+    echo "CURATOR_REPO=$(abspath .)"; exit 0
+fi
+
+# 2) Existing checkout somewhere under the current tree?
+while IFS= read -r marker; do
+    cand="${marker%/${MARKER}}"
+    if is_curator "${cand}"; then
+        echo "ensure_repo: reusing existing checkout at ${cand}; not cloning." >&2
+        echo "CURATOR_REPO=$(abspath "${cand}")"; exit 0
+    fi
+done < <(find . -maxdepth "${MAXDEPTH}" -path "*/${MARKER}" 2>/dev/null)
+
+# 3) Reuse a prior clone dir if present, else shallow-clone.
+if is_curator "${CLONE_DIR}"; then
+    echo "ensure_repo: reusing ${CLONE_DIR}; not cloning." >&2
+    echo "CURATOR_REPO=$(abspath "${CLONE_DIR}")"; exit 0
+fi
+
+command -v git >/dev/null || { echo "ensure_repo: git not found" >&2; exit 2; }
+echo "ensure_repo: no local checkout found; shallow-cloning ${REPO_URL} -> ${CLONE_DIR}" >&2
+git clone --depth 1 "${REPO_URL}" "${CLONE_DIR}"
+is_curator "${CLONE_DIR}" || { echo "ensure_repo: clone did not produce a valid checkout" >&2; exit 1; }
+echo "CURATOR_REPO=$(abspath "${CLONE_DIR}")"
diff --git a/.cursor/skills/review-curator-audio-pr/scripts/pr_review_pull.sh b/.cursor/skills/review-curator-audio-pr/scripts/pr_review_pull.sh
new file mode 100755
index 0000000000..9092688f89
--- /dev/null
+++ b/.cursor/skills/review-curator-audio-pr/scripts/pr_review_pull.sh
@@ -0,0 +1,109 @@
+#!/usr/bin/env bash
+# Fetch GitHub data for an NVIDIA-NeMo/Curator PR review.
+#
+# Usage: pr_review_pull.sh <PR_NUMBER> [--outdir DIR] [--repo OWNER/REPO]
+#
+# Writes pr<N>_*_latest.json (consumed by build_digest.py) plus timestamped
+# snapshots so a prior pull is preserved for delta analysis. Pulls six REST
+# endpoints and the GraphQL review threads (which carry isResolved/isOutdated).
+#
+# Requires the GitHub CLI (`gh`) authenticated against github.com.
+set -euo pipefail
+
+PR=""
+REPO="NVIDIA-NeMo/Curator"
+OUTDIR=".curator-pr-review"
+
+while [[ $# -gt 0 ]]; do
+    case "$1" in
+        --outdir) OUTDIR="$2"; shift 2 ;;
+        --repo)   REPO="$2";   shift 2 ;;
+        -h|--help)
+            echo "Usage: pr_review_pull.sh <PR_NUMBER> [--outdir DIR] [--repo OWNER/REPO]"; exit 0 ;;
+        *) PR="$1"; shift ;;
+    esac
+done
+[[ -n "${PR}" ]] || { echo "error: PR number required" >&2; exit 2; }
+
+command -v gh >/dev/null || { echo "error: gh (GitHub CLI) not found" >&2; exit 2; }
+
+mkdir -p "${OUTDIR}"
+TS="$(date -u +%Y%m%dT%H%M%SZ)"
+LOG="${OUTDIR}/pr${PR}_review_pull_${TS}.log"
+
+{
+    echo "=== Pull start ${TS}  PR=${PR}  repo=${REPO} ==="
+    gh --version | head -1
+} | tee -a "${LOG}"
+
+pull_endpoint() {
+    local label="$1"; shift
+    local outfile="$1"; shift
+    echo "--- ${label} -> ${outfile} ---" | tee -a "${LOG}"
+    "$@" > "${outfile}"
+    printf 'bytes=%s\n\n' "$(stat -c%s "${outfile}" 2>/dev/null || wc -c < "${outfile}")" | tee -a "${LOG}"
+}
+
+GH_FIELDS="number,title,state,isDraft,mergeable,mergeStateStatus,headRefName,headRefOid,baseRefName,baseRefOid,additions,deletions,changedFiles,commits,reviewDecision,reviewRequests,labels,milestone,createdAt,updatedAt,closedAt,mergedAt,author,body,url,statusCheckRollup"
+
+pull_endpoint "pr ${PR} metadata" "${OUTDIR}/pr${PR}_gh_${TS}.json" \
+    gh pr view "${PR}" --repo "${REPO}" --json "${GH_FIELDS}"
+pull_endpoint "pulls/${PR}/files" "${OUTDIR}/pr${PR}_files_${TS}.json" \
+    gh api --paginate "repos/${REPO}/pulls/${PR}/files"
+
+# This skill is audio-only: abort before pulling anything else if the PR's
+# changed files touch no audio path.
+AUDIO_RE='^(nemo_curator/stages/audio/|nemo_curator/tasks/audio_task\.py|tutorials/audio/|tests/stages/audio/|tests/tasks/test_audio|benchmarking/.*([Aa]udio|ALM|alm))'
+if ! AUDIO_RE="${AUDIO_RE}" python3 - "${OUTDIR}/pr${PR}_files_${TS}.json" <<'PY'
+import json, os, re, sys
+files = [f.get("filename", "") for f in json.load(open(sys.argv[1]))]
+rx = re.compile(os.environ["AUDIO_RE"])
+sys.exit(0 if any(rx.search(f) for f in files) else 1)
+PY
+then
+    echo "error: PR ${PR} touches no audio path; review-curator-audio-pr is audio-only." >&2
+    echo "       audio paths: nemo_curator/stages/audio/, nemo_curator/tasks/audio_task.py," >&2
+    echo "       tutorials/audio/, tests/stages/audio/, audio benchmarks. Aborting." >&2
+    exit 3
+fi
+
+pull_endpoint "pulls/${PR}/reviews" "${OUTDIR}/pr${PR}_reviews_${TS}.json" \
+    gh api --paginate "repos/${REPO}/pulls/${PR}/reviews"
+pull_endpoint "pulls/${PR}/comments (inline)" "${OUTDIR}/pr${PR}_review_comments_${TS}.json" \
+    gh api --paginate "repos/${REPO}/pulls/${PR}/comments"
+pull_endpoint "issues/${PR}/comments (top-level)" "${OUTDIR}/pr${PR}_issue_comments_${TS}.json" \
+    gh api --paginate "repos/${REPO}/issues/${PR}/comments"
+pull_endpoint "pulls/${PR}/commits" "${OUTDIR}/pr${PR}_commits_${TS}.json" \
+    gh api --paginate "repos/${REPO}/pulls/${PR}/commits"
+
+OWNER="${REPO%%/*}"
+NAME="${REPO##*/}"
+pull_endpoint "graphql reviewThreads" "${OUTDIR}/pr${PR}_review_threads_${TS}.json" \
+    gh api graphql \
+      -f query='query($owner:String!,$repo:String!,$pr:Int!){ repository(owner:$owner,name:$repo){ pullRequest(number:$pr){ reviewThreads(first:100){ nodes{ id isResolved isOutdated isCollapsed line originalLine path comments(first:50){ nodes{ databaseId } } } } } } }' \
+      -f owner="${OWNER}" -f repo="${NAME}" -F pr="${PR}"
+
+for kind in gh reviews review_comments issue_comments files commits review_threads; do
+    cp -f "${OUTDIR}/pr${PR}_${kind}_${TS}.json" "${OUTDIR}/pr${PR}_${kind}_latest.json"
+done
+
+{
+    echo "--- counts ---"
+    for kind in reviews review_comments issue_comments files commits; do
+        f="${OUTDIR}/pr${PR}_${kind}_${TS}.json"
+        n=$(python3 -c "import json,sys; d=json.load(open(sys.argv[1])); print(len(d) if isinstance(d,list) else 1)" "${f}")
+        printf 'pr%s_%s: %s entries\n' "${PR}" "${kind}" "${n}"
+    done
+    echo "--- head SHA / activity ---"
+    python3 - "${OUTDIR}/pr${PR}_gh_latest.json" <<'PY'
+import json, sys
+gh = json.loads(open(sys.argv[1]).read())
+print(f"head_oid:   {gh.get('headRefOid')}")
+print(f"base_oid:   {gh.get('baseRefOid')}")
+print(f"state:      {gh.get('state')}  draft={gh.get('isDraft')}  mergeStatus={gh.get('mergeStateStatus')}  reviewDecision={gh.get('reviewDecision')}")
+print(f"counts:     files={gh.get('changedFiles')} +{gh.get('additions')}/-{gh.get('deletions')}")
+print(f"updated_at: {gh.get('updatedAt')}")
+PY
+} | tee -a "${LOG}"
+
+echo "PR${PR}_REVIEW_PULL_DONE  outdir=${OUTDIR}  log=${LOG}" | tee -a "${LOG}"
diff --git a/.cursor/skills/review-curator-audio-pr/scripts/pull_audio_pr_corpus.sh b/.cursor/skills/review-curator-audio-pr/scripts/pull_audio_pr_corpus.sh
new file mode 100755
index 0000000000..a88cdb8659
--- /dev/null
+++ b/.cursor/skills/review-curator-audio-pr/scripts/pull_audio_pr_corpus.sh
@@ -0,0 +1,120 @@
+#!/usr/bin/env bash
+# Discover audio-modality PRs opened AFTER a baseline PR (#1608 by default),
+# open or closed/merged, and pull each one's reviews + comments into a corpus
+# directory. Consolidate with build_corpus.py.
+#
+# PR numbers are monotonic in time, so "number > --since" == "opened after that
+# PR". We then keep only PRs whose changed files touch audio paths.
+#
+# Incremental by default: PRs already pulled into OUTDIR are skipped, so reruns
+# only fetch PRs that are new since last time. Pass --refresh to re-pull
+# everything (e.g. to pick up new comments on open PRs).
+#
+# Usage: pull_audio_pr_corpus.sh [--since N] [--outdir DIR] [--repo OWNER/REPO] [--limit N] [--refresh]
+#
+# Requires the GitHub CLI (`gh`) authenticated against github.com.
+set -euo pipefail
+
+REPO="NVIDIA-NeMo/Curator"
+SINCE=1608
+OUTDIR=".curator-pr-review/audio-corpus"
+LIMIT=600
+REFRESH=0
+
+while [[ $# -gt 0 ]]; do
+    case "$1" in
+        --since)   SINCE="$2";  shift 2 ;;
+        --outdir)  OUTDIR="$2"; shift 2 ;;
+        --repo)    REPO="$2";   shift 2 ;;
+        --limit)   LIMIT="$2";  shift 2 ;;
+        --refresh) REFRESH=1;   shift ;;
+        -h|--help)
+            echo "Usage: pull_audio_pr_corpus.sh [--since N] [--outdir DIR] [--repo OWNER/REPO] [--limit N] [--refresh]"; exit 0 ;;
+        *) echo "unknown arg: $1" >&2; exit 2 ;;
+    esac
+done
+
+command -v gh >/dev/null || { echo "error: gh (GitHub CLI) not found" >&2; exit 2; }
+mkdir -p "${OUTDIR}"
+
+# Cache of candidate PRs already classified as non-audio, so reruns don't
+# re-fetch their file lists. Audio PRs are remembered by their pr<N>_gh.json.
+NONAUDIO_CACHE="${OUTDIR}/_non_audio_prs.txt"
+touch "${NONAUDIO_CACHE}"
+declare -A IS_NONAUDIO=()
+while IFS= read -r x; do [[ -n "${x}" ]] && IS_NONAUDIO["${x}"]=1; done < "${NONAUDIO_CACHE}"
+
+# Audio-path test (extended regex over a PR's changed-file list).
+AUDIO_RE='^(nemo_curator/stages/audio/|nemo_curator/tasks/audio_task\.py|tutorials/audio/|tests/stages/audio/|tests/tasks/test_audio|benchmarking/.*([Aa]udio|ALM|alm))'
+
+echo "=== corpus discovery: ${REPO} PRs > #${SINCE} (state=all, limit=${LIMIT}) ===" >&2
+gh pr list --repo "${REPO}" --state all --limit "${LIMIT}" \
+    --json number,title,state,author,createdAt,updatedAt,url \
+    > "${OUTDIR}/_all_prs.json"
+
+# Candidate PR numbers (> SINCE), newest first.
+mapfile -t CANDIDATES < <(python3 - "${OUTDIR}/_all_prs.json" "${SINCE}" <<'PY'
+import json, sys
+prs = json.load(open(sys.argv[1])); since = int(sys.argv[2])
+nums = sorted((p["number"] for p in prs if p["number"] > since), reverse=True)
+print("\n".join(str(n) for n in nums))
+PY
+)
+echo "candidates after #${SINCE}: ${#CANDIDATES[@]}" >&2
+
+AUDIO_NUMS=()
+for n in "${CANDIDATES[@]}"; do
+    # Incremental: reuse prior classification unless --refresh.
+    if [[ ${REFRESH} -eq 0 ]]; then
+        if [[ -f "${OUTDIR}/pr${n}_gh.json" ]]; then
+            AUDIO_NUMS+=("${n}"); echo "  pr${n}: AUDIO (on disk, skip)" >&2; continue
+        fi
+        if [[ -n "${IS_NONAUDIO[$n]:-}" ]]; then
+            echo "  pr${n}: non-audio (cached, skip)" >&2; continue
+        fi
+    fi
+    files_json="${OUTDIR}/pr${n}_files.json"
+    gh api --paginate "repos/${REPO}/pulls/${n}/files" --jq '[.[].filename]' \
+        > "${files_json}" 2>/dev/null || { echo "  pr${n}: files fetch failed, skip" >&2; continue; }
+    if AUDIO_RE="${AUDIO_RE}" python3 - "${files_json}" <<'PY'
+import json, os, re, sys
+files = json.load(open(sys.argv[1]))
+rx = re.compile(os.environ["AUDIO_RE"])
+sys.exit(0 if any(rx.search(f) for f in files) else 1)
+PY
+    then
+        AUDIO_NUMS+=("${n}")
+        echo "  pr${n}: AUDIO" >&2
+    else
+        rm -f "${files_json}"
+        if [[ -z "${IS_NONAUDIO[$n]:-}" ]]; then
+            echo "${n}" >> "${NONAUDIO_CACHE}"; IS_NONAUDIO["${n}"]=1
+        fi
+    fi
+done
+
+echo "audio PRs after #${SINCE}: ${#AUDIO_NUMS[@]}" >&2
+printf '%s\n' "${AUDIO_NUMS[@]}" > "${OUTDIR}/_audio_pr_numbers.txt"
+
+pulled=0
+skipped=0
+for n in "${AUDIO_NUMS[@]}"; do
+    # Incremental: skip PRs whose data is already on disk unless --refresh.
+    if [[ ${REFRESH} -eq 0 && -f "${OUTDIR}/pr${n}_gh.json" \
+          && -f "${OUTDIR}/pr${n}_reviews.json" \
+          && -f "${OUTDIR}/pr${n}_review_comments.json" \
+          && -f "${OUTDIR}/pr${n}_issue_comments.json" ]]; then
+        echo "--- pr${n}: already on disk, skip (use --refresh to re-pull) ---" >&2
+        skipped=$((skipped + 1)); continue
+    fi
+    echo "--- pulling pr${n} reviews/comments ---" >&2
+    gh pr view "${n}" --repo "${REPO}" \
+        --json number,title,state,author,createdAt,updatedAt,mergedAt,closedAt,url,body \
+        > "${OUTDIR}/pr${n}_gh.json"
+    gh api --paginate "repos/${REPO}/pulls/${n}/reviews"  > "${OUTDIR}/pr${n}_reviews.json"
+    gh api --paginate "repos/${REPO}/pulls/${n}/comments" > "${OUTDIR}/pr${n}_review_comments.json"
+    gh api --paginate "repos/${REPO}/issues/${n}/comments" > "${OUTDIR}/pr${n}_issue_comments.json"
+    pulled=$((pulled + 1))
+done
+
+echo "AUDIO_PR_CORPUS_PULL_DONE  outdir=${OUTDIR}  audio_prs=${#AUDIO_NUMS[@]}  pulled=${pulled}  skipped=${skipped}  refresh=${REFRESH}" >&2
diff --git a/.cursor/skills/review-curator-audio-pr/templates.md b/.cursor/skills/review-curator-audio-pr/templates.md
new file mode 100644
index 0000000000..16af214cb0
--- /dev/null
+++ b/.cursor/skills/review-curator-audio-pr/templates.md
@@ -0,0 +1,158 @@
+# Output templates
+
+You are the reviewer. `scripts/build_digest.py` generates files A and B as
+context; you author the findings (section C) and post them as your review.
+The digest **leads with a PR overview** that you must complete and present
+before any review comments. Replace `<...>` placeholders. Keep the structure
+stable across review passes so a re-review diffs cleanly and only surfaces what
+changed.
+
+## A. Working digest (generated, then you append findings)
+
+File: `curator_pr<N>_fresh_review_<YYYY_MM_DD>.md`. It **leads with a PR
+overview** (what the change does and why), then PR state, the diff, and all
+existing review activity. Complete the overview's plain-language summary first
+and present it before any findings; add your **Findings** section at the end.
+
+````markdown
+# Curator PR <N> Fresh Review - <YYYY-MM-DD>
+
+Review target: https://github.com/NVIDIA-NeMo/Curator/pull/<N>
+
+Current PR head reviewed: `<headRefOid>`
+
+Base recorded by GitHub metadata: `<baseRefOid>`
+
+## What this PR does (overview)
+
+**<title>** - by @<author>  (<state>, +<a>/-<d> across <n> files, <k> commits)
+
+### Author's description
+<verbatim PR body>
+
+### Areas touched
+| Area | files | +/- |
+|---|---|---|
+| stages/audio/<area> | <n> | +<a> -<d> |
+
+### Plain-language summary (write this BEFORE any findings)
+<In your own words, after reading the knowledge sources: what the PR changes and
+why, the main stages/files, key design decisions, new deps/config/APIs, and
+blast radius.>
+
+## PR state at review time
+
+| Field | Value |
+|---|---|
+| state | <OPEN/CLOSED/MERGED> |
+| reviewDecision | <APPROVED/CHANGES_REQUESTED/none> |
+| changedFiles | <n> |
+| additions / deletions | +<a> / -<d> |
+| commits | <n> |
+| inline review comments total | <n> |
+| reviews submitted | <n> |
+| updatedAt | <iso> |
+
+## Commits (<k>)
+
+```
+<sha8>  <subject>  (<date>)
+```
+
+Files changed (`git diff <base>..<head>`):
+
+| File | +/- | status |
+|---|---|---|
+| `<path>` | +<a> -<d> | <added/modified> |
+
+## Existing reviews (by other reviewers)
+
+### #<id> by @<login>  state=<STATE>  commit=<sha8>  submitted=<iso>
+<body or empty>
+
+## Existing inline comments (by other reviewers)
+
+Total: <n> comments across <m> threads.
+
+- OPEN: <n>     # still unresolved - do NOT duplicate these
+- OUTDATED: <n>
+- RESOLVED: <n>
+
+### `<path>`
+
+- **#<id>** @<login> <iso>  line=<L> status=**<OPEN/OUTDATED/RESOLVED>**  ...
+  url: <html_url>
+  > <body, truncated>
+
+## My findings (your review)
+
+### P0: <title>
+<evidence: path:line on current head> - <why it's a blocker> - <concrete fix>
+
+### P1 / P2 / P3: <title>
+<...>
+
+## Verdict
+<APPROVE / COMMENT / REQUEST CHANGES, and the blockers if any>
+````
+
+## B. Prior open threads (generated context)
+
+File: `curator_pr<N>_github_comment_queue_<YYYY_MM_DD>.md`. The still-unresolved
+threads other reviewers already opened, so you can scan prior feedback before
+adding your own. This is **context, not your output** - you are not replying to
+these as the author; you read them to avoid duplicating and to check whether the
+author addressed them.
+
+````markdown
+# Curator PR <N> Open Review Threads - <YYYY-MM-DD>
+
+Current PR head: `<headRefOid>`
+
+Threads other reviewers left that are still unresolved on the current head.
+
+## Thread <i> - <path>:<line> by @<login>
+
+Link: <html_url>
+
+```text
+<verbatim reviewer body>
+```
+
+## Stale (outdated/resolved) threads
+- [<OUTDATED/RESOLVED>] `<path>:<line>` @<login> (<html_url>): <one-line body>
+````
+
+## C. Writing your review comments
+
+Turn each finding into a review comment you post on the PR. For each:
+
+- Anchor it to the exact `path:line-range` on the current head.
+- Lead with severity (P0-P3) and a one-line statement of the problem.
+- Explain *why* it matters (cite a `.cursor/rules/*.mdc` contract or
+  `stages/audio/README.md` when the change deviates).
+- Propose a concrete fix, ideally with a `suggestion` block.
+- Before posting, check the "OPEN" list in file A - if another reviewer already
+  raised it, don't repeat it (optionally +1 their thread instead).
+
+Example review comment:
+
+> **P0 (blocker).** `sharded_manifest_writer.py:101` calls `json.dumps(record)`
+> while `record` can still hold the waveform `ndarray` when `keep_waveform=True`,
+> which raises `TypeError: Object of type ndarray is not JSON serializable` and
+> fails the whole shard. Drop the waveform key before serialising and guard
+> `json.dumps` so a stray non-serialisable value names the offending key.
+> Please also add a test under `tests/stages/audio/io/` covering
+> `keep_waveform=True`.
+
+When you have an exact fix, attach a GitHub `suggestion` block so the author can
+apply it in one click (anchor the comment to the line(s) it replaces):
+
+```suggestion
+        record = {k: v for k, v in record.items() if k != "waveform"}
+        line = json.dumps(record)
+```
+
+Then post the set as one PR review: inline comments for line-specific findings,
+a top-level summary carrying the overall verdict (APPROVE / COMMENT / REQUEST
+CHANGES).
diff --git a/.gitignore b/.gitignore
index 1ba2c180fd..f99780f76e 100644
--- a/.gitignore
+++ b/.gitignore
@@ -169,3 +169,6 @@ fern/product-docs/
 
 # PDF parsing data (symlinked to external storage)
 pdf_parsing/
+
+# NeMo Curator PR review skill scratch output
+.curator-pr-review/