diff --git a/.cursor/skills/review-curator-audio-pr/SKILL.md b/.cursor/skills/review-curator-audio-pr/SKILL.md new file mode 100644 index 0000000000..4adb8ff4c1 --- /dev/null +++ b/.cursor/skills/review-curator-audio-pr/SKILL.md @@ -0,0 +1,257 @@ +--- +name: review-curator-audio-pr +description: >- + Review someone else's NVIDIA-NeMo/Curator audio-modality pull request. This is + a reviewer's tool: given a PR number, it locates (or shallow-clones) the repo, + pulls the PR, builds the post-#1608 audio review corpus, diffs the changed + audio code, applies Curator's audio stage contracts and contribution standards, + shows what other reviewers have already raised (so you don't duplicate), and + helps you produce a structured set of review findings (P0-P3) to post as + review comments. Use when you are assigned + or pick up an audio Curator PR and need to review it (e.g. "review audio PR + 1967", "do a review of this Curator audio PR", "what should I flag on PR + 1898"). It is NOT for a PR author responding to a review. +--- + +# Review NeMo Curator audio PRs + +**Use this when you are the reviewer and need to review someone else's audio +Curator PR.** Given a PR number, it helps you understand the change, see what +other reviewers already said, and write your own review. It does not help a PR +author reply to reviewers - it helps you *produce* the review. + +Scope: the **audio** modality only - code under `nemo_curator/stages/audio/`, +`nemo_curator/tasks/audio_task.py`, `tutorials/audio/`, audio tests under +`tests/stages/audio/`, and audio benchmarks. This skill is **audio-only**: if a +PR touches no audio path, the pull step refuses to run (step 2) - review it +with a modality-appropriate tool instead. + +Requirements: an environment with `git` and the GitHub CLI (`gh`) **already +authenticated** against `github.com`. Everything the skill does goes through +`gh`/`git`, so a working `gh` is the only credential you need. Verify with `gh +auth status`; if it is not green, give it an authenticated environment one of +these ways before invoking the skill: + +- **Interactive:** run `gh auth login` (choose GitHub.com, HTTPS) once; the token + is cached for future runs. +- **Token (CI / headless / sandbox):** export `GH_TOKEN` (or `GITHUB_TOKEN`) with + a PAT that has `repo` scope - e.g. `export GH_TOKEN=ghp_...` - then `gh auth + status` should pass. This is the way to hand a remote/agent runner a + `gh`-configured environment. +- **Existing config:** if you already have a `~/.config/gh/hosts.yml` (or a host + with `gh` set up), run the skill from that shell/user so it inherits the auth. + +You do **not** need to pre-clone the repo - step 0 reuses any checkout you +already have and only shallow-clones +[NVIDIA-NeMo/Curator](https://github.com/NVIDIA-NeMo/Curator) if none is found. + +## The reviewer prompt + +A human reviewer invokes this skill with a prompt like: + +> **"Review audio Curator PR with the review-curator-audio-pr skill. Find or +> shallow-clone the repo, pull the PR and prior review activity, build the +> post-#1608 audio review corpus, read the knowledge sources, then give me a +> detailed overview of what the PR does +> followed by my own findings (P0-P3) with exact `path:line` evidence and +> concrete fixes, ready to post as PR comments. Skip anything other reviewers +> already raised."** + +Variants: *"re-review PR and show only what changed since the last review"*, +*"what should I flag on the diarization PR ?"*, *"triage open audio PRs and +review the smallest one first"*. Every review run must build (or refresh) the +post-#1608 audio review corpus before findings (see +[knowledge-sources.md](knowledge-sources.md) section 4). + +Run the prompt in an environment where `gh auth status` passes. For a +remote/headless agent, say so explicitly, e.g. *"`gh` is authenticated via +`GH_TOKEN` in this environment - review audio Curator PR ..."* (see +Requirements above for how to provide a `gh`-configured environment). + +## What you produce + +You produce two things, **in this order**: (1) a **detailed overview of what the +PR does and why** - written only after you have read the knowledge sources (the +canonical docs and review lenses), so the reader understands the change before +any critique - then (2) your **findings** (P0-P3), each tied to a `path:line` on +the PR's current head, that you post as PR review comments. Never lead with +review comments; ground yourself in the knowledge sources, then explain the PR, +then give findings. + +The skill writes two helper files into a scratch directory (default +`.curator-pr-review/`, override with `--outdir`) to support that: + +1. `curator_pr_fresh_review_.md` - **your working digest**: leads + with a **"What this PR does (overview)"** section (the author's description, + an areas-touched table, and a plain-language summary you write), then PR + state, commits, changed-file table, and every existing review/comment grouped + by file with OPEN / OUTDATED / RESOLVED status, then a placeholder for your + findings. Fill in the overview first, then record your findings at the bottom. +2. `curator_pr_github_comment_queue_.md` - **prior open threads**: + a condensed list of the threads other reviewers left that are still + unresolved on the current head. Scan it before writing your own comments. + +See [templates.md](templates.md) for the exact layout and how to phrase your +review comments. The review lenses and every doc/code reference live in +[knowledge-sources.md](knowledge-sources.md). + +## Workflow + +``` +- [ ] 0. Locate or shallow-clone the Curator repo +- [ ] 1. Identify the PR and confirm it touches audio paths +- [ ] 2. Pull fresh GitHub data (gh) +- [ ] 3. Build the post-#1608 audio review corpus (required) +- [ ] 4. Read the diff +- [ ] 5. Read the knowledge sources (canonical docs + review lenses + corpus) to ground yourself +- [ ] 6. Generate the digest + prior-threads file for context +- [ ] 7. Write the detailed PR overview - only after steps 4-5 - present this first +- [ ] 8. Review through the lenses, write up findings (P0-P3), and post them after the overview +``` + +### Step 0 - Locate or shallow-clone the repo + +Don't clone unnecessarily. This helper checks whether you are already inside a +Curator checkout, searches the current directory tree for one, and only then +shallow-clones (`--depth 1`, no full history): + +```bash +eval "$(scripts/ensure_repo.sh | tail -1)" # sets CURATOR_REPO= +cd "$CURATOR_REPO" +``` + +If you are already in the checkout that contains this skill, you can skip this. + +### Step 1 - Identify the PR + +`gh pr view --repo NVIDIA-NeMo/Curator`. This skill is **audio-only**: the +pull step (step 2) aborts if the PR touches no audio path +(`nemo_curator/stages/audio/`, `nemo_curator/tasks/audio_task.py`, +`tutorials/audio/`, `tests/stages/audio/`, audio benchmarks). If it aborts, the +PR is out of scope for this skill, so stop and do not review it here. + +### Step 2 - Pull fresh GitHub data + +```bash +.cursor/skills/review-curator-audio-pr/scripts/pr_review_pull.sh +``` + +Pulls six REST endpoints (`pr view`, `reviews`, inline `comments`, issue +`comments`, `files`, `commits`) and the GraphQL review threads, which carry the +`isResolved` / `isOutdated` flags the REST inline endpoint omits. You pull this +to see prior review activity, not because you own the PR. + +### Step 3 - Build the post-#1608 audio review corpus (required) + +Before you explain or judge anything, build the consolidated reviewer-comment +corpus so you can spot cross-PR repeated mistakes. Do **not** skip this step, +even if a prior corpus file exists on disk - rerun the pull (incremental by +default) and render a fresh consolidated file for today's review: + +```bash +.cursor/skills/review-curator-audio-pr/scripts/pull_audio_pr_corpus.sh --since 1608 +.cursor/skills/review-curator-audio-pr/scripts/build_corpus.py --today +``` + +Confirm `.curator-pr-review/audio-corpus/audio_pr_corpus_.md` exists and +read it before writing findings. If the pull scripts fail, stop and fix the +failure - do not proceed with a review that skipped the corpus. See +[knowledge-sources.md](knowledge-sources.md) section 4 for details. + +### Step 4 - Read the diff + +```bash +gh pr diff --repo NVIDIA-NeMo/Curator # works on a shallow clone +# optional, to run/inspect locally: gh pr checkout +``` + +`gh pr diff` fetches the patch straight from GitHub, so it works even on the +shallow clone from step 0. Always review the actual diff, never the comment text +alone. Use `main` as the baseline for "how the audio stages already do this". + +### Step 5 - Read the knowledge sources (ground yourself first) + +Before you explain or judge anything, read +[knowledge-sources.md](knowledge-sources.md) so your overview and findings are +grounded in the canonical material, not guesswork: + +- **section 0** - canonical docs: the audio stage developer guide + (`nemo_curator/stages/audio/README.md`), the developer-guide slides, the + published fern audio docs, `.cursor/rules/*.mdc`, and `CONTRIBUTING.md`. +- **section 1** - the audio code map, so you know where each changed file sits. +- **section 2** - the review lenses (you apply these in step 8). +- **section 4 + corpus file** - the post-#1608 reviewer-comment corpus from step 3. + +Open the specific sources the diff touches (e.g. the fern page and the code for +the stage being changed). Check the corpus for recurring themes and whether this +PR repeats feedback already raised on other audio PRs. Only after this grounding +do you write the overview (step 7). + +### Step 6 - Generate the context files + +```bash +.cursor/skills/review-curator-audio-pr/scripts/build_digest.py --today +``` + +The builder joins the pulled JSON and classifies each existing comment by thread +status: **OPEN** (still unresolved), **OUTDATED** (pre-dates the current head), +**RESOLVED**. Use the OPEN list to avoid re-raising points another reviewer +already made. + +### Step 7 - Summarize the PR in detail (present this first) + +Now that you have read the diff (step 4) and the knowledge sources (step 5), +write the **"What this PR does (overview)"** section of the digest so a reader +understands the change end to end: the problem it solves; the +main audio stages/files it adds or modifies (use the "Areas touched" table the +builder generates); key design decisions; new dependencies, config, or APIs; and +the blast radius (what could regress). Lead your review with this overview - the +author and other reviewers read it before any findings. + +### Step 8 - Review through the lenses, then write up and post your findings + +Apply the lenses in [knowledge-sources.md](knowledge-sources.md) section 2 to the +diff - each lens links the audio code, README section, and `.cursor/rules` +contract it governs: stage contracts, setup/teardown lifecycle, audio +optional-dependency hygiene (vLLM, NeMo, model utils), secret-safe logging, +waveform/tensor memory and manifest serialization, tarred/sharded audio I/O, +sample-rate and metadata propagation, streaming/throughput, tutorials/docs, +tests/coverage, and PR reviewability. Cite the linked source by name whenever +the PR deviates. + +Record each issue as a finding, classified by severity, and post them **after +the overview** as a PR review (inline comments for specific lines, a top-level +summary for the overall verdict). Each finding cites the exact `path:line-range` +on the current head and proposes a concrete fix. + +- **P0** - merge blocker: data loss / crash on a valid audio config (e.g. a + manifest writer that crashes on a kept waveform), a secret (HF token) leaked + into logs, or an import that breaks a supported install. +- **P1** - should fix before merge: undeclared audio runtime dependencies, + missing required tests/coverage, environment-specific I/O in a reusable + reader, a PR too large to review safely. +- **P2** - worth fixing: duplicated pipeline construction, inefficient per-task + writer batching, incomplete throughput metrics, documented-but-unwired config. +- **P3** - nice to have: temporary model/compat workarounds needing a test + + comment, debug-only knobs (e.g. per-shard utterance caps) to document. + +See [templates.md](templates.md) section C for how to phrase review comments. + +## What to check the PR against (CONTRIBUTING.md and rules) + +A compliant audio PR must satisfy these; flag any that are missing: + +- **DCO sign-off** on every commit (`git commit -s`). +- **Tests** accompany source changes; GPU-only audio tests use + `@pytest.mark.gpu`; the project enforces 80% coverage under `nemo_curator/`, + and `tests/stages/audio/` mirrors `nemo_curator/stages/audio/`. +- **NVIDIA Apache-2.0 copyright header** on every non-empty Python file. +- **Ruff** lint/format, 119-char lines; **loguru** for logging. +- Supported Python: 3.10-3.12. + +## Knowledge sources + +[knowledge-sources.md](knowledge-sources.md) is the single index: canonical docs +(audio README, developer-guide slides, `.cursor/rules`, tutorials, benchmarks), +the audio code map, the review lenses with per-concept code references, the +post-#1608 PR corpus workflow, and the GitHub/`gh` data reference. diff --git a/.cursor/skills/review-curator-audio-pr/knowledge-sources.md b/.cursor/skills/review-curator-audio-pr/knowledge-sources.md new file mode 100644 index 0000000000..5624b0d189 --- /dev/null +++ b/.cursor/skills/review-curator-audio-pr/knowledge-sources.md @@ -0,0 +1,343 @@ +# Audio PR review knowledge sources + +Read this before reviewing. It is the single index for an audio Curator review +and has five parts: + +0. **Canonical docs** - the authoritative references to ground every finding. +1. **Audio code map** - where each audio stage lives. +2. **Review lenses** - what to check, each linked to the audio code, README + section, and `.cursor/rules` contract it governs. +3. **Severity mapping** - P0-P3. +4. **Pre-review corpus (required)** - pull every audio PR after #1608 (open + + closed) with reviewer comments and read the consolidated file before you + write findings. +5. **GitHub data + scripts reference** - the `gh` endpoints and helper scripts. + +All paths are repo-relative. The skill is self-contained: this repository plus +the GitHub CLI (`gh`). Scope: the **audio** modality. + +--- + +## 0. Canonical docs (start here) + +| Source | What it gives you | +|--------|-------------------| +| `nemo_curator/stages/audio/README.md` | The audio stage **developer guide** (1000+ lines): CPU vs GPU stages, `process` / `process_batch`, `batch_size`, call chains, per-stage memory characteristics, end-to-end FLEURS & ALM traces, new-stage checklist. Read first. | +| Audio curation developer guide (slides): | Design intent and pipeline overviews behind the audio modality (the deck the README operationalizes). | +| `nemo_curator/tasks/audio_task.py` | `AudioTask(Task[dict])` + `_AttrDict` task model (one manifest entry per task; dict keys exposed as attributes for `validate_input`). | +| `nemo_curator/stages/base.py` | `ProcessingStage` contract: `inputs()`/`outputs()`, `process`/`process_batch`, `setup_on_node`/`setup`/`teardown`, `validate_input`. | +| `CONTRIBUTING.md` | DCO sign-off, PR process, required tests, coverage threshold, copyright header. | +| `.cursor/rules/*.mdc` | Always-on engineering contracts (table below). | +| `tutorials/audio/README.md` + per-pipeline READMEs | Runnable reference pipelines (alm, audio_pretrain, callhome_diar, fleurs, readspeech, single_speaker_filter, tagging). | +| `benchmarking/README.md`, `benchmarking/AUDIO_PROFILING.md`, `benchmarking/ALM_BENCHMARK.md` | How audio/ALM benchmarks are wired and the perf expectations to hold a PR to. | +| `tests/stages/audio/` | Test layout that mirrors `nemo_curator/stages/audio/`; the coverage bar. | + +### `.cursor/rules/` contracts + +| Rule | Covers | +|------|--------| +| `.cursor/rules/processing-stage-patterns.mdc` | `ProcessingStage` subclassing, `inputs()`/`outputs()`, `process()`, lifecycle hooks | +| `.cursor/rules/composite-stage-patterns.mdc` | `CompositeStage` decomposition (e.g. a reader = discovery + shard fan-out) | +| `.cursor/rules/task-patterns.mdc` | `Task` / typed task subclasses, metadata flow | +| `.cursor/rules/executors.mdc` | Xenna / Ray Data backends and when to use each | +| `.cursor/rules/resources-configuration.mdc` | `Resources`, `batch_size`, GPU allocation | +| `.cursor/rules/pipeline-structure.mdc` | Pipeline assembly | +| `.cursor/rules/modality-structure.mdc` | Where modality code/tests/tutorials live | +| `.cursor/rules/coding-standards.mdc` | Ruff, copyright header, loguru, pytest, Python versions | + +### Published audio documentation (fern docs site) + +The user-facing audio docs live under `fern/versions//pages/` (latest +versioned snapshot: **v26.04**; `fern/versions/main/` holds in-progress docs). +These are the canonical "what the docs promise" reference - cite them when a PR's +behavior, config, or API drifts from documented behavior. Paths below are +relative to `fern/versions/v26.04/pages/`. + +| Audio concept / stage | Fern doc(s) | Code it documents | +|------------------------|-------------|-------------------| +| Overview / get started | `get-started/audio.mdx`, `curate-audio/index.mdx` | - | +| Concepts (task model, pipelines, metrics) | `about/concepts/audio/{index,audio-task,curation-pipeline,asr-pipeline,alm-pipeline,manifests-ingest,quality-metrics,text-integration}.mdx` | `nemo_curator/tasks/audio_task.py` | +| Load data / manifests | `curate-audio/load-data/{index,fleurs-dataset,custom-manifests,local-files}.mdx` | `nemo_curator/stages/audio/datasets/`, `.../io/` | +| ASR inference | `curate-audio/process-data/asr-inference/{index,nemo-models}.mdx` | `nemo_curator/stages/audio/inference/asr/asr_nemo.py` | +| Audio analysis (duration, format) | `curate-audio/process-data/audio-analysis/{index,duration-calculation,format-validation}.mdx` | `nemo_curator/stages/audio/common.py` | +| Quality filtering (band, VAD, SigMOS, UTMOS, speaker separation, preprocessing, data-filter) | `curate-audio/process-data/quality-filtering/{index,band-filter,vad,sigmos,utmos,speaker-separation,preprocessing,audio-data-filter-stage}.mdx` | `.../filtering/`, `.../segmentation/`, `.../preprocessing/`, `.../advanced_pipelines/` | +| Quality assessment (WER, duration) | `curate-audio/process-data/quality-assessment/{index,wer-filtering,duration-filtering}.mdx` | `nemo_curator/stages/audio/metrics/wer.py`, `.../common.py` | +| ALM (data builder, overlap filtering) | `curate-audio/process-data/alm/{index,data-builder,overlap-filtering}.mdx` | `nemo_curator/stages/audio/alm/` | +| Text integration / tagging | `curate-audio/process-data/text-integration/index.mdx` | `nemo_curator/stages/audio/tagging/` | +| Save / export | `curate-audio/save-export.mdx` | `nemo_curator/stages/audio/io/convert.py` | +| Tutorials | `curate-audio/tutorials/{index,beginner,readspeech,alm}.mdx` | `tutorials/audio/` | +| API reference | `api-reference/tasks/audio-task.mdx`, `api-reference/processing-stage.mdx` | `nemo_curator/tasks/audio_task.py`, `nemo_curator/stages/base.py` | +| Infrastructure (memory, per-stage runtime, backends, GPU, resumable, monitoring) | `reference/infrastructure/{memory-management,per-stage-runtime,execution-backends,gpu-processing,resumable-processing,monitoring}.mdx` | `nemo_curator/backends/` | +| Release notes / migration (`AudioBatch -> AudioTask`, #1608) | `about/release-notes/{index,migration-guide,migration-faq}.mdx` | see PR #1608 | + +When a PR changes audio behavior, check whether the matching fern page above +needs updating in the same PR - docs drift is a common review finding. + +--- + +## 1. Audio code map (diff ground truth) + +`nemo_curator/stages/audio/` is organized as: + +| Area | Path(s) | +|------|---------| +| Task model | `nemo_curator/tasks/audio_task.py` | +| Base contract | `nemo_curator/stages/base.py` | +| Common CPU stages (duration, value filter) | `nemo_curator/stages/audio/common.py` | +| I/O (convert to DocumentBatch, segment extraction) | `nemo_curator/stages/audio/io/convert.py`, `nemo_curator/stages/audio/io/extract_segments.py` | +| Preprocessing | `nemo_curator/stages/audio/preprocessing/mono_conversion.py`, `nemo_curator/stages/audio/preprocessing/concatenation.py` | +| Inference - ASR | `nemo_curator/stages/audio/inference/asr/asr_nemo.py` | +| Inference - VAD | `nemo_curator/stages/audio/inference/vad/whisperx_vad.py` | +| Inference - speaker diarization | `nemo_curator/stages/audio/inference/speaker_diarization/sortformer.py`, `nemo_curator/stages/audio/inference/speaker_diarization/pyannote.py` | +| Filtering (band, SigMOS, UTMOS) | `nemo_curator/stages/audio/filtering/band.py`, `nemo_curator/stages/audio/filtering/sigmos.py`, `nemo_curator/stages/audio/filtering/utmos.py` | +| Segmentation | `nemo_curator/stages/audio/segmentation/vad_segmentation.py`, `nemo_curator/stages/audio/segmentation/speaker_separation.py` | +| Tagging (inference + text) | `nemo_curator/stages/audio/tagging/inference/nemo_asr_align.py`, `nemo_curator/stages/audio/tagging/merge_alignment_diarization.py`, `nemo_curator/stages/audio/tagging/resample_audio.py`, `nemo_curator/stages/audio/tagging/split.py`, `nemo_curator/stages/audio/tagging/text/itn.py` | +| Postprocessing | `nemo_curator/stages/audio/postprocessing/timestamp_mapper.py` | +| Metrics / throughput | `nemo_curator/stages/audio/metrics/wer.py`, `nemo_curator/stages/audio/metrics/bandwidth.py`, `nemo_curator/stages/audio/metrics/squim.py` | +| Datasets (FLEURS, ReadSpeech) | `nemo_curator/stages/audio/datasets/fleurs/create_initial_manifest.py`, `nemo_curator/stages/audio/datasets/readspeech/create_initial_manifest.py`, `nemo_curator/stages/audio/datasets/file_utils.py` | +| ALM (audio-language model data) | `nemo_curator/stages/audio/alm/alm_data_builder.py`, `nemo_curator/stages/audio/alm/alm_data_overlap.py`, `nemo_curator/stages/audio/alm/pretrain/pipeline.py`, `nemo_curator/stages/audio/alm/pretrain/io.py` | +| Composite pipelines | `nemo_curator/stages/audio/advanced_pipelines/audio_data_filter/audio_data_filter.py` | + +Backends (how stages run): `nemo_curator/backends/` (`base.py`, `xenna/`, +`ray_data/`, `ray_actor_pool/`); pipeline assembly: +`nemo_curator/pipeline/pipeline.py`; task types: `nemo_curator/tasks/`. + +--- + +## 2. Review lenses (each with code + doc references) + +Apply each lens to the diff and cite exact `path:line` evidence. When a change +violates a contract, link it by name. The **References** under each lens are the +canonical sources to cite. + +### 2.1 Stage contracts and reuse + +- An audio stage subclasses `ProcessingStage[AudioTask, AudioTask]` (or the + appropriate in/out task types) with explicit `inputs()` / `outputs()`, a + meaningful `name`, and appropriate `resources` / `batch_size` (set as class + attributes / dataclass fields, never the read-only `_name` / `_resources` / + `_batch_size` properties). Field order follows the base convention: `name` + first, then params, then `resources`, then `batch_size` last. +- GPU/model work (ASR, VAD, diarization, tagging inference) belongs in + `process_batch()` with a real `batch_size`; `__init__` stays lightweight (it + runs on the driver). Batch-only stages raise `NotImplementedError` from + `process()` (the convention used by dedup stages). +- Reuse existing audio readers, manifest writers, and shared model abstractions + instead of bespoke orchestration. Model a tarred/sharded reader as a + `CompositeStage` = a discovery stage + a shard-reader fan-out. +- If `inputs()` does not declare a required key (e.g. `audio_filepath_key`, + `duration_key`), runtime validation won't check it, and a direct + `task.data[key]` access then raises an opaque `KeyError` instead of a clear + validation error. Declare required keys. +- Fan-out shard readers need deterministic child IDs and must propagate + `_metadata` / `_stage_perf`; declare `ray_stage_spec()` with + `IS_FANOUT_STAGE: True`. First-stage shard discovery from an empty input + should constrain placement (e.g. one worker per node) so discovery isn't + duplicated across workers. +- **References:** `nemo_curator/stages/base.py`; `nemo_curator/tasks/audio_task.py`; `nemo_curator/stages/audio/common.py`; `nemo_curator/stages/audio/io/convert.py`; `nemo_curator/stages/audio/datasets/fleurs/create_initial_manifest.py`; rules `processing-stage-patterns.mdc`, `composite-stage-patterns.mdc`, `task-patterns.mdc`; README sections "Writing a CPU stage" / "Writing a GPU stage". + +### 2.2 Setup / teardown lifecycle and audio dependencies + +- `setup_on_node()` = once-per-node prep (model download / shared cache / output + truncation). `setup()` = per-worker init (model load, GPU allocation). + `teardown()` frees GPU memory. Don't conflate them; prefer `setup_on_node` for + anything that must run exactly once (avoids `_setup_done`-style guards). +- Import heavy/optional audio deps (NeMo, vLLM, whisperx, pyannote, soundfile, + model-specific utils) lazily inside `setup()` or behind a guarded import, + never unconditionally at module top level. A top-level `import ` + breaks `import nemo_curator.stages.audio...` on any install without that extra + (including CPU-only / Mac/ARM). +- Declare the full audio runtime dependency set in the appropriate + `pyproject.toml` optional extra (e.g. a CUDA extra); add an import smoke test. + Don't rely on an external Docker image to supply deps a reusable stage needs. +- Avoid hard version pins (`pkg==x.y.z`) and global overrides for libraries + shared across modalities (`transformers`, `huggingface-hub`, `accelerate`); + prefer compatible ranges. +- Expose model knobs like `cache_dir` and `trust_remote_code` as parameters + rather than hardcoding them. +- **References:** `nemo_curator/stages/audio/inference/asr/asr_nemo.py` (download-on-node / load-on-worker split, `cache_dir`); `nemo_curator/stages/audio/inference/vad/whisperx_vad.py`; `nemo_curator/stages/audio/inference/speaker_diarization/sortformer.py`, `pyannote.py`; `nemo_curator/stages/audio/filtering/{utmos,sigmos,band}.py`; `pyproject.toml`; rules `resources-configuration.mdc`; README "Stage lifecycle". + +### 2.3 Secret-safe logging + +- Never log a full resolved config; audio tutorials commonly accept an HF token + / cloud credentials. Redact secret-like keys before logging (token, password, + secret, credentials, `*_token`/`*_secret`), or log only a non-secret subset, + and add a test asserting the secret value never appears in logged output. +- **References:** `tutorials/audio/*/` (token/credential inputs); any stage that + logs configuration; rules `coding-standards.mdc` (loguru). + +### 2.4 Waveform/tensor memory and manifest serialization + +- Drop waveform arrays / audio tensors from `task.data` once a stage no longer + needs them. A leaked numpy `ndarray` / torch tensor that reaches a JSON + manifest writer raises `TypeError: Object of type ndarray is not JSON + serializable` and can fail an entire shard/batch, or cause OOM at scale. Strip + by waveform key plus a duck-typed guard (`.shape`/`.dtype`), and wrap + `json.dumps` so a stray non-serializable value reports the offending key. +- ALM stages that replace the whole dict must `task.data.clear()` + + `task.data.update(result)` to preserve the `_AttrDict` type. Remember + `_AttrDict.__setattr__` routes writes into the dict, so never stamp private + fields onto a task's data dict. +- Guard audio I/O edge cases: `tarfile.extractfile()` returns `None` for + non-regular members even after `isfile()` - check before `.read()`. Handle + resampling / channel-count / dtype mismatches explicitly. +- **References:** `nemo_curator/stages/audio/io/convert.py`; `nemo_curator/stages/audio/io/extract_segments.py`; `nemo_curator/stages/audio/alm/alm_data_builder.py`; `nemo_curator/stages/audio/preprocessing/{mono_conversion,concatenation}.py`; `nemo_curator/tasks/audio_task.py`; README "Memory characteristics"; fern `reference/infrastructure/{memory-management,per-stage-runtime}.mdx`. + +### 2.5 Tarred / sharded audio I/O + +- Keep infrastructure assumptions (specific object-store URLs, auth env vars, + cluster-local paths) out of the generic reader/writer. Make remote access + pluggable: accept local paths, ordinary cloud URIs, or a user-provided + opener/resolver, and isolate environment-specific support behind a small, + documented, optional component. For piped/streamed downloads use failure-aware + flags + retries so truncated streams fail loudly. +- Cache the fsspec filesystem object in `setup_on_node` rather than re-creating + it per `process()` call (a new HTTP connection per manifest entry is costly on + S3/GCS at scale). +- Manifest/shard writers that loop per task inside `process_batch()` reopen the + output repeatedly; group by shard and open each output once per batch. Preserve + `.done`/checkpoint markers so runs are resumable. +- **References:** `nemo_curator/stages/audio/io/extract_segments.py`; `nemo_curator/stages/audio/datasets/{fleurs,readspeech}/create_initial_manifest.py`; `nemo_curator/stages/audio/datasets/file_utils.py`; `nemo_curator/stages/audio/alm/pretrain/io.py`; rules `composite-stage-patterns.mdc`. + +### 2.6 Sample-rate and metadata propagation + +- Resampling/mono-conversion/segmentation stages must update the sample-rate and + duration keys they change and propagate `_metadata` / `_stage_perf`; downstream + stages that assume a fixed sample rate must declare it in `inputs()`. +- **References:** `nemo_curator/stages/audio/preprocessing/mono_conversion.py`; `nemo_curator/stages/audio/tagging/resample_audio.py`; `nemo_curator/stages/audio/segmentation/vad_segmentation.py`; `nemo_curator/stages/audio/postprocessing/timestamp_mapper.py`. + +### 2.7 Streaming, throughput, benchmarking + +- Streaming execution with per-stage resource configuration is the default model + for scalable audio pipelines. +- Emit stage metrics through the supported metrics path, JSON-shaped. For + multi-node audio runs include worker/node/GPU identity and latency + distributions (p50/p95/p99) plus audio-seconds-per-wall-second, not just + totals/averages. +- Standalone audio benchmark scripts belong in the `benchmarking/` flow with + config entries and comparable parameters - not freestanding. +- **References:** `nemo_curator/stages/audio/metrics/{wer,bandwidth,squim}.py`; `benchmarking/AUDIO_PROFILING.md`, `benchmarking/ALM_BENCHMARK.md`, `benchmarking/README.md`; rules `executors.mdc`; fern `reference/infrastructure/{execution-backends,per-stage-runtime,monitoring,gpu-processing}.mdx`. + +### 2.8 Tutorials and docs + +- Audio tutorials (`tutorials/audio/`) must be runnable, minimal, and use public + APIs only - no private APIs, no machine-local absolute paths, no committed + noisy notebook output, no executor mismatch between a notebook and its CLI. +- Don't duplicate pipeline-construction logic between a tutorial entrypoint and + an `examples/` script; extract one shared builder both call. +- Every config key documented for users must actually reach a stage; validate + config-to-stage mapping and fail (or warn) on unused keys. +- Don't commit generated analysis/scratch docs into the source tree. +- **References:** `tutorials/audio/README.md`; `tutorials/audio/{alm,audio_pretrain,callhome_diar,fleurs,readspeech,single_speaker_filter,tagging}/README.md`; `tutorials/audio/readspeech/readspeech_tutorial.ipynb`; `nemo_curator/stages/audio/README.md`; fern `curate-audio/tutorials/{index,beginner,readspeech,alm}.mdx`. + +### 2.9 Tests and standards (CONTRIBUTING + coding-standards) + +- Source changes need accompanying tests; `tests/stages/audio/` mirrors + `nemo_curator/stages/audio/`. The project enforces 80% coverage under + `nemo_curator/`. GPU-only audio tests use `@pytest.mark.gpu`; reuse fixtures + under `tests/fixtures/audio/` and sample data under `tests/data/audio/`. +- Every non-empty Python file carries the NVIDIA Apache-2.0 copyright header. +- Ruff clean at 119-char lines; type annotations on functions; loguru for logs. +- Every commit is DCO signed-off. +- **References:** `CONTRIBUTING.md`; rules `coding-standards.mdc`, `modality-structure.mdc`; `tests/stages/audio/`. + +### 2.10 PR size and reviewability + +- Large multi-stage audio PRs are hard to review and defend. Prefer splitting by + stage / module / tutorial / benchmark. If a single PR must remain, organize it + as if sliced: no duplicated builder logic, clear module ownership, a reviewable + commit sequence. + +### Smaller recurring nits + +- Missing copyright headers on new files. +- Top-level imports that should be lazy/guarded (very common on audio model modules). +- Long inline prompt/config strings that should be module constants. +- Dataclass fields validated in `__post_init__` that could just be typed. +- Dead conditionals / redundant truthiness checks. +- Pytests added for tutorials that don't warrant them. +- Debug-only knobs (e.g. `max_utterances_per_shard`) presented as features - document them as debug throttles. +- `trust_remote_code=True` hardcoded on a model load - expose as a parameter. + +--- + +## 3. Severity mapping + +| Severity | Meaning | +|----------|---------| +| P0 | Merge blocker: crash/data-loss on a valid audio config, secret leaked to logs, import breaks a supported install | +| P1 | Fix before merge: undeclared audio deps, missing required tests/coverage, env-specific I/O in reusable reader, PR too large | +| P2 | Should fix: duplicated pipeline construction, inefficient writer batching, incomplete throughput metrics, documented-but-unwired config | +| P3 | Nice to have: temporary model/compat workaround needing test+comment, debug knob docs | + +--- + +## 4. Pre-review corpus (required): learn from post-#1608 audio PRs + +PR [#1608](https://github.com/NVIDIA-NeMo/Curator/pull/1608) (`AudioBatch -> +AudioTask` redesign) reset the audio stage contracts. Reviewer feedback on audio +PRs **after** #1608 is the best predictor of what to flag next. **Before every +audio PR review**, build (or refresh) the consolidated corpus of that feedback. +Do not skip this step, even when a prior `audio_pr_corpus_*.md` already exists +on disk - rerun the pull (incremental by default) and render today's file: + +```bash +# 1) discover audio PRs after #1608 (open + closed/merged) and pull their +# reviews + inline comments into .curator-pr-review/audio-corpus/ +.cursor/skills/review-curator-audio-pr/scripts/pull_audio_pr_corpus.sh --since 1608 + +# 2) render one consolidated, reviewer-comment corpus +.cursor/skills/review-curator-audio-pr/scripts/build_corpus.py +``` + +`pull_audio_pr_corpus.sh` lists every PR with number > `--since` (PR numbers are +monotonic in time, so number > 1608 == opened after #1608), keeps the ones whose +changed files touch audio paths, and pulls each one's reviews, inline comments, +and issue comments. It is incremental - reruns skip PRs already on disk and only +fetch new ones (use `--refresh` to re-pull, e.g. to refresh open PRs). +`build_corpus.py` writes +`.curator-pr-review/audio-corpus/audio_pr_corpus_.md`: one section per +audio PR (number, title, state, author, link) with every reviewer comment +verbatim, anchored to `path:line`, plus a recurring-themes tally. + +Use the corpus to (a) recognize patterns reviewers repeatedly raise (the lenses +in section 2 came from exactly this), and (b) check whether the PR in front of +you repeats a mistake already called out elsewhere. It is read-only context; it +never auto-posts anything. If the corpus scripts fail or the consolidated file +is missing, stop the review and fix the failure before writing findings. + +--- + +## 5. GitHub data + scripts reference + +`scripts/ensure_repo.sh [CLONE_DIR]` - reuse an existing Curator checkout or +shallow-clone one; prints `CURATOR_REPO=` on its last line. + +`scripts/pr_review_pull.sh [--outdir DIR] [--repo OWNER/REPO]` pulls into the +scratch outdir: + +| File | Endpoint | +|------|----------| +| `pr_gh_latest.json` | `gh pr view --json ...` | +| `pr_reviews_latest.json` | `repos/:owner/:repo/pulls//reviews` | +| `pr_review_comments_latest.json` | `repos/:owner/:repo/pulls//comments` (inline) | +| `pr_issue_comments_latest.json` | `repos/:owner/:repo/issues//comments` | +| `pr_files_latest.json` | `repos/:owner/:repo/pulls//files` | +| `pr_commits_latest.json` | `repos/:owner/:repo/pulls//commits` | +| `pr_review_threads_latest.json` | GraphQL `pullRequest.reviewThreads` (isResolved/isOutdated) | + +The REST inline-comments endpoint does not expose resolve/outdate state; the +GraphQL thread payload does. The builder joins them by comment `databaseId` +(== REST `id`), falling back to a `(path, body-prefix)` match when an older +thread dump lacks `databaseId`. + +`scripts/build_digest.py [--outdir DIR] [--today YYYY-MM-DD] [--prev-head SHA] [--baseline-ts TS]` +renders the working digest + the prior-open-threads context file (you add your +own findings). + +`scripts/pull_audio_pr_corpus.sh [--since N] [--outdir DIR] [--repo OWNER/REPO] [--limit N] [--refresh]` +(incremental - skips PRs already on disk; `--refresh` re-pulls) and +`scripts/build_corpus.py [--outdir DIR] [--today YYYY-MM-DD]` **must** run before +every review to build the post-#1608 corpus (section 4; SKILL.md step 3). + +Default outdir: `.curator-pr-review/` (scratch; gitignored, safe to delete). diff --git a/.cursor/skills/review-curator-audio-pr/scripts/README.md b/.cursor/skills/review-curator-audio-pr/scripts/README.md new file mode 100644 index 0000000000..dd40c15ef0 --- /dev/null +++ b/.cursor/skills/review-curator-audio-pr/scripts/README.md @@ -0,0 +1,78 @@ +# Scripts - review-curator-audio-pr + +Helpers for a **reviewer** reviewing someone else's audio Curator PR. They pull +the PR (and prior audio review history) and summarise it as context; you write +the actual review findings. All require the GitHub CLI (`gh`) authenticated +against github.com, and write to a scratch directory (default +`.curator-pr-review/`) that is safe to delete; do not commit its contents. + +## ensure_repo.sh + +```bash +eval "$(.cursor/skills/review-curator-audio-pr/scripts/ensure_repo.sh | tail -1)" # CURATOR_REPO= +``` + +Avoids cloning the whole repo unnecessarily: reuses an existing checkout if you +are already inside one or there is one under the current directory tree +(bounded by `MAXDEPTH`, default 4); only then `git clone --depth 1` (no full +history) from . Prints +`CURATOR_REPO=` on the last line. + +## pr_review_pull.sh + +```bash +.cursor/skills/review-curator-audio-pr/scripts/pr_review_pull.sh [--outdir DIR] [--repo OWNER/REPO] +``` + +Pulls six REST endpoints (`pr view`, `reviews`, inline `comments`, issue +`comments`, `files`, `commits`) plus the GraphQL review threads into +`pr_*_latest.json` (and timestamped snapshots for delta analysis). +**Audio-only:** it pulls metadata + files first and aborts (exit 3) if the PR +touches no audio path, so the skill never reviews a non-audio PR. + +## build_digest.py + +```bash +.cursor/skills/review-curator-audio-pr/scripts/build_digest.py [--outdir DIR] [--today YYYY-MM-DD] [--prev-head SHA] [--baseline-ts TS] +``` + +Joins the `pr_*_latest.json` files and writes two context files: +`curator_pr_fresh_review_.md` (working digest: PR state, diff, and +existing reviews/comments with OPEN/OUTDATED/RESOLVED status, plus a placeholder +for your findings) and `curator_pr_github_comment_queue_.md` (the +still-open threads other reviewers left, so you don't duplicate them). + +- `--prev-head` sets the SHA in the "commits since prior reviewed head" header. +- `--baseline-ts ` (a prior pull's timestamp suffix) marks + comments/reviews/commits NEW vs already-seen. + +The thread join uses comment `databaseId` when present, else a +(path, body-prefix) fallback, so both GraphQL thread-dump shapes classify +correctly. It also re-checks the audio-only guard and aborts on a non-audio PR. + +## pull_audio_pr_corpus.sh + build_corpus.py (required pre-review corpus) + +Run these **before every audio PR review** (SKILL.md step 3). Do not skip even +when `.curator-pr-review/audio-corpus/` already has data - the pull is +incremental and `build_corpus.py` must render today's consolidated file. + +```bash +.cursor/skills/review-curator-audio-pr/scripts/pull_audio_pr_corpus.sh --since 1608 +.cursor/skills/review-curator-audio-pr/scripts/build_corpus.py --today +``` + +`pull_audio_pr_corpus.sh` lists every PR with number > `--since` (PR numbers are +monotonic in time, so this is "opened after that PR"), keeps the ones touching +audio paths, and pulls each one's reviews + inline + issue comments into +`.curator-pr-review/audio-corpus/`. It is **incremental**: PRs already pulled are +skipped (and non-audio candidates are remembered in `_non_audio_prs.txt`), so a +rerun only fetches PRs new since last time. Pass `--refresh` to re-pull +everything (e.g. to pick up new comments on still-open PRs). +`build_corpus.py` consolidates them into +`audio_pr_corpus_.md`: one section per audio PR with every reviewer comment +verbatim (anchored to path:line) plus a recurring-themes tally. Read-only +context; it posts nothing. See `../knowledge-sources.md` section 4. + +The single-PR scripts are **audio-only**: both `pr_review_pull.sh` and +`build_digest.py` abort on a PR that touches no audio path. The audio review +lenses live in `../knowledge-sources.md`. diff --git a/.cursor/skills/review-curator-audio-pr/scripts/build_corpus.py b/.cursor/skills/review-curator-audio-pr/scripts/build_corpus.py new file mode 100755 index 0000000000..2df94e3e78 --- /dev/null +++ b/.cursor/skills/review-curator-audio-pr/scripts/build_corpus.py @@ -0,0 +1,159 @@ +#!/usr/bin/env python3 +"""Consolidate the audio PR corpus pulled by pull_audio_pr_corpus.sh. + +Reads the per-PR JSON in the corpus dir and writes one markdown file with every +reviewer comment (verbatim, anchored to path:line) grouped by PR, plus a +recurring-themes keyword tally. Read-only context for pre-review; posts nothing. + +Usage: build_corpus.py [--outdir DIR] [--today YYYY-MM-DD] +""" +from __future__ import annotations + +import argparse +import datetime as dt +import json +import re +from pathlib import Path + +BOT_LOGINS = {"greptile-apps[bot]", "copy-pr-bot[bot]", "github-actions[bot]"} + +# (label, regex) recurring-theme buckets, mirroring knowledge-sources.md lenses. +THEMES = [ + ("setup/teardown lifecycle", r"setup_on_node|\bsetup\(|teardown|_setup_done"), + ("optional/lazy imports", r"top[- ]level import|lazy import|import .* fails|optional (dep|extra)"), + ("dependency declaration/pins", r"pyproject|optional[- ]?group|==|version pin|requirement"), + ("stage contract inputs/outputs", r"inputs\(\)|outputs\(\)|validate_input|NotImplementedError"), + ("batch_size / process_batch", r"batch_size|process_batch"), + ("memory / serialization", r"ndarray|json\.dumps|serializ|waveform|tensor|OOM|memory"), + ("fsspec / cloud I/O", r"fsspec|url_to_fs|s3|gcs|http"), + ("secrets / logging", r"token|secret|credential|password|redact"), + ("tests / coverage", r"\btest|coverage|pytest|fixture"), + ("copyright / lint", r"copyright|header|ruff|lint"), + ("naming / convention", r"naming|rename|convention|AudioTask|AudioBatch"), + ("trust_remote_code", r"trust_remote_code"), +] + + +def load(p: Path): + return json.loads(p.read_text()) if p.exists() else [] + + +def shorten(s: str, n: int = 1200) -> str: + s = (s or "").strip() + return s if len(s) <= n else s[:n] + " […]" + + +def main() -> None: + ap = argparse.ArgumentParser() + ap.add_argument("--outdir", default=".curator-pr-review/audio-corpus") + ap.add_argument("--today", default=dt.date.today().isoformat()) + args = ap.parse_args() + + outdir = Path(args.outdir) + nums_file = outdir / "_audio_pr_numbers.txt" + if not nums_file.exists(): + raise SystemExit(f"no {nums_file}; run pull_audio_pr_corpus.sh first") + numbers = [int(x) for x in nums_file.read_text().split() if x.strip()] + numbers.sort(reverse=True) + + date_us = args.today.replace("-", "_") + theme_counts = {label: 0 for label, _ in THEMES} + theme_rx = [(label, re.compile(rx, re.I)) for label, rx in THEMES] + + out: list[str] = [] + out.append(f"# Audio PR review corpus (post-#1608) - {args.today}\n") + out.append( + "Consolidated reviewer feedback on audio PRs opened after #1608 " + "(open + closed/merged). Read-only pre-review context: recognise the " + "patterns reviewers repeatedly raise, and check the PR in front of you " + "against them. Bot reviewers are marked `[bot]`.\n" + ) + out.append(f"Audio PRs in corpus: **{len(numbers)}** " + f"({', '.join('#' + str(n) for n in numbers)})\n") + + per_pr_sections: list[str] = [] + total_comments = 0 + for n in numbers: + gh = load(outdir / f"pr{n}_gh.json") + if isinstance(gh, list): + gh = gh[0] if gh else {} + reviews = load(outdir / f"pr{n}_reviews.json") + rcomments = load(outdir / f"pr{n}_review_comments.json") + icomments = load(outdir / f"pr{n}_issue_comments.json") + + author = (gh.get("author") or {}).get("login", "?") + state = gh.get("state", "?") + title = gh.get("title", "") + url = gh.get("url", f"https://github.com/NVIDIA-NeMo/Curator/pull/{n}") + + sec: list[str] = [] + sec.append(f"## PR #{n} - {title}\n") + sec.append(f"- state: **{state}** author: @{author} " + f"created: {gh.get('createdAt','?')} link: {url}\n") + + # Review summaries (non-empty bodies only). + rev_bodies = [r for r in reviews if (r.get("body") or "").strip()] + if rev_bodies: + sec.append("### Review summaries\n") + for r in rev_bodies: + login = (r.get("user") or {}).get("login", "?") + bot = " `[bot]`" if login in BOT_LOGINS else "" + sec.append(f"- **@{login}{bot}** [{r.get('state','')}] " + f"{r.get('submitted_at','')}:\n") + sec.append(f" > {shorten(r.get('body'))}\n") + for label, rx in theme_rx: + if rx.search(r.get("body") or ""): + theme_counts[label] += 1 + + # Inline comments grouped by file. + by_file: dict[str, list] = {} + for c in rcomments: + by_file.setdefault(c.get("path", "?"), []).append(c) + if by_file: + sec.append("### Inline review comments\n") + for path in sorted(by_file): + sec.append(f"#### `{path}`\n") + for c in sorted(by_file[path], + key=lambda x: (x.get("line") or x.get("original_line") or 0)): + login = (c.get("user") or {}).get("login", "?") + bot = " `[bot]`" if login in BOT_LOGINS else "" + line = c.get("line") or c.get("original_line") or "?" + body = c.get("body") or "" + total_comments += 1 + sec.append(f"- **@{login}{bot}** line {line} " + f"([link]({c.get('html_url','')})):\n") + sec.append(f" > {shorten(body)}\n") + for label, rx in theme_rx: + if rx.search(body): + theme_counts[label] += 1 + + # Top-level issue comments (skip pure CI noise). + human_ic = [c for c in icomments + if (c.get("user") or {}).get("login") not in BOT_LOGINS + and (c.get("body") or "").strip()] + if human_ic: + sec.append("### Discussion (top-level)\n") + for c in human_ic: + login = (c.get("user") or {}).get("login", "?") + sec.append(f"- **@{login}** {c.get('created_at','')}: " + f"{shorten(c.get('body'), 600)}\n") + + per_pr_sections.append("\n".join(sec)) + + # Theme tally up top. + out.append("## Recurring themes (comment hits across the corpus)\n") + out.append("| Theme | Comments mentioning it |") + out.append("|-------|------------------------|") + for label, _ in THEMES: + out.append(f"| {label} | {theme_counts[label]} |") + out.append(f"\nTotal inline review comments scanned: **{total_comments}**\n") + out.append("---\n") + out.extend(per_pr_sections) + + outpath = outdir / f"audio_pr_corpus_{date_us}.md" + outpath.write_text("\n".join(out) + "\n", encoding="utf-8") + print(f"wrote {outpath} ({outpath.stat().st_size} bytes; {len(numbers)} PRs)") + + +if __name__ == "__main__": + main() diff --git a/.cursor/skills/review-curator-audio-pr/scripts/build_digest.py b/.cursor/skills/review-curator-audio-pr/scripts/build_digest.py new file mode 100755 index 0000000000..7b920acd18 --- /dev/null +++ b/.cursor/skills/review-curator-audio-pr/scripts/build_digest.py @@ -0,0 +1,353 @@ +#!/usr/bin/env python3 +"""Build a Curator PR fresh-review digest + open-comment queue. + +Usage: build_digest.py [--outdir DIR] [--today YYYY-MM-DD] + [--prev-head SHA] [--baseline-ts TS] + +Reads the pr_*_latest.json files written by pr_review_pull.sh from --outdir +(default .curator-pr-review) and writes into the same directory: + curator_pr_fresh_review_.md full digest + curator_pr_github_comment_queue_.md open-thread paste queue + +The GraphQL thread payload is joined to REST inline comments by `databaseId` +when present, otherwise by a (path, body-prefix) match, so both thread-dump +shapes classify correctly. Pass --baseline-ts (a prior pull's timestamp +suffix) to mark comments / reviews / commits NEW vs already-seen. +""" +from __future__ import annotations + +import argparse +import datetime as dt +import json +import re +from pathlib import Path + +BODY_KEY_LEN = 120 + +# This skill is audio-only; a PR must touch at least one of these paths. +AUDIO_RE = re.compile( + r"^(nemo_curator/stages/audio/|nemo_curator/tasks/audio_task\.py|" + r"tutorials/audio/|tests/stages/audio/|tests/tasks/test_audio|" + r"benchmarking/.*([Aa]udio|ALM|alm))" +) + + +def load(path: Path) -> object: + return json.loads(path.read_text()) + + +def load_baseline_ids(baseline, key): + if baseline is None or not baseline.exists(): + return set() + return {entry[key] for entry in load(baseline) if key in entry} + + +def build_thread_index(threads_payload): + """Return {by_dbid, by_pathbody, nthreads}, robust to both GraphQL shapes.""" + idx = {"by_dbid": {}, "by_pathbody": {}, "nthreads": 0} + try: + nodes = threads_payload["data"]["repository"]["pullRequest"]["reviewThreads"]["nodes"] + except (KeyError, TypeError): + return idx + idx["nthreads"] = len(nodes) + for thread in nodes: + cnodes = thread["comments"]["nodes"] + path = thread.get("path") + for pos, c in enumerate(cnodes): + meta = { + "thread_id": thread.get("id"), + "is_resolved": thread.get("isResolved", False), + "is_outdated": thread.get("isOutdated", False), + "thread_comment_count": len(cnodes), + "is_first_in_thread": pos == 0, + } + db = c.get("databaseId") + if db is not None: + idx["by_dbid"][int(db)] = meta + body = (c.get("body") or "").strip() + if body: + idx["by_pathbody"].setdefault((path, body[:BODY_KEY_LEN]), meta) + return idx + + +def thread_meta(idx, c): + m = idx["by_dbid"].get(c["id"]) + if m is not None: + return m + body = (c.get("body") or "").strip() + return idx["by_pathbody"].get((c.get("path"), body[:BODY_KEY_LEN])) + + +def shorten(body, n=600): + body = (body or "").strip() + if len(body) <= n: + return body + return body[:n].rstrip() + "\n[...truncated...]" + + +def status_of(meta): + if meta is None: + return "ORPHAN" + if meta["is_resolved"]: + return "RESOLVED" + if meta["is_outdated"]: + return "OUTDATED" + return "OPEN" + + +def p(parts, *lines): + parts.extend(lines) + + +def area_of(path): + """Group a changed file into a coarse audio area for the overview table.""" + parts = path.split("/") + if path.startswith("nemo_curator/stages/audio/") and len(parts) >= 4: + return f"stages/audio/{parts[3]}" + if path.startswith("nemo_curator/tasks/"): + return "tasks" + if path.startswith("tutorials/audio/") and len(parts) >= 3: + return f"tutorials/audio/{parts[2]}" + if path.startswith("tests/"): + return "tests" + if path.startswith("benchmarking/"): + return "benchmarking" + if path.startswith("fern/"): + return "docs (fern)" + return parts[0] if parts else path + + +def build(pr, outdir, today, baseline_ts, prev_head): + def latest(kind): + return outdir / f"pr{pr}_{kind}_latest.json" + + gh = load(latest("gh")) + reviews = load(latest("reviews")) + review_comments = load(latest("review_comments")) + issue_comments = load(latest("issue_comments")) + files = load(latest("files")) + commits = load(latest("commits")) + + if files and not any(AUDIO_RE.search(f.get("filename", "")) for f in files): + raise SystemExit( + f"PR {pr} touches no audio path; review-curator-audio-pr is audio-only. " + "Aborting (audio paths: nemo_curator/stages/audio/, " + "nemo_curator/tasks/audio_task.py, tutorials/audio/, " + "tests/stages/audio/, audio benchmarks)." + ) + threads_path = latest("review_threads") + idx = build_thread_index(load(threads_path)) if threads_path.exists() else build_thread_index({}) + + def base(kind): + return outdir / f"pr{pr}_{kind}_{baseline_ts}.json" if baseline_ts else None + + base_comment_ids = load_baseline_ids(base("review_comments"), "id") + base_review_ids = load_baseline_ids(base("reviews"), "id") + base_issue_ids = load_baseline_ids(base("issue_comments"), "id") + base_commit_shas = set() + bc = base("commits") + if bc and bc.exists(): + base_commit_shas = {c["sha"] for c in load(bc) if "sha" in c} + + head_oid = gh.get("headRefOid", "") + base_oid = gh.get("baseRefOid", "") + prev8 = (prev_head or "")[:8] or "(none)" + date_us = today.replace("-", "_") + + # ----- DIGEST ----- + d = [] + p(d, f"# Curator PR {pr} Fresh Review - {today}", "") + p(d, f"Review target: https://github.com/NVIDIA-NeMo/Curator/pull/{pr}", "") + p(d, f"Current PR head reviewed: `{head_oid}`", "") + p(d, f"Base recorded by GitHub metadata: `{base_oid}`", "") + if prev_head: + p(d, f"Previous reviewed head: `{prev_head}`", "") + p(d, "") + + # ----- PR OVERVIEW (what the PR is about; present BEFORE any review) ----- + author = (gh.get("author") or {}).get("login", "?") + p(d, "## What this PR does (overview)", "") + p(d, f"**{gh.get('title','')}** - by @{author} " + f"({gh.get('state')}, +{gh.get('additions')}/-{gh.get('deletions')} " + f"across {gh.get('changedFiles')} files, {len(commits)} commits)", "") + + p(d, "### Author's description", "") + pr_body = (gh.get("body") or "").strip() + p(d, pr_body if pr_body else "_(no PR description provided)_", "") + + p(d, "### Areas touched", "") + areas = {} + for f in files: + e = areas.setdefault(area_of(f["filename"]), {"n": 0, "add": 0, "dele": 0}) + e["n"] += 1 + e["add"] += f.get("additions", 0) + e["dele"] += f.get("deletions", 0) + p(d, "| Area | files | +/- |", "|---|---|---|") + for a in sorted(areas): + e = areas[a] + p(d, f"| {a} | {e['n']} | +{e['add']} -{e['dele']} |") + p(d, "") + + p(d, "### Plain-language summary (reviewer writes this BEFORE any findings)", "") + p(d, "_After reading the knowledge sources (knowledge-sources.md), explain in " + "detail, in your own words, what this PR changes and why: the " + "problem it solves, the main audio stages/files it adds or modifies, key " + "design decisions, new dependencies/config/APIs, and the blast radius " + "(what could regress). Present this overview first; only then move to the " + "findings below._", "") + + p(d, "## PR state at review time", "") + p(d, "| Field | Value |", "|---|---|") + p(d, f"| state | {gh.get('state')} |") + p(d, f"| isDraft | {gh.get('isDraft')} |") + p(d, f"| reviewDecision | {gh.get('reviewDecision') or '(none)'} |") + p(d, f"| mergeable / mergeStateStatus | {gh.get('mergeable')} / {gh.get('mergeStateStatus')} |") + p(d, f"| changedFiles | {gh.get('changedFiles')} |") + p(d, f"| additions / deletions | +{gh.get('additions')} / -{gh.get('deletions')} |") + p(d, f"| commits | {len(commits)} |") + p(d, f"| inline review comments total | {len(review_comments)} |") + p(d, f"| reviews submitted | {len(reviews)} |") + p(d, f"| top-level issue comments | {len(issue_comments)} |") + p(d, f"| updatedAt | {gh.get('updatedAt')} |", "") + + new_commits = [c for c in commits if c["sha"] not in base_commit_shas] if base_commit_shas else commits + p(d, f"## Commits since prior reviewed head `{prev8}` ({len(new_commits)} new)", "") + p(d, "```") + for c in new_commits: + sha = c["sha"][:8] + msg = (c["commit"]["message"] or "").splitlines()[0] + cdate = (c["commit"]["author"] or {}).get("date", "") + p(d, f"{sha} {msg} ({cdate[:10]})") + p(d, "```", "") + + p(d, "Files changed in current PR head (`git diff ..`):", "") + p(d, "| File | +/- | status |", "|---|---|---|") + for f in sorted(files, key=lambda x: x["filename"]): + p(d, f"| `{f['filename']}` | +{f['additions']} -{f['deletions']} | {f['status']} |") + p(d, "") + + p(d, "## Existing reviews (by other reviewers)", "") + review_by_id = {r["id"]: r for r in reviews} + for r in sorted(reviews, key=lambda x: x.get("submitted_at") or ""): + marker = " (NEW)" if base_review_ids and r["id"] not in base_review_ids else "" + commit = (r.get("commit_id") or "")[:8] or "n/a" + p(d, f"### #{r['id']} by @{r['user']['login']} state={r['state']} " + f"commit={commit} submitted={r.get('submitted_at')}{marker}") + body = (r.get("body") or "").strip() + if body: + p(d, "", shorten(body, 1500)) + p(d, "") + + p(d, "## Existing inline comments (by other reviewers)", "") + p(d, f"Total: {len(review_comments)} comments across {idx['nthreads'] or '?'} threads.", "") + by_status = {"OPEN": [], "OUTDATED": [], "RESOLVED": [], "ORPHAN": []} + for c in review_comments: + by_status[status_of(thread_meta(idx, c))].append(c) + for s in ("OPEN", "OUTDATED", "RESOLVED", "ORPHAN"): + p(d, f"- {s}: {len(by_status[s])}") + p(d, "") + + by_path = {} + for c in review_comments: + by_path.setdefault(c["path"], []).append(c) + for path in sorted(by_path): + p(d, f"### `{path}`", "") + for c in sorted(by_path[path], key=lambda x: (x.get("line") or x.get("original_line") or 0, x["created_at"])): + meta = thread_meta(idx, c) + status = status_of(meta) + new = " NEW" if base_comment_ids and c["id"] not in base_comment_ids else "" + line = c.get("line") or c.get("original_line") + review = review_by_id.get(c.get("pull_request_review_id"), {}) + commit = (c.get("commit_id") or "")[:8] + in_reply = c.get("in_reply_to_id") + reply_marker = f" reply->{in_reply}" if in_reply else "" + tcount = meta["thread_comment_count"] if meta else 1 + tpos = " (root)" if meta and meta["is_first_in_thread"] else (" (reply)" if meta else "") + p(d, f"- **#{c['id']}** @{c['user']['login']} {c['created_at']} " + f"line={line} commit={commit} status=**{status}**{new} " + f"thread_comments={tcount}{tpos} review_state={review.get('state', '?')}{reply_marker}") + p(d, f" url: {c['html_url']}") + body = shorten(c.get("body") or "", 700).replace("\n", "\n > ") + p(d, f" > {body}", "") + p(d, "") + + p(d, "## Existing issue comments (by other reviewers)", "") + for c in sorted(issue_comments, key=lambda x: x["created_at"]): + new = " NEW" if base_issue_ids and c["id"] not in base_issue_ids else "" + p(d, f"### #{c['id']} by @{c['user']['login']} {c['created_at']}{new}", "") + p(d, shorten(c.get("body") or "", 1500), "") + + p(d, "## My findings (your review)", "") + p(d, "_Add your findings here as you review. Classify each P0-P3, cite " + "path:line on the current head, and propose a concrete fix. See " + "templates.md section C._", "") + p(d, "## Verdict", "") + p(d, "_APPROVE / COMMENT / REQUEST CHANGES + blockers._", "") + + digest_path = outdir / f"curator_pr{pr}_fresh_review_{date_us}.md" + digest_path.write_text("\n".join(d) + "\n", encoding="utf-8") + print(f"wrote {digest_path} ({digest_path.stat().st_size} bytes)") + + # ----- COMMENT QUEUE (open root threads only) ----- + q = [] + p(q, f"# Curator PR {pr} Open Review Threads - {today}", "") + p(q, f"Review target: https://github.com/NVIDIA-NeMo/Curator/pull/{pr}", "") + p(q, f"Current PR head: `{head_oid}`", "") + p(q, "Threads other reviewers left that are still unresolved on the current " + "head. Scan these before adding your own comments so you do not duplicate, " + "and check whether the author addressed them. Stale (outdated/resolved) " + "threads are listed at the bottom for context only.", "") + + def is_root(c): + return (thread_meta(idx, c) or {}).get("is_first_in_thread", True) + + open_root = [c for c in review_comments if status_of(thread_meta(idx, c)) == "OPEN" and is_root(c)] + open_root.sort(key=lambda c: (c["path"], c.get("line") or c.get("original_line") or 0, c["created_at"])) + for i, c in enumerate(open_root, 1): + meta = thread_meta(idx, c) or {} + line = c.get("line") or c.get("original_line") + new = " NEW" if base_comment_ids and c["id"] not in base_comment_ids else "" + p(q, f"## Thread {i} - {c['path']}:{line} by @{c['user']['login']}{new}", "") + p(q, f"Thread: {c['html_url']}", "") + if meta.get("thread_comment_count", 1) > 1: + p(q, f"Thread has {meta['thread_comment_count']} comments - see the digest for the full thread.", "") + p(q, f"File: `{c['path']}`", "") + p(q, f"Line: `{line}`", "") + p(q, "Reviewer text:", "") + p(q, "```text", (c.get("body") or "").strip(), "```", "") + + p(q, "## Stale (outdated/resolved) comments", "") + p(q, "These threads pre-date the current head and are outdated/resolved. Listed " + "for traceability only - no action needed unless a reviewer reopens.", "") + stale = [c for c in review_comments + if status_of(thread_meta(idx, c)) in ("OUTDATED", "RESOLVED") and is_root(c)] + stale.sort(key=lambda c: (status_of(thread_meta(idx, c)), c["path"], c.get("original_line") or 0)) + for c in stale: + line = c.get("line") or c.get("original_line") + body = shorten((c.get("body") or "").strip(), 200).replace("\n", " ") + p(q, f"- [{status_of(thread_meta(idx, c))}] `{c['path']}:{line}` " + f"@{c['user']['login']} ({c['html_url']}): {body}") + p(q, "") + + queue_path = outdir / f"curator_pr{pr}_github_comment_queue_{date_us}.md" + queue_path.write_text("\n".join(q) + "\n", encoding="utf-8") + print(f"wrote {queue_path} ({queue_path.stat().st_size} bytes)") + + +def main(): + ap = argparse.ArgumentParser() + ap.add_argument("pr", type=int, help="PR number") + ap.add_argument("--outdir", default=".curator-pr-review", + help="directory holding pr_*_latest.json (default .curator-pr-review)") + ap.add_argument("--today", default=None, help="ISO date for filenames; default today UTC") + ap.add_argument("--baseline-ts", default=None, + help="timestamp suffix of a prior pull to mark NEW vs already-seen") + ap.add_argument("--prev-head", default=None, help="prior reviewed head SHA for the commit delta") + args = ap.parse_args() + today = args.today or dt.datetime.utcnow().date().isoformat() + build(args.pr, Path(args.outdir), today, args.baseline_ts, args.prev_head) + return 0 + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git a/.cursor/skills/review-curator-audio-pr/scripts/ensure_repo.sh b/.cursor/skills/review-curator-audio-pr/scripts/ensure_repo.sh new file mode 100755 index 0000000000..4b9c914c39 --- /dev/null +++ b/.cursor/skills/review-curator-audio-pr/scripts/ensure_repo.sh @@ -0,0 +1,57 @@ +#!/usr/bin/env bash +# Make a local NVIDIA-NeMo/Curator checkout available WITHOUT cloning the whole +# repo unnecessarily. +# +# Strategy (cheapest first): +# 1. If we are already inside a Curator checkout, use it. +# 2. Else search the current directory tree (bounded depth) for an existing +# Curator checkout and reuse it. +# 3. Else shallow-clone (--depth 1, no full history) into ./Curator. +# +# Prints diagnostics to stderr and, on the LAST stdout line: +# CURATOR_REPO= +# +# Usage: ensure_repo.sh [CLONE_DIR] (default CLONE_DIR=Curator) +# MAXDEPTH= bound the downward search (default 4) +# +# Clone source: https://github.com/NVIDIA-NeMo/Curator +set -euo pipefail + +REPO_URL="https://github.com/NVIDIA-NeMo/Curator.git" +CLONE_DIR="${1:-Curator}" +MAXDEPTH="${MAXDEPTH:-4}" +MARKER="nemo_curator/stages/audio/README.md" + +is_curator() { [[ -f "$1/${MARKER}" && -d "$1/.cursor/rules" ]]; } +abspath() { (cd "$1" && pwd); } + +# 1) Already inside a Curator checkout? +if top="$(git rev-parse --show-toplevel 2>/dev/null)" && is_curator "${top}"; then + echo "ensure_repo: already inside Curator checkout (${top}); not cloning." >&2 + echo "CURATOR_REPO=${top}"; exit 0 +fi +if is_curator "."; then + echo "ensure_repo: current directory is a Curator checkout; not cloning." >&2 + echo "CURATOR_REPO=$(abspath .)"; exit 0 +fi + +# 2) Existing checkout somewhere under the current tree? +while IFS= read -r marker; do + cand="${marker%/${MARKER}}" + if is_curator "${cand}"; then + echo "ensure_repo: reusing existing checkout at ${cand}; not cloning." >&2 + echo "CURATOR_REPO=$(abspath "${cand}")"; exit 0 + fi +done < <(find . -maxdepth "${MAXDEPTH}" -path "*/${MARKER}" 2>/dev/null) + +# 3) Reuse a prior clone dir if present, else shallow-clone. +if is_curator "${CLONE_DIR}"; then + echo "ensure_repo: reusing ${CLONE_DIR}; not cloning." >&2 + echo "CURATOR_REPO=$(abspath "${CLONE_DIR}")"; exit 0 +fi + +command -v git >/dev/null || { echo "ensure_repo: git not found" >&2; exit 2; } +echo "ensure_repo: no local checkout found; shallow-cloning ${REPO_URL} -> ${CLONE_DIR}" >&2 +git clone --depth 1 "${REPO_URL}" "${CLONE_DIR}" +is_curator "${CLONE_DIR}" || { echo "ensure_repo: clone did not produce a valid checkout" >&2; exit 1; } +echo "CURATOR_REPO=$(abspath "${CLONE_DIR}")" diff --git a/.cursor/skills/review-curator-audio-pr/scripts/pr_review_pull.sh b/.cursor/skills/review-curator-audio-pr/scripts/pr_review_pull.sh new file mode 100755 index 0000000000..9092688f89 --- /dev/null +++ b/.cursor/skills/review-curator-audio-pr/scripts/pr_review_pull.sh @@ -0,0 +1,109 @@ +#!/usr/bin/env bash +# Fetch GitHub data for an NVIDIA-NeMo/Curator PR review. +# +# Usage: pr_review_pull.sh [--outdir DIR] [--repo OWNER/REPO] +# +# Writes pr_*_latest.json (consumed by build_digest.py) plus timestamped +# snapshots so a prior pull is preserved for delta analysis. Pulls six REST +# endpoints and the GraphQL review threads (which carry isResolved/isOutdated). +# +# Requires the GitHub CLI (`gh`) authenticated against github.com. +set -euo pipefail + +PR="" +REPO="NVIDIA-NeMo/Curator" +OUTDIR=".curator-pr-review" + +while [[ $# -gt 0 ]]; do + case "$1" in + --outdir) OUTDIR="$2"; shift 2 ;; + --repo) REPO="$2"; shift 2 ;; + -h|--help) + echo "Usage: pr_review_pull.sh [--outdir DIR] [--repo OWNER/REPO]"; exit 0 ;; + *) PR="$1"; shift ;; + esac +done +[[ -n "${PR}" ]] || { echo "error: PR number required" >&2; exit 2; } + +command -v gh >/dev/null || { echo "error: gh (GitHub CLI) not found" >&2; exit 2; } + +mkdir -p "${OUTDIR}" +TS="$(date -u +%Y%m%dT%H%M%SZ)" +LOG="${OUTDIR}/pr${PR}_review_pull_${TS}.log" + +{ + echo "=== Pull start ${TS} PR=${PR} repo=${REPO} ===" + gh --version | head -1 +} | tee -a "${LOG}" + +pull_endpoint() { + local label="$1"; shift + local outfile="$1"; shift + echo "--- ${label} -> ${outfile} ---" | tee -a "${LOG}" + "$@" > "${outfile}" + printf 'bytes=%s\n\n' "$(stat -c%s "${outfile}" 2>/dev/null || wc -c < "${outfile}")" | tee -a "${LOG}" +} + +GH_FIELDS="number,title,state,isDraft,mergeable,mergeStateStatus,headRefName,headRefOid,baseRefName,baseRefOid,additions,deletions,changedFiles,commits,reviewDecision,reviewRequests,labels,milestone,createdAt,updatedAt,closedAt,mergedAt,author,body,url,statusCheckRollup" + +pull_endpoint "pr ${PR} metadata" "${OUTDIR}/pr${PR}_gh_${TS}.json" \ + gh pr view "${PR}" --repo "${REPO}" --json "${GH_FIELDS}" +pull_endpoint "pulls/${PR}/files" "${OUTDIR}/pr${PR}_files_${TS}.json" \ + gh api --paginate "repos/${REPO}/pulls/${PR}/files" + +# This skill is audio-only: abort before pulling anything else if the PR's +# changed files touch no audio path. +AUDIO_RE='^(nemo_curator/stages/audio/|nemo_curator/tasks/audio_task\.py|tutorials/audio/|tests/stages/audio/|tests/tasks/test_audio|benchmarking/.*([Aa]udio|ALM|alm))' +if ! AUDIO_RE="${AUDIO_RE}" python3 - "${OUTDIR}/pr${PR}_files_${TS}.json" <<'PY' +import json, os, re, sys +files = [f.get("filename", "") for f in json.load(open(sys.argv[1]))] +rx = re.compile(os.environ["AUDIO_RE"]) +sys.exit(0 if any(rx.search(f) for f in files) else 1) +PY +then + echo "error: PR ${PR} touches no audio path; review-curator-audio-pr is audio-only." >&2 + echo " audio paths: nemo_curator/stages/audio/, nemo_curator/tasks/audio_task.py," >&2 + echo " tutorials/audio/, tests/stages/audio/, audio benchmarks. Aborting." >&2 + exit 3 +fi + +pull_endpoint "pulls/${PR}/reviews" "${OUTDIR}/pr${PR}_reviews_${TS}.json" \ + gh api --paginate "repos/${REPO}/pulls/${PR}/reviews" +pull_endpoint "pulls/${PR}/comments (inline)" "${OUTDIR}/pr${PR}_review_comments_${TS}.json" \ + gh api --paginate "repos/${REPO}/pulls/${PR}/comments" +pull_endpoint "issues/${PR}/comments (top-level)" "${OUTDIR}/pr${PR}_issue_comments_${TS}.json" \ + gh api --paginate "repos/${REPO}/issues/${PR}/comments" +pull_endpoint "pulls/${PR}/commits" "${OUTDIR}/pr${PR}_commits_${TS}.json" \ + gh api --paginate "repos/${REPO}/pulls/${PR}/commits" + +OWNER="${REPO%%/*}" +NAME="${REPO##*/}" +pull_endpoint "graphql reviewThreads" "${OUTDIR}/pr${PR}_review_threads_${TS}.json" \ + gh api graphql \ + -f query='query($owner:String!,$repo:String!,$pr:Int!){ repository(owner:$owner,name:$repo){ pullRequest(number:$pr){ reviewThreads(first:100){ nodes{ id isResolved isOutdated isCollapsed line originalLine path comments(first:50){ nodes{ databaseId } } } } } } }' \ + -f owner="${OWNER}" -f repo="${NAME}" -F pr="${PR}" + +for kind in gh reviews review_comments issue_comments files commits review_threads; do + cp -f "${OUTDIR}/pr${PR}_${kind}_${TS}.json" "${OUTDIR}/pr${PR}_${kind}_latest.json" +done + +{ + echo "--- counts ---" + for kind in reviews review_comments issue_comments files commits; do + f="${OUTDIR}/pr${PR}_${kind}_${TS}.json" + n=$(python3 -c "import json,sys; d=json.load(open(sys.argv[1])); print(len(d) if isinstance(d,list) else 1)" "${f}") + printf 'pr%s_%s: %s entries\n' "${PR}" "${kind}" "${n}" + done + echo "--- head SHA / activity ---" + python3 - "${OUTDIR}/pr${PR}_gh_latest.json" <<'PY' +import json, sys +gh = json.loads(open(sys.argv[1]).read()) +print(f"head_oid: {gh.get('headRefOid')}") +print(f"base_oid: {gh.get('baseRefOid')}") +print(f"state: {gh.get('state')} draft={gh.get('isDraft')} mergeStatus={gh.get('mergeStateStatus')} reviewDecision={gh.get('reviewDecision')}") +print(f"counts: files={gh.get('changedFiles')} +{gh.get('additions')}/-{gh.get('deletions')}") +print(f"updated_at: {gh.get('updatedAt')}") +PY +} | tee -a "${LOG}" + +echo "PR${PR}_REVIEW_PULL_DONE outdir=${OUTDIR} log=${LOG}" | tee -a "${LOG}" diff --git a/.cursor/skills/review-curator-audio-pr/scripts/pull_audio_pr_corpus.sh b/.cursor/skills/review-curator-audio-pr/scripts/pull_audio_pr_corpus.sh new file mode 100755 index 0000000000..a88cdb8659 --- /dev/null +++ b/.cursor/skills/review-curator-audio-pr/scripts/pull_audio_pr_corpus.sh @@ -0,0 +1,120 @@ +#!/usr/bin/env bash +# Discover audio-modality PRs opened AFTER a baseline PR (#1608 by default), +# open or closed/merged, and pull each one's reviews + comments into a corpus +# directory. Consolidate with build_corpus.py. +# +# PR numbers are monotonic in time, so "number > --since" == "opened after that +# PR". We then keep only PRs whose changed files touch audio paths. +# +# Incremental by default: PRs already pulled into OUTDIR are skipped, so reruns +# only fetch PRs that are new since last time. Pass --refresh to re-pull +# everything (e.g. to pick up new comments on open PRs). +# +# Usage: pull_audio_pr_corpus.sh [--since N] [--outdir DIR] [--repo OWNER/REPO] [--limit N] [--refresh] +# +# Requires the GitHub CLI (`gh`) authenticated against github.com. +set -euo pipefail + +REPO="NVIDIA-NeMo/Curator" +SINCE=1608 +OUTDIR=".curator-pr-review/audio-corpus" +LIMIT=600 +REFRESH=0 + +while [[ $# -gt 0 ]]; do + case "$1" in + --since) SINCE="$2"; shift 2 ;; + --outdir) OUTDIR="$2"; shift 2 ;; + --repo) REPO="$2"; shift 2 ;; + --limit) LIMIT="$2"; shift 2 ;; + --refresh) REFRESH=1; shift ;; + -h|--help) + echo "Usage: pull_audio_pr_corpus.sh [--since N] [--outdir DIR] [--repo OWNER/REPO] [--limit N] [--refresh]"; exit 0 ;; + *) echo "unknown arg: $1" >&2; exit 2 ;; + esac +done + +command -v gh >/dev/null || { echo "error: gh (GitHub CLI) not found" >&2; exit 2; } +mkdir -p "${OUTDIR}" + +# Cache of candidate PRs already classified as non-audio, so reruns don't +# re-fetch their file lists. Audio PRs are remembered by their pr_gh.json. +NONAUDIO_CACHE="${OUTDIR}/_non_audio_prs.txt" +touch "${NONAUDIO_CACHE}" +declare -A IS_NONAUDIO=() +while IFS= read -r x; do [[ -n "${x}" ]] && IS_NONAUDIO["${x}"]=1; done < "${NONAUDIO_CACHE}" + +# Audio-path test (extended regex over a PR's changed-file list). +AUDIO_RE='^(nemo_curator/stages/audio/|nemo_curator/tasks/audio_task\.py|tutorials/audio/|tests/stages/audio/|tests/tasks/test_audio|benchmarking/.*([Aa]udio|ALM|alm))' + +echo "=== corpus discovery: ${REPO} PRs > #${SINCE} (state=all, limit=${LIMIT}) ===" >&2 +gh pr list --repo "${REPO}" --state all --limit "${LIMIT}" \ + --json number,title,state,author,createdAt,updatedAt,url \ + > "${OUTDIR}/_all_prs.json" + +# Candidate PR numbers (> SINCE), newest first. +mapfile -t CANDIDATES < <(python3 - "${OUTDIR}/_all_prs.json" "${SINCE}" <<'PY' +import json, sys +prs = json.load(open(sys.argv[1])); since = int(sys.argv[2]) +nums = sorted((p["number"] for p in prs if p["number"] > since), reverse=True) +print("\n".join(str(n) for n in nums)) +PY +) +echo "candidates after #${SINCE}: ${#CANDIDATES[@]}" >&2 + +AUDIO_NUMS=() +for n in "${CANDIDATES[@]}"; do + # Incremental: reuse prior classification unless --refresh. + if [[ ${REFRESH} -eq 0 ]]; then + if [[ -f "${OUTDIR}/pr${n}_gh.json" ]]; then + AUDIO_NUMS+=("${n}"); echo " pr${n}: AUDIO (on disk, skip)" >&2; continue + fi + if [[ -n "${IS_NONAUDIO[$n]:-}" ]]; then + echo " pr${n}: non-audio (cached, skip)" >&2; continue + fi + fi + files_json="${OUTDIR}/pr${n}_files.json" + gh api --paginate "repos/${REPO}/pulls/${n}/files" --jq '[.[].filename]' \ + > "${files_json}" 2>/dev/null || { echo " pr${n}: files fetch failed, skip" >&2; continue; } + if AUDIO_RE="${AUDIO_RE}" python3 - "${files_json}" <<'PY' +import json, os, re, sys +files = json.load(open(sys.argv[1])) +rx = re.compile(os.environ["AUDIO_RE"]) +sys.exit(0 if any(rx.search(f) for f in files) else 1) +PY + then + AUDIO_NUMS+=("${n}") + echo " pr${n}: AUDIO" >&2 + else + rm -f "${files_json}" + if [[ -z "${IS_NONAUDIO[$n]:-}" ]]; then + echo "${n}" >> "${NONAUDIO_CACHE}"; IS_NONAUDIO["${n}"]=1 + fi + fi +done + +echo "audio PRs after #${SINCE}: ${#AUDIO_NUMS[@]}" >&2 +printf '%s\n' "${AUDIO_NUMS[@]}" > "${OUTDIR}/_audio_pr_numbers.txt" + +pulled=0 +skipped=0 +for n in "${AUDIO_NUMS[@]}"; do + # Incremental: skip PRs whose data is already on disk unless --refresh. + if [[ ${REFRESH} -eq 0 && -f "${OUTDIR}/pr${n}_gh.json" \ + && -f "${OUTDIR}/pr${n}_reviews.json" \ + && -f "${OUTDIR}/pr${n}_review_comments.json" \ + && -f "${OUTDIR}/pr${n}_issue_comments.json" ]]; then + echo "--- pr${n}: already on disk, skip (use --refresh to re-pull) ---" >&2 + skipped=$((skipped + 1)); continue + fi + echo "--- pulling pr${n} reviews/comments ---" >&2 + gh pr view "${n}" --repo "${REPO}" \ + --json number,title,state,author,createdAt,updatedAt,mergedAt,closedAt,url,body \ + > "${OUTDIR}/pr${n}_gh.json" + gh api --paginate "repos/${REPO}/pulls/${n}/reviews" > "${OUTDIR}/pr${n}_reviews.json" + gh api --paginate "repos/${REPO}/pulls/${n}/comments" > "${OUTDIR}/pr${n}_review_comments.json" + gh api --paginate "repos/${REPO}/issues/${n}/comments" > "${OUTDIR}/pr${n}_issue_comments.json" + pulled=$((pulled + 1)) +done + +echo "AUDIO_PR_CORPUS_PULL_DONE outdir=${OUTDIR} audio_prs=${#AUDIO_NUMS[@]} pulled=${pulled} skipped=${skipped} refresh=${REFRESH}" >&2 diff --git a/.cursor/skills/review-curator-audio-pr/templates.md b/.cursor/skills/review-curator-audio-pr/templates.md new file mode 100644 index 0000000000..16af214cb0 --- /dev/null +++ b/.cursor/skills/review-curator-audio-pr/templates.md @@ -0,0 +1,158 @@ +# Output templates + +You are the reviewer. `scripts/build_digest.py` generates files A and B as +context; you author the findings (section C) and post them as your review. +The digest **leads with a PR overview** that you must complete and present +before any review comments. Replace `<...>` placeholders. Keep the structure +stable across review passes so a re-review diffs cleanly and only surfaces what +changed. + +## A. Working digest (generated, then you append findings) + +File: `curator_pr_fresh_review_.md`. It **leads with a PR +overview** (what the change does and why), then PR state, the diff, and all +existing review activity. Complete the overview's plain-language summary first +and present it before any findings; add your **Findings** section at the end. + +````markdown +# Curator PR Fresh Review - + +Review target: https://github.com/NVIDIA-NeMo/Curator/pull/ + +Current PR head reviewed: `` + +Base recorded by GitHub metadata: `` + +## What this PR does (overview) + +**** - by @<author> (<state>, +<a>/-<d> across <n> files, <k> commits) + +### Author's description +<verbatim PR body> + +### Areas touched +| Area | files | +/- | +|---|---|---| +| stages/audio/<area> | <n> | +<a> -<d> | + +### Plain-language summary (write this BEFORE any findings) +<In your own words, after reading the knowledge sources: what the PR changes and +why, the main stages/files, key design decisions, new deps/config/APIs, and +blast radius.> + +## PR state at review time + +| Field | Value | +|---|---| +| state | <OPEN/CLOSED/MERGED> | +| reviewDecision | <APPROVED/CHANGES_REQUESTED/none> | +| changedFiles | <n> | +| additions / deletions | +<a> / -<d> | +| commits | <n> | +| inline review comments total | <n> | +| reviews submitted | <n> | +| updatedAt | <iso> | + +## Commits (<k>) + +``` +<sha8> <subject> (<date>) +``` + +Files changed (`git diff <base>..<head>`): + +| File | +/- | status | +|---|---|---| +| `<path>` | +<a> -<d> | <added/modified> | + +## Existing reviews (by other reviewers) + +### #<id> by @<login> state=<STATE> commit=<sha8> submitted=<iso> +<body or empty> + +## Existing inline comments (by other reviewers) + +Total: <n> comments across <m> threads. + +- OPEN: <n> # still unresolved - do NOT duplicate these +- OUTDATED: <n> +- RESOLVED: <n> + +### `<path>` + +- **#<id>** @<login> <iso> line=<L> status=**<OPEN/OUTDATED/RESOLVED>** ... + url: <html_url> + > <body, truncated> + +## My findings (your review) + +### P0: <title> +<evidence: path:line on current head> - <why it's a blocker> - <concrete fix> + +### P1 / P2 / P3: <title> +<...> + +## Verdict +<APPROVE / COMMENT / REQUEST CHANGES, and the blockers if any> +```` + +## B. Prior open threads (generated context) + +File: `curator_pr<N>_github_comment_queue_<YYYY_MM_DD>.md`. The still-unresolved +threads other reviewers already opened, so you can scan prior feedback before +adding your own. This is **context, not your output** - you are not replying to +these as the author; you read them to avoid duplicating and to check whether the +author addressed them. + +````markdown +# Curator PR <N> Open Review Threads - <YYYY-MM-DD> + +Current PR head: `<headRefOid>` + +Threads other reviewers left that are still unresolved on the current head. + +## Thread <i> - <path>:<line> by @<login> + +Link: <html_url> + +```text +<verbatim reviewer body> +``` + +## Stale (outdated/resolved) threads +- [<OUTDATED/RESOLVED>] `<path>:<line>` @<login> (<html_url>): <one-line body> +```` + +## C. Writing your review comments + +Turn each finding into a review comment you post on the PR. For each: + +- Anchor it to the exact `path:line-range` on the current head. +- Lead with severity (P0-P3) and a one-line statement of the problem. +- Explain *why* it matters (cite a `.cursor/rules/*.mdc` contract or + `stages/audio/README.md` when the change deviates). +- Propose a concrete fix, ideally with a `suggestion` block. +- Before posting, check the "OPEN" list in file A - if another reviewer already + raised it, don't repeat it (optionally +1 their thread instead). + +Example review comment: + +> **P0 (blocker).** `sharded_manifest_writer.py:101` calls `json.dumps(record)` +> while `record` can still hold the waveform `ndarray` when `keep_waveform=True`, +> which raises `TypeError: Object of type ndarray is not JSON serializable` and +> fails the whole shard. Drop the waveform key before serialising and guard +> `json.dumps` so a stray non-serialisable value names the offending key. +> Please also add a test under `tests/stages/audio/io/` covering +> `keep_waveform=True`. + +When you have an exact fix, attach a GitHub `suggestion` block so the author can +apply it in one click (anchor the comment to the line(s) it replaces): + +```suggestion + record = {k: v for k, v in record.items() if k != "waveform"} + line = json.dumps(record) +``` + +Then post the set as one PR review: inline comments for line-specific findings, +a top-level summary carrying the overall verdict (APPROVE / COMMENT / REQUEST +CHANGES). diff --git a/.gitignore b/.gitignore index 1ba2c180fd..f99780f76e 100644 --- a/.gitignore +++ b/.gitignore @@ -169,3 +169,6 @@ fern/product-docs/ # PDF parsing data (symlinked to external storage) pdf_parsing/ + +# NeMo Curator PR review skill scratch output +.curator-pr-review/