Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
257 changes: 257 additions & 0 deletions .cursor/skills/review-curator-audio-pr/SKILL.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,257 @@
---
name: review-curator-audio-pr
description: >-
Review someone else's NVIDIA-NeMo/Curator audio-modality pull request. This is
a reviewer's tool: given a PR number, it locates (or shallow-clones) the repo,
pulls the PR, builds the post-#1608 audio review corpus, diffs the changed
audio code, applies Curator's audio stage contracts and contribution standards,
shows what other reviewers have already raised (so you don't duplicate), and
helps you produce a structured set of review findings (P0-P3) to post as
review comments. Use when you are assigned
or pick up an audio Curator PR and need to review it (e.g. "review audio PR
1967", "do a review of this Curator audio PR", "what should I flag on PR
1898"). It is NOT for a PR author responding to a review.
---

# Review NeMo Curator audio PRs

**Use this when you are the reviewer and need to review someone else's audio
Curator PR.** Given a PR number, it helps you understand the change, see what
other reviewers already said, and write your own review. It does not help a PR
author reply to reviewers - it helps you *produce* the review.

Scope: the **audio** modality only - code under `nemo_curator/stages/audio/`,
`nemo_curator/tasks/audio_task.py`, `tutorials/audio/`, audio tests under
`tests/stages/audio/`, and audio benchmarks. This skill is **audio-only**: if a
PR touches no audio path, the pull step refuses to run (step 2) - review it
with a modality-appropriate tool instead.

Requirements: an environment with `git` and the GitHub CLI (`gh`) **already
authenticated** against `github.com`. Everything the skill does goes through
`gh`/`git`, so a working `gh` is the only credential you need. Verify with `gh
auth status`; if it is not green, give it an authenticated environment one of
these ways before invoking the skill:

- **Interactive:** run `gh auth login` (choose GitHub.com, HTTPS) once; the token
is cached for future runs.
- **Token (CI / headless / sandbox):** export `GH_TOKEN` (or `GITHUB_TOKEN`) with
a PAT that has `repo` scope - e.g. `export GH_TOKEN=ghp_...` - then `gh auth
status` should pass. This is the way to hand a remote/agent runner a
`gh`-configured environment.
- **Existing config:** if you already have a `~/.config/gh/hosts.yml` (or a host
with `gh` set up), run the skill from that shell/user so it inherits the auth.

You do **not** need to pre-clone the repo - step 0 reuses any checkout you
already have and only shallow-clones
[NVIDIA-NeMo/Curator](https://github.com/NVIDIA-NeMo/Curator) if none is found.

## The reviewer prompt

A human reviewer invokes this skill with a prompt like:

> **"Review audio Curator PR <N> with the review-curator-audio-pr skill. Find or
> shallow-clone the repo, pull the PR and prior review activity, build the
> post-#1608 audio review corpus, read the knowledge sources, then give me a
> detailed overview of what the PR does
> followed by my own findings (P0-P3) with exact `path:line` evidence and
> concrete fixes, ready to post as PR comments. Skip anything other reviewers
> already raised."**

Variants: *"re-review PR <N> and show only what changed since the last review"*,
*"what should I flag on the diarization PR <N>?"*, *"triage open audio PRs and
review the smallest one first"*. Every review run must build (or refresh) the
post-#1608 audio review corpus before findings (see
[knowledge-sources.md](knowledge-sources.md) section 4).

Run the prompt in an environment where `gh auth status` passes. For a
remote/headless agent, say so explicitly, e.g. *"`gh` is authenticated via
`GH_TOKEN` in this environment - review audio Curator PR <N> ..."* (see
Requirements above for how to provide a `gh`-configured environment).

## What you produce

You produce two things, **in this order**: (1) a **detailed overview of what the
PR does and why** - written only after you have read the knowledge sources (the
canonical docs and review lenses), so the reader understands the change before
any critique - then (2) your **findings** (P0-P3), each tied to a `path:line` on
the PR's current head, that you post as PR review comments. Never lead with
review comments; ground yourself in the knowledge sources, then explain the PR,
then give findings.

The skill writes two helper files into a scratch directory (default
`.curator-pr-review/`, override with `--outdir`) to support that:

1. `curator_pr<N>_fresh_review_<YYYY_MM_DD>.md` - **your working digest**: leads
with a **"What this PR does (overview)"** section (the author's description,
an areas-touched table, and a plain-language summary you write), then PR
state, commits, changed-file table, and every existing review/comment grouped
by file with OPEN / OUTDATED / RESOLVED status, then a placeholder for your
findings. Fill in the overview first, then record your findings at the bottom.
2. `curator_pr<N>_github_comment_queue_<YYYY_MM_DD>.md` - **prior open threads**:
a condensed list of the threads other reviewers left that are still
unresolved on the current head. Scan it before writing your own comments.

See [templates.md](templates.md) for the exact layout and how to phrase your
review comments. The review lenses and every doc/code reference live in
[knowledge-sources.md](knowledge-sources.md).

## Workflow

```
- [ ] 0. Locate or shallow-clone the Curator repo
- [ ] 1. Identify the PR and confirm it touches audio paths
- [ ] 2. Pull fresh GitHub data (gh)
- [ ] 3. Build the post-#1608 audio review corpus (required)
- [ ] 4. Read the diff
- [ ] 5. Read the knowledge sources (canonical docs + review lenses + corpus) to ground yourself
- [ ] 6. Generate the digest + prior-threads file for context
- [ ] 7. Write the detailed PR overview - only after steps 4-5 - present this first
- [ ] 8. Review through the lenses, write up findings (P0-P3), and post them after the overview
```

### Step 0 - Locate or shallow-clone the repo

Don't clone unnecessarily. This helper checks whether you are already inside a
Curator checkout, searches the current directory tree for one, and only then
shallow-clones (`--depth 1`, no full history):

```bash
eval "$(scripts/ensure_repo.sh | tail -1)" # sets CURATOR_REPO=<path>
cd "$CURATOR_REPO"
```

If you are already in the checkout that contains this skill, you can skip this.

### Step 1 - Identify the PR

`gh pr view <N> --repo NVIDIA-NeMo/Curator`. This skill is **audio-only**: the
pull step (step 2) aborts if the PR touches no audio path
(`nemo_curator/stages/audio/`, `nemo_curator/tasks/audio_task.py`,
`tutorials/audio/`, `tests/stages/audio/`, audio benchmarks). If it aborts, the
PR is out of scope for this skill, so stop and do not review it here.

### Step 2 - Pull fresh GitHub data

```bash
.cursor/skills/review-curator-audio-pr/scripts/pr_review_pull.sh <N>
```

Pulls six REST endpoints (`pr view`, `reviews`, inline `comments`, issue
`comments`, `files`, `commits`) and the GraphQL review threads, which carry the
`isResolved` / `isOutdated` flags the REST inline endpoint omits. You pull this
to see prior review activity, not because you own the PR.

### Step 3 - Build the post-#1608 audio review corpus (required)

Before you explain or judge anything, build the consolidated reviewer-comment
corpus so you can spot cross-PR repeated mistakes. Do **not** skip this step,
even if a prior corpus file exists on disk - rerun the pull (incremental by
default) and render a fresh consolidated file for today's review:

```bash
.cursor/skills/review-curator-audio-pr/scripts/pull_audio_pr_corpus.sh --since 1608
.cursor/skills/review-curator-audio-pr/scripts/build_corpus.py --today <YYYY-MM-DD>
```

Confirm `.curator-pr-review/audio-corpus/audio_pr_corpus_<date>.md` exists and
read it before writing findings. If the pull scripts fail, stop and fix the
failure - do not proceed with a review that skipped the corpus. See
[knowledge-sources.md](knowledge-sources.md) section 4 for details.

### Step 4 - Read the diff

```bash
gh pr diff <N> --repo NVIDIA-NeMo/Curator # works on a shallow clone
# optional, to run/inspect locally: gh pr checkout <N>
```

`gh pr diff` fetches the patch straight from GitHub, so it works even on the
shallow clone from step 0. Always review the actual diff, never the comment text
alone. Use `main` as the baseline for "how the audio stages already do this".

### Step 5 - Read the knowledge sources (ground yourself first)

Before you explain or judge anything, read
[knowledge-sources.md](knowledge-sources.md) so your overview and findings are
grounded in the canonical material, not guesswork:

- **section 0** - canonical docs: the audio stage developer guide
(`nemo_curator/stages/audio/README.md`), the developer-guide slides, the
published fern audio docs, `.cursor/rules/*.mdc`, and `CONTRIBUTING.md`.
- **section 1** - the audio code map, so you know where each changed file sits.
- **section 2** - the review lenses (you apply these in step 8).
- **section 4 + corpus file** - the post-#1608 reviewer-comment corpus from step 3.

Open the specific sources the diff touches (e.g. the fern page and the code for
the stage being changed). Check the corpus for recurring themes and whether this
PR repeats feedback already raised on other audio PRs. Only after this grounding
do you write the overview (step 7).

### Step 6 - Generate the context files

```bash
.cursor/skills/review-curator-audio-pr/scripts/build_digest.py <N> --today <YYYY-MM-DD>
```

The builder joins the pulled JSON and classifies each existing comment by thread
status: **OPEN** (still unresolved), **OUTDATED** (pre-dates the current head),
**RESOLVED**. Use the OPEN list to avoid re-raising points another reviewer
already made.

### Step 7 - Summarize the PR in detail (present this first)

Now that you have read the diff (step 4) and the knowledge sources (step 5),
write the **"What this PR does (overview)"** section of the digest so a reader
understands the change end to end: the problem it solves; the
main audio stages/files it adds or modifies (use the "Areas touched" table the
builder generates); key design decisions; new dependencies, config, or APIs; and
the blast radius (what could regress). Lead your review with this overview - the
author and other reviewers read it before any findings.

### Step 8 - Review through the lenses, then write up and post your findings

Apply the lenses in [knowledge-sources.md](knowledge-sources.md) section 2 to the
diff - each lens links the audio code, README section, and `.cursor/rules`
contract it governs: stage contracts, setup/teardown lifecycle, audio
optional-dependency hygiene (vLLM, NeMo, model utils), secret-safe logging,
waveform/tensor memory and manifest serialization, tarred/sharded audio I/O,
sample-rate and metadata propagation, streaming/throughput, tutorials/docs,
tests/coverage, and PR reviewability. Cite the linked source by name whenever
the PR deviates.

Record each issue as a finding, classified by severity, and post them **after
the overview** as a PR review (inline comments for specific lines, a top-level
summary for the overall verdict). Each finding cites the exact `path:line-range`
on the current head and proposes a concrete fix.

- **P0** - merge blocker: data loss / crash on a valid audio config (e.g. a
manifest writer that crashes on a kept waveform), a secret (HF token) leaked
into logs, or an import that breaks a supported install.
- **P1** - should fix before merge: undeclared audio runtime dependencies,
missing required tests/coverage, environment-specific I/O in a reusable
reader, a PR too large to review safely.
- **P2** - worth fixing: duplicated pipeline construction, inefficient per-task
writer batching, incomplete throughput metrics, documented-but-unwired config.
- **P3** - nice to have: temporary model/compat workarounds needing a test +
comment, debug-only knobs (e.g. per-shard utterance caps) to document.

See [templates.md](templates.md) section C for how to phrase review comments.

## What to check the PR against (CONTRIBUTING.md and rules)

A compliant audio PR must satisfy these; flag any that are missing:

- **DCO sign-off** on every commit (`git commit -s`).
- **Tests** accompany source changes; GPU-only audio tests use
`@pytest.mark.gpu`; the project enforces 80% coverage under `nemo_curator/`,
and `tests/stages/audio/` mirrors `nemo_curator/stages/audio/`.
- **NVIDIA Apache-2.0 copyright header** on every non-empty Python file.
- **Ruff** lint/format, 119-char lines; **loguru** for logging.
- Supported Python: 3.10-3.12.

## Knowledge sources

[knowledge-sources.md](knowledge-sources.md) is the single index: canonical docs
(audio README, developer-guide slides, `.cursor/rules`, tutorials, benchmarks),
the audio code map, the review lenses with per-concept code references, the
post-#1608 PR corpus workflow, and the GitHub/`gh` data reference.
Loading
Loading