Skip to content

Add review-curator-audio-pr Cursor skill for reviewing audio Curator PRs#2051

Open
mohammadaaftabv wants to merge 10 commits into
NVIDIA-NeMo:mainfrom
mohammadaaftabv:add-curator-pr-review-skill
Open

Add review-curator-audio-pr Cursor skill for reviewing audio Curator PRs#2051
mohammadaaftabv wants to merge 10 commits into
NVIDIA-NeMo:mainfrom
mohammadaaftabv:add-curator-pr-review-skill

Conversation

@mohammadaaftabv

@mohammadaaftabv mohammadaaftabv commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

Summary

Adds review-curator-audio-pr, a self-contained Cursor skill that helps a
human reviewer review someone else's audio-modality Curator PR
. It is an
assistant for the person doing the review - not an auto-approving bot, and not a
tool for the PR author to answer reviewers. The human stays in control: the skill
gathers context and drafts findings, and the reviewer reads, edits, and decides
what to actually post.

Repo-only (gh + git).

How it works (high level)

Given just a PR number, the skill runs this end to end:

  1. Locate the code — reuse an existing Curator checkout or shallow-clone one
    (ensure_repo.sh).
  2. Automatically pull the PR and the comments on it — via gh it fetches the
    PR metadata, the diff / changed files, the commits, and every prior review,
    inline comment, and top-level comment on the PR
    , plus the GraphQL review
    threads so each existing comment is tagged OPEN / OUTDATED / RESOLVED. No
    manual copy-pasting of the PR or its review history.
  3. Ground itself — build the post-refactor(audio): AudioBatch -> AudioTask single-dict redesign #1608 audio reviewer-comment corpus and read
    the audio knowledge sources (docs, code map, review lenses).
  4. Explain first — produce a detailed "what this PR does" overview before any
    critique.
  5. Draft findings — suggest P0–P3 findings with exact path:line evidence and
    concrete fixes, skipping anything other reviewers already raised.
  6. Human decides — the reviewer reviews/edits the draft and chooses what to
    post. The skill posts nothing on its own.

So the automation is in the pulling and analysis (the skill pulls the PR and
all comments on it for you); the posting stays a human decision.

Audio-only: scope is nemo_curator/stages/audio/,
nemo_curator/tasks/audio_task.py, tutorials/audio/, tests/stages/audio/,
and audio benchmarks. The pull/digest steps abort on a PR that touches no audio
path, so the skill never reviews a non-audio PR.

Who it's for

  • A human reviewer assigned an audio Curator PR who wants grounded context
    and a first-pass set of findings to review, refine, and post.
  • It surfaces what other reviewers already raised (OPEN/OUTDATED/RESOLVED) so the
    human doesn't duplicate, and cites the canonical docs/rules behind each point.
  • It is not for the PR author replying to a review, and it does not post
    anything on its own - the reviewer chooses what to submit.

Contents (.cursor/skills/review-curator-audio-pr/)

  • SKILL.md — reviewer workflow: locate → pull → diff → read knowledge sources
    → overview-first → lenses + P0–P3 findings, plus how to provide a
    gh-authenticated environment.
  • knowledge-sources.md — audio docs (README, dev-guide slides, fern docs,
    .cursor/rules, CONTRIBUTING), code map, review lenses, severity guide, and
    the post-refactor(audio): AudioBatch -> AudioTask single-dict redesign #1608 reviewer-comment corpus workflow.
  • templates.md — output layout (overview, digest, threads, comments).
  • scripts/ensure_repo.sh, pr_review_pull.sh (audio-only),
    build_digest.py (audio-only), pull_audio_pr_corpus.sh (incremental;
    --refresh to re-pull), build_corpus.py, README.md.
  • .gitignore — ignore scratch dir .curator-pr-review/.

Test plan

  • bash -n / ast.parse for all scripts.
  • build_digest.py: aborts on non-audio PR; overview renders before
    findings; correct OPEN/OUTDATED/RESOLVED.
  • pull_audio_pr_corpus.sh: incremental skip of PRs already on disk.
  • build_corpus.py end-to-end on synthetic data.
  • Maintainer sanity check of skill discovery in Cursor.

Add a self-contained Cursor Agent Skill under .cursor/skills/review-curator-pr
that helps contributors and maintainers review NVIDIA-NeMo/Curator pull
requests consistently.

The skill:
- pulls fresh PR data via the GitHub CLI (REST endpoints + GraphQL review
  threads, which carry isResolved/isOutdated);
- generates a structured review digest and a paste-ready open-comment queue;
- applies review lenses built on the repo's own .cursor/rules contracts and
  CONTRIBUTING.md standards (stage contracts, setup lifecycle, dependency
  hygiene, secret-safe logging, tests/coverage, PR reviewability).

Includes two helper scripts (scripts/pr_review_pull.sh, scripts/build_digest.py)
that write to a scratch .curator-pr-review/ directory, which is gitignored.

Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>
@mohammadaaftabv mohammadaaftabv requested a review from a team as a code owner June 5, 2026 07:18
@mohammadaaftabv mohammadaaftabv requested review from suiyoubi and removed request for a team June 5, 2026 07:18
@copy-pr-bot

copy-pr-bot Bot commented Jun 5, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@greptile-apps

greptile-apps Bot commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR adds a self-contained Cursor skill (review-curator-audio-pr) for reviewing audio-modality Curator pull requests. It includes Markdown workflow/knowledge files, shell scripts for fetching GitHub data, and Python scripts for building a structured review digest and a reviewer-comment corpus — all scoped to audio paths.

  • pr_review_pull.sh + build_digest.py pull six REST endpoints and a GraphQL thread payload for a single PR, build a fresh-review digest, and produce an open-comment queue.
  • pull_audio_pr_corpus.sh + build_corpus.py incrementally collect and consolidate reviewer feedback across all post-refactor(audio): AudioBatch -> AudioTask single-dict redesign #1608 audio PRs for pre-review context.

Confidence Score: 3/5

The audio-guard false-rejection in pr_review_pull.sh means the tool silently refuses to process any audio PR with more than ~30 changed files — the exact scenario it is meant to handle — before the review even begins.

The --paginate call for the files endpoint at line 52 produces concatenated JSON arrays for PRs with more than one page of results; the inline Python audio guard then raises JSONDecodeError and the script exits incorrectly with 'PR touches no audio path'. This happens at a much lower threshold than the pagination issues flagged on lines 70-77, and it silently kills the entire workflow rather than degrading the digest. Multiple other pagination-related issues in the same file and in pull_audio_pr_corpus.sh remain open from prior rounds as well.

pr_review_pull.sh — both the files endpoint (line 52) and the reviews/comments/commits endpoints (lines 70-77) need the --paginate/--jq fix before the tool behaves correctly on real-world PRs.

Important Files Changed

Filename Overview
.cursor/skills/review-curator-audio-pr/scripts/pr_review_pull.sh Audio-only guard at lines 57-62 reads the files JSON with json.load(), but the file is written by --paginate at line 52 without --jq; any PR with >30 changed files produces concatenated JSON arrays that crash the guard and incorrectly abort with 'PR touches no audio path'.
.cursor/skills/review-curator-audio-pr/scripts/build_digest.py Builds the review digest and comment queue; several pre-existing pagination and null-safety issues flagged in earlier review threads remain open.
.cursor/skills/review-curator-audio-pr/scripts/pull_audio_pr_corpus.sh Incremental corpus-pull script; --paginate + --jq concatenation issue for the files check was already flagged; lines 114-116 share the same root cause for reviews/comments.
.cursor/skills/review-curator-audio-pr/scripts/build_corpus.py Consolidates per-PR JSON into a recurring-themes corpus; uses (user or {}).get() guards throughout, avoids the null-user crash seen in build_digest.py.
.cursor/skills/review-curator-audio-pr/scripts/ensure_repo.sh Locates or shallow-clones the Curator repo; logic is sound with correct fallbacks.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A([pr_review_pull.sh]) --> B["gh api --paginate pulls/PR/files\n(line 52, no --jq)"]
    B --> C{PR has >30\nchanged files?}
    C -- No --> D["Single JSON array\n→ json.load() OK"]
    C -- Yes --> E["Concatenated arrays\n[...][...]\n→ JSONDecodeError"]
    D --> F["Audio guard passes\n→ continue pulling"]
    E --> G["Python exits code 1\nif ! fires\n→ exit 3: 'no audio path'"]
    G --> H([❌ Incorrect rejection\nof valid audio PR])
    F --> I["gh api --paginate reviews/comments/commits\n(lines 70-77, no --jq)"]
    I --> J{>100 reviews or comments?}
    J -- Yes --> K["Concatenated arrays\n→ build_digest.py JSONDecodeError"]
    J -- No --> L["build_digest.py\nproduces digest ✓"]
Loading

Reviews (9): Last reviewed commit: "review-curator-audio-pr: document how to..." | Re-trigger Greptile

help="timestamp suffix of a prior pull to mark NEW vs already-seen")
ap.add_argument("--prev-head", default=None, help="prior reviewed head SHA for the commit delta")
args = ap.parse_args()
today = args.today or dt.datetime.utcnow().date().isoformat()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 dt.datetime.utcnow() was deprecated in Python 3.12 and will emit DeprecationWarning on the project's supported Python 3.12 target. Use the timezone-aware replacement instead.

Suggested change
today = args.today or dt.datetime.utcnow().date().isoformat()
today = args.today or dt.datetime.now(dt.timezone.utc).date().isoformat()

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Comment on lines +1 to +2
#!/usr/bin/env python3
"""Build a Curator PR fresh-review digest + open-comment queue.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 build_digest.py is missing the NVIDIA Apache-2.0 copyright header that the project's own recurring-themes.md (added by this same PR) mandates on every non-empty Python file. The skill's review lenses will flag this on contributor PRs while the skill itself violates the convention.

Suggested change
#!/usr/bin/env python3
"""Build a Curator PR fresh-review digest + open-comment queue.
#!/usr/bin/env python3
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Build a Curator PR fresh-review digest + open-comment queue.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Rename the skill to review-curator-audio-pr and refocus its guidance on
audio-modality PRs (nemo_curator/stages/audio, tutorials/audio, audio
benchmarks, tests/stages/audio).

- SKILL.md: audio scope, step 1 confirms the PR touches audio paths, audio
  review lenses, audio-oriented P0-P3 examples.
- reference.md: audio stage map (io, preprocessing, inference asr/vad/
  speaker_diarization, filtering, segmentation, tagging, postprocessing,
  metrics, datasets, alm, advanced_pipelines), audio tutorials, audio
  benchmarks, audio tests/fixtures; anchored on stages/audio/README.md.
- recurring-themes.md: audio-specific lenses - waveform/tensor memory and
  manifest serialization, tarred/sharded I/O, optional audio deps, sample-rate
  and metadata propagation, throughput.
- scripts/README.md: updated paths.

Helper scripts are unchanged (they fetch any PR); the audio focus lives in the
review lenses.

Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>
@mohammadaaftabv mohammadaaftabv changed the title Add review-curator-pr Cursor skill for reviewing Curator PRs Add review-curator-audio-pr Cursor skill for reviewing audio Curator PRs Jun 5, 2026
…rs PRs

Reframe SKILL.md, templates.md, scripts and generated digest/threads output so it is unambiguous the skill is used by a reviewer to review someone elses audio Curator PR, not by an author responding to reviewers.

Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>
…ost-1608 PR corpus

ensure_repo.sh reuses an existing checkout or shallow-clones (no full history). SKILL.md gains an explicit human-reviewer prompt and a repo-locate step. recurring-themes.md and reference.md are merged into knowledge-sources.md, with each review lens linked to the audio code, README section, .cursor/rules contract, and the developer-guide slides. Adds pull_audio_pr_corpus.sh + build_corpus.py to consolidate reviewer feedback from audio PRs after 1608 as pre-review context.

Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>
AUDIO_NUMS=()
for n in "${CANDIDATES[@]}"; do
files_json="${OUTDIR}/pr${n}_files.json"
gh api --paginate "repos/${REPO}/pulls/${n}/files" --jq '[.[].filename]' \

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 --paginate + --jq concatenates raw JSON arrays, breaking json.load() for large PRs

When a PR has more than 100 changed files, gh api --paginate issues multiple requests. With --jq '[.[].filename]', the filter is applied independently to each page and the outputs are raw-concatenated — producing ["a","b"]["c","d"] rather than a single valid JSON array. The downstream json.load() on line 59 raises JSONDecodeError, the if … then branch evaluates false (exit code 1), and the PR is silently excluded from the corpus even when it clearly touches audio paths. Remove --jq from the gh api call and filter in the Python inline script instead, where you already have full control over the parsed structure.

… sources

Audit gap: the knowledge guide cited READMEs, .cursor/rules, and the slides deck but not the published audio documentation set under fern/versions/v26.04/pages/curate-audio. Add a fern doc map (concept -> doc -> code), wire fern refs into the memory/throughput/tutorial lenses, include the readspeech notebook, and pin the slides URL to the exact slide anchor.

Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>
p(d, "```")
for c in new_commits:
sha = c["sha"][:8]
msg = (c["commit"]["message"] or "").splitlines()[0]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 "".splitlines()[0] raises IndexError when a commit's message field is None or an empty string. GitHub's API guarantees a non-null message in practice, but if a merge commit's message is empty (or the API returns null on a bot/squashed commit), this crashes the entire digest build with IndexError: list index out of range. Use str.split(" ", 1) instead: "".split(" ", 1) returns [""], so [0] is safe regardless of content.

Suggested change
msg = (c["commit"]["message"] or "").splitlines()[0]
msg = (c["commit"]["message"] or "").split("\n", 1)[0]

build_digest.py now emits a What this PR does (overview) section first: author description, an areas-touched table, and a plain-language summary placeholder, ahead of PR state, existing reviews/comments, and findings. SKILL.md and templates.md require presenting this overview before any review comments (new Step 6).

Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>
…sources

Make reading knowledge-sources.md an explicit Step 4 (canonical docs + code map + lenses) that precedes the detailed PR overview (Step 6); lens application moves into the findings step (Step 7). Overview placeholders in build_digest.py and templates.md note the summary is written after grounding in the knowledge sources.

Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>
Comment on lines +51 to +60
pull_endpoint "pulls/${PR}/reviews" "${OUTDIR}/pr${PR}_reviews_${TS}.json" \
gh api --paginate "repos/${REPO}/pulls/${PR}/reviews"
pull_endpoint "pulls/${PR}/comments (inline)" "${OUTDIR}/pr${PR}_review_comments_${TS}.json" \
gh api --paginate "repos/${REPO}/pulls/${PR}/comments"
pull_endpoint "issues/${PR}/comments (top-level)" "${OUTDIR}/pr${PR}_issue_comments_${TS}.json" \
gh api --paginate "repos/${REPO}/issues/${PR}/comments"
pull_endpoint "pulls/${PR}/files" "${OUTDIR}/pr${PR}_files_${TS}.json" \
gh api --paginate "repos/${REPO}/pulls/${PR}/files"
pull_endpoint "pulls/${PR}/commits" "${OUTDIR}/pr${PR}_commits_${TS}.json" \
gh api --paginate "repos/${REPO}/pulls/${PR}/commits"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 gh api --paginate without --jq outputs each page's JSON array verbatim, back-to-back. A PR with >100 reviews, >100 inline comments, or >300 changed files causes build_digest.py's json.loads() to receive [...][...] — not a valid JSON document — and raise JSONDecodeError, silently producing an empty digest. The previous thread on this same root cause in pull_audio_pr_corpus.sh noted the fix: use --jq '.[]' per-page and wrap the result in a Python accumulator.

Suggested change
pull_endpoint "pulls/${PR}/reviews" "${OUTDIR}/pr${PR}_reviews_${TS}.json" \
gh api --paginate "repos/${REPO}/pulls/${PR}/reviews"
pull_endpoint "pulls/${PR}/comments (inline)" "${OUTDIR}/pr${PR}_review_comments_${TS}.json" \
gh api --paginate "repos/${REPO}/pulls/${PR}/comments"
pull_endpoint "issues/${PR}/comments (top-level)" "${OUTDIR}/pr${PR}_issue_comments_${TS}.json" \
gh api --paginate "repos/${REPO}/issues/${PR}/comments"
pull_endpoint "pulls/${PR}/files" "${OUTDIR}/pr${PR}_files_${TS}.json" \
gh api --paginate "repos/${REPO}/pulls/${PR}/files"
pull_endpoint "pulls/${PR}/commits" "${OUTDIR}/pr${PR}_commits_${TS}.json" \
gh api --paginate "repos/${REPO}/pulls/${PR}/commits"
pull_paginated() {
# Accumulate all pages into a single JSON array.
local label="$1" outfile="$2"; shift 2
echo "--- ${label} -> ${outfile} ---" | tee -a "${LOG}"
gh api --paginate --jq '.[]' "$@" \
| python3 -c "import json,sys; print(json.dumps(list(map(json.loads, sys.stdin))))" \
> "${outfile}"
printf 'bytes=%s\n\n' "$(stat -c%s "${outfile}" 2>/dev/null || wc -c < "${outfile}")" | tee -a "${LOG}"
}
pull_paginated "pulls/${PR}/reviews" "${OUTDIR}/pr${PR}_reviews_${TS}.json" \
"repos/${REPO}/pulls/${PR}/reviews"
pull_paginated "pulls/${PR}/comments (inline)" "${OUTDIR}/pr${PR}_review_comments_${TS}.json" \
"repos/${REPO}/pulls/${PR}/comments"
pull_paginated "issues/${PR}/comments (top-level)" "${OUTDIR}/pr${PR}_issue_comments_${TS}.json" \
"repos/${REPO}/issues/${PR}/comments"
pull_paginated "pulls/${PR}/files" "${OUTDIR}/pr${PR}_files_${TS}.json" \
"repos/${REPO}/pulls/${PR}/files"
pull_paginated "pulls/${PR}/commits" "${OUTDIR}/pr${PR}_commits_${TS}.json" \
"repos/${REPO}/pulls/${PR}/commits"

…ulls, doc cosmetics

Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>
Make corpus pull + build_corpus mandatory in the skill workflow so reviewers
cannot skip cross-PR pattern grounding when reviewing audio Curator PRs.

Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>
…nvironment

Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>
@sarahyurick sarahyurick self-requested a review June 11, 2026 20:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant