Add review-curator-audio-pr Cursor skill for reviewing audio Curator PRs by mohammadaaftabv · Pull Request #2051 · NVIDIA-NeMo/Curator

mohammadaaftabv · 2026-06-05T07:18:15Z

Summary

Adds review-curator-audio-pr, a self-contained Cursor skill that helps a
human reviewer review someone else's audio-modality Curator PR. It is an
assistant for the person doing the review - not an auto-approving bot, and not a
tool for the PR author to answer reviewers. The human stays in control: the skill
gathers context and drafts findings, and the reviewer reads, edits, and decides
what to actually post.

Repo-only (gh + git).

How it works (high level)

Given just a PR number, the skill runs this end to end:

Locate the code — reuse an existing Curator checkout or shallow-clone one
(ensure_repo.sh).
Automatically pull the PR and the comments on it — via gh it fetches the
PR metadata, the diff / changed files, the commits, and every prior review,
inline comment, and top-level comment on the PR, plus the GraphQL review
threads so each existing comment is tagged OPEN / OUTDATED / RESOLVED. No
manual copy-pasting of the PR or its review history.
Ground itself — build the post-refactor(audio): AudioBatch -> AudioTask single-dict redesign #1608 audio reviewer-comment corpus and read
the audio knowledge sources (docs, code map, review lenses).
Explain first — produce a detailed "what this PR does" overview before any
critique.
Draft findings — suggest P0–P3 findings with exact path:line evidence and
concrete fixes, skipping anything other reviewers already raised.
Human decides — the reviewer reviews/edits the draft and chooses what to
post. The skill posts nothing on its own.

So the automation is in the pulling and analysis (the skill pulls the PR and
all comments on it for you); the posting stays a human decision.

Audio-only: scope is nemo_curator/stages/audio/,
nemo_curator/tasks/audio_task.py, tutorials/audio/, tests/stages/audio/,
and audio benchmarks. The pull/digest steps abort on a PR that touches no audio
path, so the skill never reviews a non-audio PR.

Who it's for

A human reviewer assigned an audio Curator PR who wants grounded context
and a first-pass set of findings to review, refine, and post.
It surfaces what other reviewers already raised (OPEN/OUTDATED/RESOLVED) so the
human doesn't duplicate, and cites the canonical docs/rules behind each point.
It is not for the PR author replying to a review, and it does not post
anything on its own - the reviewer chooses what to submit.

Contents (`.cursor/skills/review-curator-audio-pr/`)

SKILL.md — reviewer workflow: locate → pull → diff → read knowledge sources
→ overview-first → lenses + P0–P3 findings, plus how to provide a
gh-authenticated environment.
knowledge-sources.md — audio docs (README, dev-guide slides, fern docs,
.cursor/rules, CONTRIBUTING), code map, review lenses, severity guide, and
the post-refactor(audio): AudioBatch -> AudioTask single-dict redesign #1608 reviewer-comment corpus workflow.
templates.md — output layout (overview, digest, threads, comments).
scripts/ — ensure_repo.sh, pr_review_pull.sh (audio-only),
build_digest.py (audio-only), pull_audio_pr_corpus.sh (incremental;
--refresh to re-pull), build_corpus.py, README.md.
.gitignore — ignore scratch dir .curator-pr-review/.

Test plan

bash -n / ast.parse for all scripts.
build_digest.py: aborts on non-audio PR; overview renders before
findings; correct OPEN/OUTDATED/RESOLVED.
pull_audio_pr_corpus.sh: incremental skip of PRs already on disk.
build_corpus.py end-to-end on synthetic data.
Maintainer sanity check of skill discovery in Cursor.

Add a self-contained Cursor Agent Skill under .cursor/skills/review-curator-pr that helps contributors and maintainers review NVIDIA-NeMo/Curator pull requests consistently. The skill: - pulls fresh PR data via the GitHub CLI (REST endpoints + GraphQL review threads, which carry isResolved/isOutdated); - generates a structured review digest and a paste-ready open-comment queue; - applies review lenses built on the repo's own .cursor/rules contracts and CONTRIBUTING.md standards (stage contracts, setup lifecycle, dependency hygiene, secret-safe logging, tests/coverage, PR reviewability). Includes two helper scripts (scripts/pr_review_pull.sh, scripts/build_digest.py) that write to a scratch .curator-pr-review/ directory, which is gitignored. Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>

copy-pr-bot · 2026-06-05T07:18:19Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

greptile-apps · 2026-06-05T07:21:26Z

Greptile Summary

This PR adds a self-contained Cursor skill (review-curator-audio-pr) for reviewing audio-modality Curator pull requests. It includes Markdown workflow/knowledge files, shell scripts for fetching GitHub data, and Python scripts for building a structured review digest and a reviewer-comment corpus — all scoped to audio paths.

pr_review_pull.sh + build_digest.py pull six REST endpoints and a GraphQL thread payload for a single PR, build a fresh-review digest, and produce an open-comment queue.
pull_audio_pr_corpus.sh + build_corpus.py incrementally collect and consolidate reviewer feedback across all post-refactor(audio): AudioBatch -> AudioTask single-dict redesign #1608 audio PRs for pre-review context.

Confidence Score: 3/5

The audio-guard false-rejection in pr_review_pull.sh means the tool silently refuses to process any audio PR with more than ~30 changed files — the exact scenario it is meant to handle — before the review even begins.

The --paginate call for the files endpoint at line 52 produces concatenated JSON arrays for PRs with more than one page of results; the inline Python audio guard then raises JSONDecodeError and the script exits incorrectly with 'PR touches no audio path'. This happens at a much lower threshold than the pagination issues flagged on lines 70-77, and it silently kills the entire workflow rather than degrading the digest. Multiple other pagination-related issues in the same file and in pull_audio_pr_corpus.sh remain open from prior rounds as well.

pr_review_pull.sh — both the files endpoint (line 52) and the reviews/comments/commits endpoints (lines 70-77) need the --paginate/--jq fix before the tool behaves correctly on real-world PRs.

Important Files Changed

Filename	Overview
.cursor/skills/review-curator-audio-pr/scripts/pr_review_pull.sh	Audio-only guard at lines 57-62 reads the files JSON with json.load(), but the file is written by --paginate at line 52 without --jq; any PR with >30 changed files produces concatenated JSON arrays that crash the guard and incorrectly abort with 'PR touches no audio path'.
.cursor/skills/review-curator-audio-pr/scripts/build_digest.py	Builds the review digest and comment queue; several pre-existing pagination and null-safety issues flagged in earlier review threads remain open.
.cursor/skills/review-curator-audio-pr/scripts/pull_audio_pr_corpus.sh	Incremental corpus-pull script; --paginate + --jq concatenation issue for the files check was already flagged; lines 114-116 share the same root cause for reviews/comments.
.cursor/skills/review-curator-audio-pr/scripts/build_corpus.py	Consolidates per-PR JSON into a recurring-themes corpus; uses (user or {}).get() guards throughout, avoids the null-user crash seen in build_digest.py.
.cursor/skills/review-curator-audio-pr/scripts/ensure_repo.sh	Locates or shallow-clones the Curator repo; logic is sound with correct fallbacks.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A([pr_review_pull.sh]) --> B["gh api --paginate pulls/PR/files\n(line 52, no --jq)"]
    B --> C{PR has >30\nchanged files?}
    C -- No --> D["Single JSON array\n→ json.load() OK"]
    C -- Yes --> E["Concatenated arrays\n[...][...]\n→ JSONDecodeError"]
    D --> F["Audio guard passes\n→ continue pulling"]
    E --> G["Python exits code 1\nif ! fires\n→ exit 3: 'no audio path'"]
    G --> H([❌ Incorrect rejection\nof valid audio PR])
    F --> I["gh api --paginate reviews/comments/commits\n(lines 70-77, no --jq)"]
    I --> J{>100 reviews or comments?}
    J -- Yes --> K["Concatenated arrays\n→ build_digest.py JSONDecodeError"]
    J -- No --> L["build_digest.py\nproduces digest ✓"]

_{Reviews (9): Last reviewed commit: "review-curator-audio-pr: document how to..." | Re-trigger Greptile}

greptile-apps · 2026-06-05T07:21:31Z

+                    help="timestamp suffix of a prior pull to mark NEW vs already-seen")
+    ap.add_argument("--prev-head", default=None, help="prior reviewed head SHA for the commit delta")
+    args = ap.parse_args()
+    today = args.today or dt.datetime.utcnow().date().isoformat()


dt.datetime.utcnow() was deprecated in Python 3.12 and will emit DeprecationWarning on the project's supported Python 3.12 target. Use the timezone-aware replacement instead.

Suggested change

today = args.today or dt.datetime.utcnow().date().isoformat()

today = args.today or dt.datetime.now(dt.timezone.utc).date().isoformat()

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

greptile-apps · 2026-06-05T07:21:32Z

+#!/usr/bin/env python3
+"""Build a Curator PR fresh-review digest + open-comment queue.


build_digest.py is missing the NVIDIA Apache-2.0 copyright header that the project's own recurring-themes.md (added by this same PR) mandates on every non-empty Python file. The skill's review lenses will flag this on contributor PRs while the skill itself violates the convention.

Suggested change

#!/usr/bin/env python3

"""Build a Curator PR fresh-review digest + open-comment queue.

#!/usr/bin/env python3

# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.

#

# Licensed under the Apache License, Version 2.0 (the "License");

# you may not use this file except in compliance with the License.

# You may obtain a copy of the License at

#

# http://www.apache.org/licenses/LICENSE-2.0

#

# Unless required by applicable law or agreed to in writing, software

# distributed under the License is distributed on an "AS IS" BASIS,

# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

# See the License for the specific language governing permissions and

# limitations under the License.

"""Build a Curator PR fresh-review digest + open-comment queue.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Rename the skill to review-curator-audio-pr and refocus its guidance on audio-modality PRs (nemo_curator/stages/audio, tutorials/audio, audio benchmarks, tests/stages/audio). - SKILL.md: audio scope, step 1 confirms the PR touches audio paths, audio review lenses, audio-oriented P0-P3 examples. - reference.md: audio stage map (io, preprocessing, inference asr/vad/ speaker_diarization, filtering, segmentation, tagging, postprocessing, metrics, datasets, alm, advanced_pipelines), audio tutorials, audio benchmarks, audio tests/fixtures; anchored on stages/audio/README.md. - recurring-themes.md: audio-specific lenses - waveform/tensor memory and manifest serialization, tarred/sharded I/O, optional audio deps, sample-rate and metadata propagation, throughput. - scripts/README.md: updated paths. Helper scripts are unchanged (they fetch any PR); the audio focus lives in the review lenses. Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>

…rs PRs Reframe SKILL.md, templates.md, scripts and generated digest/threads output so it is unambiguous the skill is used by a reviewer to review someone elses audio Curator PR, not by an author responding to reviewers. Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>

…ost-1608 PR corpus ensure_repo.sh reuses an existing checkout or shallow-clones (no full history). SKILL.md gains an explicit human-reviewer prompt and a repo-locate step. recurring-themes.md and reference.md are merged into knowledge-sources.md, with each review lens linked to the audio code, README section, .cursor/rules contract, and the developer-guide slides. Adds pull_audio_pr_corpus.sh + build_corpus.py to consolidate reviewer feedback from audio PRs after 1608 as pre-review context. Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>

greptile-apps · 2026-06-05T09:43:54Z

+AUDIO_NUMS=()
+for n in "${CANDIDATES[@]}"; do
+    files_json="${OUTDIR}/pr${n}_files.json"
+    gh api --paginate "repos/${REPO}/pulls/${n}/files" --jq '[.[].filename]' \


--paginate + --jq concatenates raw JSON arrays, breaking json.load() for large PRs

When a PR has more than 100 changed files, gh api --paginate issues multiple requests. With --jq '[.[].filename]', the filter is applied independently to each page and the outputs are raw-concatenated — producing ["a","b"]["c","d"] rather than a single valid JSON array. The downstream json.load() on line 59 raises JSONDecodeError, the if … then branch evaluates false (exit code 1), and the PR is silently excluded from the corpus even when it clearly touches audio paths. Remove --jq from the gh api call and filter in the Python inline script instead, where you already have full control over the parsed structure.

… sources Audit gap: the knowledge guide cited READMEs, .cursor/rules, and the slides deck but not the published audio documentation set under fern/versions/v26.04/pages/curate-audio. Add a fern doc map (concept -> doc -> code), wire fern refs into the memory/throughput/tutorial lenses, include the readspeech notebook, and pin the slides URL to the exact slide anchor. Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>

greptile-apps · 2026-06-05T09:51:33Z

+    p(d, "```")
+    for c in new_commits:
+        sha = c["sha"][:8]
+        msg = (c["commit"]["message"] or "").splitlines()[0]


"".splitlines()[0] raises IndexError when a commit's message field is None or an empty string. GitHub's API guarantees a non-null message in practice, but if a merge commit's message is empty (or the API returns null on a bot/squashed commit), this crashes the entire digest build with IndexError: list index out of range. Use str.split(" ", 1) instead: "".split(" ", 1) returns [""], so [0] is safe regardless of content.

Suggested change

msg = (c["commit"]["message"] or "").splitlines()[0]

msg = (c["commit"]["message"] or "").split("\n", 1)[0]

build_digest.py now emits a What this PR does (overview) section first: author description, an areas-touched table, and a plain-language summary placeholder, ahead of PR state, existing reviews/comments, and findings. SKILL.md and templates.md require presenting this overview before any review comments (new Step 6). Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>

…sources Make reading knowledge-sources.md an explicit Step 4 (canonical docs + code map + lenses) that precedes the detailed PR overview (Step 6); lens application moves into the findings step (Step 7). Overview placeholders in build_digest.py and templates.md note the summary is written after grounding in the knowledge sources. Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>

greptile-apps · 2026-06-05T09:59:09Z

+pull_endpoint "pulls/${PR}/reviews" "${OUTDIR}/pr${PR}_reviews_${TS}.json" \
+    gh api --paginate "repos/${REPO}/pulls/${PR}/reviews"
+pull_endpoint "pulls/${PR}/comments (inline)" "${OUTDIR}/pr${PR}_review_comments_${TS}.json" \
+    gh api --paginate "repos/${REPO}/pulls/${PR}/comments"
+pull_endpoint "issues/${PR}/comments (top-level)" "${OUTDIR}/pr${PR}_issue_comments_${TS}.json" \
+    gh api --paginate "repos/${REPO}/issues/${PR}/comments"
+pull_endpoint "pulls/${PR}/files" "${OUTDIR}/pr${PR}_files_${TS}.json" \
+    gh api --paginate "repos/${REPO}/pulls/${PR}/files"
+pull_endpoint "pulls/${PR}/commits" "${OUTDIR}/pr${PR}_commits_${TS}.json" \
+    gh api --paginate "repos/${REPO}/pulls/${PR}/commits"


gh api --paginate without --jq outputs each page's JSON array verbatim, back-to-back. A PR with >100 reviews, >100 inline comments, or >300 changed files causes build_digest.py's json.loads() to receive [...][...] — not a valid JSON document — and raise JSONDecodeError, silently producing an empty digest. The previous thread on this same root cause in pull_audio_pr_corpus.sh noted the fix: use --jq '.[]' per-page and wrap the result in a Python accumulator.

Suggested change

pull_endpoint "pulls/${PR}/reviews" "${OUTDIR}/pr${PR}_reviews_${TS}.json" \

gh api --paginate "repos/${REPO}/pulls/${PR}/reviews"

pull_endpoint "pulls/${PR}/comments (inline)" "${OUTDIR}/pr${PR}_review_comments_${TS}.json" \

gh api --paginate "repos/${REPO}/pulls/${PR}/comments"

pull_endpoint "issues/${PR}/comments (top-level)" "${OUTDIR}/pr${PR}_issue_comments_${TS}.json" \

gh api --paginate "repos/${REPO}/issues/${PR}/comments"

pull_endpoint "pulls/${PR}/files" "${OUTDIR}/pr${PR}_files_${TS}.json" \

gh api --paginate "repos/${REPO}/pulls/${PR}/files"

pull_endpoint "pulls/${PR}/commits" "${OUTDIR}/pr${PR}_commits_${TS}.json" \

gh api --paginate "repos/${REPO}/pulls/${PR}/commits"

pull_paginated() {

# Accumulate all pages into a single JSON array.

local label="$1" outfile="$2"; shift 2

echo "--- ${label} -> ${outfile} ---" | tee -a "${LOG}"

gh api --paginate --jq '.[]' "$@" \

| python3 -c "import json,sys; print(json.dumps(list(map(json.loads, sys.stdin))))" \

> "${outfile}"

printf 'bytes=%s\n\n' "$(stat -c%s "${outfile}" 2>/dev/null || wc -c < "${outfile}")" | tee -a "${LOG}"

}

pull_paginated "pulls/${PR}/reviews" "${OUTDIR}/pr${PR}_reviews_${TS}.json" \

"repos/${REPO}/pulls/${PR}/reviews"

pull_paginated "pulls/${PR}/comments (inline)" "${OUTDIR}/pr${PR}_review_comments_${TS}.json" \

"repos/${REPO}/pulls/${PR}/comments"

pull_paginated "issues/${PR}/comments (top-level)" "${OUTDIR}/pr${PR}_issue_comments_${TS}.json" \

"repos/${REPO}/issues/${PR}/comments"

pull_paginated "pulls/${PR}/files" "${OUTDIR}/pr${PR}_files_${TS}.json" \

"repos/${REPO}/pulls/${PR}/files"

pull_paginated "pulls/${PR}/commits" "${OUTDIR}/pr${PR}_commits_${TS}.json" \

"repos/${REPO}/pulls/${PR}/commits"

…ulls, doc cosmetics Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>

Make corpus pull + build_corpus mandatory in the skill workflow so reviewers cannot skip cross-PR pattern grounding when reviewing audio Curator PRs. Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>

…nvironment Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>

mohammadaaftabv requested a review from a team as a code owner June 5, 2026 07:18

mohammadaaftabv requested review from suiyoubi and removed request for a team June 5, 2026 07:18

greptile-apps Bot reviewed Jun 5, 2026

View reviewed changes

mohammadaaftabv changed the title ~~Add review-curator-pr Cursor skill for reviewing Curator PRs~~ Add review-curator-audio-pr Cursor skill for reviewing audio Curator PRs Jun 5, 2026

mohammadaaftabv added 2 commits June 5, 2026 14:42

greptile-apps Bot reviewed Jun 5, 2026

View reviewed changes

mohammadaaftabv added 2 commits June 5, 2026 15:21

greptile-apps Bot reviewed Jun 5, 2026

View reviewed changes

mohammadaaftabv added 3 commits June 9, 2026 17:57

review-curator-audio-pr: audio-only enforcement, incremental corpus p…

f8ae3bd

…ulls, doc cosmetics Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>

Require post-NVIDIA-NeMo#1608 audio corpus build before every PR review.

5a930a7

Make corpus pull + build_corpus mandatory in the skill workflow so reviewers cannot skip cross-PR pattern grounding when reviewing audio Curator PRs. Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>

review-curator-audio-pr: document how to provide a gh-authenticated e…

cfd4994

…nvironment Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>

sarahyurick self-requested a review June 11, 2026 20:57

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add review-curator-audio-pr Cursor skill for reviewing audio Curator PRs#2051

Add review-curator-audio-pr Cursor skill for reviewing audio Curator PRs#2051
mohammadaaftabv wants to merge 10 commits into
NVIDIA-NeMo:mainfrom
mohammadaaftabv:add-curator-pr-review-skill

mohammadaaftabv commented Jun 5, 2026 •

edited

Loading

Uh oh!

copy-pr-bot Bot commented Jun 5, 2026

Uh oh!

greptile-apps Bot commented Jun 5, 2026 •

edited

Loading

Uh oh!

greptile-apps Bot Jun 5, 2026

Uh oh!

greptile-apps Bot Jun 5, 2026

Uh oh!

greptile-apps Bot Jun 5, 2026

Uh oh!

greptile-apps Bot Jun 5, 2026

Uh oh!

greptile-apps Bot Jun 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

	today = args.today or dt.datetime.utcnow().date().isoformat()
	today = args.today or dt.datetime.now(dt.timezone.utc).date().isoformat()

		#!/usr/bin/env python3
		"""Build a Curator PR fresh-review digest + open-comment queue.

-#!/usr/bin/env python3
-"""Build a Curator PR fresh-review digest + open-comment queue.
+#!/usr/bin/env python3
+# Copyright (c) 2024, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Build a Curator PR fresh-review digest + open-comment queue.

	msg = (c["commit"]["message"] or "").splitlines()[0]
	msg = (c["commit"]["message"] or "").split("\n", 1)[0]

Conversation

mohammadaaftabv commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

How it works (high level)

Who it's for

Contents (.cursor/skills/review-curator-audio-pr/)

Test plan

Uh oh!

copy-pr-bot Bot commented Jun 5, 2026

Uh oh!

greptile-apps Bot commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 3/5

Important Files Changed

Flowchart

Uh oh!

greptile-apps Bot Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

mohammadaaftabv commented Jun 5, 2026 •

edited

Loading

Contents (`.cursor/skills/review-curator-audio-pr/`)

greptile-apps Bot commented Jun 5, 2026 •

edited

Loading