Add review-curator-audio-pr Cursor skill for reviewing audio Curator PRs#2051
Add review-curator-audio-pr Cursor skill for reviewing audio Curator PRs#2051mohammadaaftabv wants to merge 10 commits into
Conversation
Add a self-contained Cursor Agent Skill under .cursor/skills/review-curator-pr that helps contributors and maintainers review NVIDIA-NeMo/Curator pull requests consistently. The skill: - pulls fresh PR data via the GitHub CLI (REST endpoints + GraphQL review threads, which carry isResolved/isOutdated); - generates a structured review digest and a paste-ready open-comment queue; - applies review lenses built on the repo's own .cursor/rules contracts and CONTRIBUTING.md standards (stage contracts, setup lifecycle, dependency hygiene, secret-safe logging, tests/coverage, PR reviewability). Includes two helper scripts (scripts/pr_review_pull.sh, scripts/build_digest.py) that write to a scratch .curator-pr-review/ directory, which is gitignored. Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>
Greptile SummaryThis PR adds a self-contained Cursor skill (
Confidence Score: 3/5The audio-guard false-rejection in pr_review_pull.sh means the tool silently refuses to process any audio PR with more than ~30 changed files — the exact scenario it is meant to handle — before the review even begins. The --paginate call for the files endpoint at line 52 produces concatenated JSON arrays for PRs with more than one page of results; the inline Python audio guard then raises JSONDecodeError and the script exits incorrectly with 'PR touches no audio path'. This happens at a much lower threshold than the pagination issues flagged on lines 70-77, and it silently kills the entire workflow rather than degrading the digest. Multiple other pagination-related issues in the same file and in pull_audio_pr_corpus.sh remain open from prior rounds as well. pr_review_pull.sh — both the files endpoint (line 52) and the reviews/comments/commits endpoints (lines 70-77) need the --paginate/--jq fix before the tool behaves correctly on real-world PRs. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A([pr_review_pull.sh]) --> B["gh api --paginate pulls/PR/files\n(line 52, no --jq)"]
B --> C{PR has >30\nchanged files?}
C -- No --> D["Single JSON array\n→ json.load() OK"]
C -- Yes --> E["Concatenated arrays\n[...][...]\n→ JSONDecodeError"]
D --> F["Audio guard passes\n→ continue pulling"]
E --> G["Python exits code 1\nif ! fires\n→ exit 3: 'no audio path'"]
G --> H([❌ Incorrect rejection\nof valid audio PR])
F --> I["gh api --paginate reviews/comments/commits\n(lines 70-77, no --jq)"]
I --> J{>100 reviews or comments?}
J -- Yes --> K["Concatenated arrays\n→ build_digest.py JSONDecodeError"]
J -- No --> L["build_digest.py\nproduces digest ✓"]
Reviews (9): Last reviewed commit: "review-curator-audio-pr: document how to..." | Re-trigger Greptile |
| help="timestamp suffix of a prior pull to mark NEW vs already-seen") | ||
| ap.add_argument("--prev-head", default=None, help="prior reviewed head SHA for the commit delta") | ||
| args = ap.parse_args() | ||
| today = args.today or dt.datetime.utcnow().date().isoformat() |
There was a problem hiding this comment.
dt.datetime.utcnow() was deprecated in Python 3.12 and will emit DeprecationWarning on the project's supported Python 3.12 target. Use the timezone-aware replacement instead.
| today = args.today or dt.datetime.utcnow().date().isoformat() | |
| today = args.today or dt.datetime.now(dt.timezone.utc).date().isoformat() |
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
| #!/usr/bin/env python3 | ||
| """Build a Curator PR fresh-review digest + open-comment queue. |
There was a problem hiding this comment.
build_digest.py is missing the NVIDIA Apache-2.0 copyright header that the project's own recurring-themes.md (added by this same PR) mandates on every non-empty Python file. The skill's review lenses will flag this on contributor PRs while the skill itself violates the convention.
| #!/usr/bin/env python3 | |
| """Build a Curator PR fresh-review digest + open-comment queue. | |
| #!/usr/bin/env python3 | |
| # Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved. | |
| # | |
| # Licensed under the Apache License, Version 2.0 (the "License"); | |
| # you may not use this file except in compliance with the License. | |
| # You may obtain a copy of the License at | |
| # | |
| # http://www.apache.org/licenses/LICENSE-2.0 | |
| # | |
| # Unless required by applicable law or agreed to in writing, software | |
| # distributed under the License is distributed on an "AS IS" BASIS, | |
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | |
| # See the License for the specific language governing permissions and | |
| # limitations under the License. | |
| """Build a Curator PR fresh-review digest + open-comment queue. |
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Rename the skill to review-curator-audio-pr and refocus its guidance on audio-modality PRs (nemo_curator/stages/audio, tutorials/audio, audio benchmarks, tests/stages/audio). - SKILL.md: audio scope, step 1 confirms the PR touches audio paths, audio review lenses, audio-oriented P0-P3 examples. - reference.md: audio stage map (io, preprocessing, inference asr/vad/ speaker_diarization, filtering, segmentation, tagging, postprocessing, metrics, datasets, alm, advanced_pipelines), audio tutorials, audio benchmarks, audio tests/fixtures; anchored on stages/audio/README.md. - recurring-themes.md: audio-specific lenses - waveform/tensor memory and manifest serialization, tarred/sharded I/O, optional audio deps, sample-rate and metadata propagation, throughput. - scripts/README.md: updated paths. Helper scripts are unchanged (they fetch any PR); the audio focus lives in the review lenses. Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>
…rs PRs Reframe SKILL.md, templates.md, scripts and generated digest/threads output so it is unambiguous the skill is used by a reviewer to review someone elses audio Curator PR, not by an author responding to reviewers. Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>
…ost-1608 PR corpus ensure_repo.sh reuses an existing checkout or shallow-clones (no full history). SKILL.md gains an explicit human-reviewer prompt and a repo-locate step. recurring-themes.md and reference.md are merged into knowledge-sources.md, with each review lens linked to the audio code, README section, .cursor/rules contract, and the developer-guide slides. Adds pull_audio_pr_corpus.sh + build_corpus.py to consolidate reviewer feedback from audio PRs after 1608 as pre-review context. Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>
| AUDIO_NUMS=() | ||
| for n in "${CANDIDATES[@]}"; do | ||
| files_json="${OUTDIR}/pr${n}_files.json" | ||
| gh api --paginate "repos/${REPO}/pulls/${n}/files" --jq '[.[].filename]' \ |
There was a problem hiding this comment.
--paginate + --jq concatenates raw JSON arrays, breaking json.load() for large PRs
When a PR has more than 100 changed files, gh api --paginate issues multiple requests. With --jq '[.[].filename]', the filter is applied independently to each page and the outputs are raw-concatenated — producing ["a","b"]["c","d"] rather than a single valid JSON array. The downstream json.load() on line 59 raises JSONDecodeError, the if … then branch evaluates false (exit code 1), and the PR is silently excluded from the corpus even when it clearly touches audio paths. Remove --jq from the gh api call and filter in the Python inline script instead, where you already have full control over the parsed structure.
… sources Audit gap: the knowledge guide cited READMEs, .cursor/rules, and the slides deck but not the published audio documentation set under fern/versions/v26.04/pages/curate-audio. Add a fern doc map (concept -> doc -> code), wire fern refs into the memory/throughput/tutorial lenses, include the readspeech notebook, and pin the slides URL to the exact slide anchor. Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>
| p(d, "```") | ||
| for c in new_commits: | ||
| sha = c["sha"][:8] | ||
| msg = (c["commit"]["message"] or "").splitlines()[0] |
There was a problem hiding this comment.
"".splitlines()[0] raises IndexError when a commit's message field is None or an empty string. GitHub's API guarantees a non-null message in practice, but if a merge commit's message is empty (or the API returns null on a bot/squashed commit), this crashes the entire digest build with IndexError: list index out of range. Use str.split(" ", 1) instead: "".split(" ", 1) returns [""], so [0] is safe regardless of content.
| msg = (c["commit"]["message"] or "").splitlines()[0] | |
| msg = (c["commit"]["message"] or "").split("\n", 1)[0] |
build_digest.py now emits a What this PR does (overview) section first: author description, an areas-touched table, and a plain-language summary placeholder, ahead of PR state, existing reviews/comments, and findings. SKILL.md and templates.md require presenting this overview before any review comments (new Step 6). Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>
…sources Make reading knowledge-sources.md an explicit Step 4 (canonical docs + code map + lenses) that precedes the detailed PR overview (Step 6); lens application moves into the findings step (Step 7). Overview placeholders in build_digest.py and templates.md note the summary is written after grounding in the knowledge sources. Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>
| pull_endpoint "pulls/${PR}/reviews" "${OUTDIR}/pr${PR}_reviews_${TS}.json" \ | ||
| gh api --paginate "repos/${REPO}/pulls/${PR}/reviews" | ||
| pull_endpoint "pulls/${PR}/comments (inline)" "${OUTDIR}/pr${PR}_review_comments_${TS}.json" \ | ||
| gh api --paginate "repos/${REPO}/pulls/${PR}/comments" | ||
| pull_endpoint "issues/${PR}/comments (top-level)" "${OUTDIR}/pr${PR}_issue_comments_${TS}.json" \ | ||
| gh api --paginate "repos/${REPO}/issues/${PR}/comments" | ||
| pull_endpoint "pulls/${PR}/files" "${OUTDIR}/pr${PR}_files_${TS}.json" \ | ||
| gh api --paginate "repos/${REPO}/pulls/${PR}/files" | ||
| pull_endpoint "pulls/${PR}/commits" "${OUTDIR}/pr${PR}_commits_${TS}.json" \ | ||
| gh api --paginate "repos/${REPO}/pulls/${PR}/commits" |
There was a problem hiding this comment.
gh api --paginate without --jq outputs each page's JSON array verbatim, back-to-back. A PR with >100 reviews, >100 inline comments, or >300 changed files causes build_digest.py's json.loads() to receive [...][...] — not a valid JSON document — and raise JSONDecodeError, silently producing an empty digest. The previous thread on this same root cause in pull_audio_pr_corpus.sh noted the fix: use --jq '.[]' per-page and wrap the result in a Python accumulator.
| pull_endpoint "pulls/${PR}/reviews" "${OUTDIR}/pr${PR}_reviews_${TS}.json" \ | |
| gh api --paginate "repos/${REPO}/pulls/${PR}/reviews" | |
| pull_endpoint "pulls/${PR}/comments (inline)" "${OUTDIR}/pr${PR}_review_comments_${TS}.json" \ | |
| gh api --paginate "repos/${REPO}/pulls/${PR}/comments" | |
| pull_endpoint "issues/${PR}/comments (top-level)" "${OUTDIR}/pr${PR}_issue_comments_${TS}.json" \ | |
| gh api --paginate "repos/${REPO}/issues/${PR}/comments" | |
| pull_endpoint "pulls/${PR}/files" "${OUTDIR}/pr${PR}_files_${TS}.json" \ | |
| gh api --paginate "repos/${REPO}/pulls/${PR}/files" | |
| pull_endpoint "pulls/${PR}/commits" "${OUTDIR}/pr${PR}_commits_${TS}.json" \ | |
| gh api --paginate "repos/${REPO}/pulls/${PR}/commits" | |
| pull_paginated() { | |
| # Accumulate all pages into a single JSON array. | |
| local label="$1" outfile="$2"; shift 2 | |
| echo "--- ${label} -> ${outfile} ---" | tee -a "${LOG}" | |
| gh api --paginate --jq '.[]' "$@" \ | |
| | python3 -c "import json,sys; print(json.dumps(list(map(json.loads, sys.stdin))))" \ | |
| > "${outfile}" | |
| printf 'bytes=%s\n\n' "$(stat -c%s "${outfile}" 2>/dev/null || wc -c < "${outfile}")" | tee -a "${LOG}" | |
| } | |
| pull_paginated "pulls/${PR}/reviews" "${OUTDIR}/pr${PR}_reviews_${TS}.json" \ | |
| "repos/${REPO}/pulls/${PR}/reviews" | |
| pull_paginated "pulls/${PR}/comments (inline)" "${OUTDIR}/pr${PR}_review_comments_${TS}.json" \ | |
| "repos/${REPO}/pulls/${PR}/comments" | |
| pull_paginated "issues/${PR}/comments (top-level)" "${OUTDIR}/pr${PR}_issue_comments_${TS}.json" \ | |
| "repos/${REPO}/issues/${PR}/comments" | |
| pull_paginated "pulls/${PR}/files" "${OUTDIR}/pr${PR}_files_${TS}.json" \ | |
| "repos/${REPO}/pulls/${PR}/files" | |
| pull_paginated "pulls/${PR}/commits" "${OUTDIR}/pr${PR}_commits_${TS}.json" \ | |
| "repos/${REPO}/pulls/${PR}/commits" |
…ulls, doc cosmetics Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>
Make corpus pull + build_corpus mandatory in the skill workflow so reviewers cannot skip cross-PR pattern grounding when reviewing audio Curator PRs. Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>
…nvironment Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>
Summary
Adds
review-curator-audio-pr, a self-contained Cursor skill that helps ahuman reviewer review someone else's audio-modality Curator PR. It is an
assistant for the person doing the review - not an auto-approving bot, and not a
tool for the PR author to answer reviewers. The human stays in control: the skill
gathers context and drafts findings, and the reviewer reads, edits, and decides
what to actually post.
Repo-only (
gh+git).How it works (high level)
Given just a PR number, the skill runs this end to end:
(
ensure_repo.sh).ghit fetches thePR metadata, the diff / changed files, the commits, and every prior review,
inline comment, and top-level comment on the PR, plus the GraphQL review
threads so each existing comment is tagged OPEN / OUTDATED / RESOLVED. No
manual copy-pasting of the PR or its review history.
the audio knowledge sources (docs, code map, review lenses).
critique.
path:lineevidence andconcrete fixes, skipping anything other reviewers already raised.
post. The skill posts nothing on its own.
So the automation is in the pulling and analysis (the skill pulls the PR and
all comments on it for you); the posting stays a human decision.
Audio-only: scope is
nemo_curator/stages/audio/,nemo_curator/tasks/audio_task.py,tutorials/audio/,tests/stages/audio/,and audio benchmarks. The pull/digest steps abort on a PR that touches no audio
path, so the skill never reviews a non-audio PR.
Who it's for
and a first-pass set of findings to review, refine, and post.
human doesn't duplicate, and cites the canonical docs/rules behind each point.
anything on its own - the reviewer chooses what to submit.
Contents (
.cursor/skills/review-curator-audio-pr/)SKILL.md— reviewer workflow: locate → pull → diff → read knowledge sources→ overview-first → lenses + P0–P3 findings, plus how to provide a
gh-authenticated environment.knowledge-sources.md— audio docs (README, dev-guide slides, fern docs,.cursor/rules, CONTRIBUTING), code map, review lenses, severity guide, andthe post-refactor(audio): AudioBatch -> AudioTask single-dict redesign #1608 reviewer-comment corpus workflow.
templates.md— output layout (overview, digest, threads, comments).scripts/—ensure_repo.sh,pr_review_pull.sh(audio-only),build_digest.py(audio-only),pull_audio_pr_corpus.sh(incremental;--refreshto re-pull),build_corpus.py,README.md..gitignore— ignore scratch dir.curator-pr-review/.Test plan
bash -n/ast.parsefor all scripts.build_digest.py: aborts on non-audio PR; overview renders beforefindings; correct OPEN/OUTDATED/RESOLVED.
pull_audio_pr_corpus.sh: incremental skip of PRs already on disk.build_corpus.pyend-to-end on synthetic data.