[SKILL.md Chore] make .agents/ the cannonical agent-skills location#1362
[SKILL.md Chore] make .agents/ the cannonical agent-skills location#1362shljessie wants to merge 5 commits intoNVIDIA:mainfrom
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (44)
✅ Files skipped from review due to trivial changes (7)
🚧 Files skipped from review as they are similar to previous changes (3)
📝 WalkthroughWalkthroughMakes Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes 🚥 Pre-merge checks | ✅ 6✅ Passed checks (6 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Review rate limit: 8/10 reviews remaining, refill in 10 minutes and 8 seconds. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
.agents/clusters.yaml.example (1)
13-17: Consider documenting all supported optional cluster keys in the example.
remote_exec.shalso readscontainer_imageandenv_type; adding them as commented examples here would reduce guesswork for first-time setup.💡 Suggested template addition
workspace: /path/to/remote/workdir gpu_type: H100 # used for quantization format recommendation + # env_type: auto # optional: auto|slurm|docker|bare + # container_image: nvcr.io/nvidia/pytorch:xx.xx-py3 # optional (docker mode) # slurm: # default_account: my_account # default_partition: batch_short🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.agents/clusters.yaml.example around lines 13 - 17, Add the optional cluster keys that remote_exec.sh reads to the example YAML so users see all supported keys; specifically include commented examples for container_image and env_type alongside existing keys like workspace, gpu_type and the slurm block, showing typical values and short inline comments (e.g., container_image: my-repo/image:tag and env_type: conda|venv|docker) so first-time setup is clearer.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.agents/skills/common/remote_exec.sh:
- Around line 44-47: Update the documentation in
.agents/skills/common/remote-execution.md to match the lookup order implemented
in the script: document that the project-level clusters file is checked at
.agents/clusters.yaml first (canonical) and then .claude/clusters.yaml for
backward compatibility, mirroring the comment and variables in remote_exec.sh
(see user_config and the project-level lookup comments). Make the same wording
change in the other referenced sections (around the other occurrences noted) so
all mentions consistently show ".agents/clusters.yaml" as preferred and
".claude/clusters.yaml" as back-compat.
---
Nitpick comments:
In @.agents/clusters.yaml.example:
- Around line 13-17: Add the optional cluster keys that remote_exec.sh reads to
the example YAML so users see all supported keys; specifically include commented
examples for container_image and env_type alongside existing keys like
workspace, gpu_type and the slurm block, showing typical values and short inline
comments (e.g., container_image: my-repo/image:tag and env_type:
conda|venv|docker) so first-time setup is clearer.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 31344d7b-2cf1-4d55-ac87-9dbaf8013136
📒 Files selected for processing (48)
.agents/README.md.agents/clusters.yaml.example.agents/scripts/sync-upstream-skills.sh.agents/skills/accessing-mlflow/SKILL.md.agents/skills/common/credentials.md.agents/skills/common/environment-setup.md.agents/skills/common/remote-execution.md.agents/skills/common/remote_exec.sh.agents/skills/common/slurm-setup.md.agents/skills/common/workspace-management.md.agents/skills/debug/SKILL.md.agents/skills/deployment/SKILL.md.agents/skills/deployment/references/setup.md.agents/skills/deployment/references/sglang.md.agents/skills/deployment/references/support-matrix.md.agents/skills/deployment/references/trtllm.md.agents/skills/deployment/references/unsupported-models.md.agents/skills/deployment/references/vllm.md.agents/skills/deployment/scripts/deploy.sh.agents/skills/deployment/tests/evals.json.agents/skills/evaluation/SKILL.md.agents/skills/evaluation/references/model-card-research.md.agents/skills/evaluation/references/multi-node.md.agents/skills/evaluation/references/quantization-benchmarks.md.agents/skills/evaluation/tests/evals.json.agents/skills/launching-evals/SKILL.md.agents/skills/launching-evals/references/analyze-results.md.agents/skills/launching-evals/references/benchmarks/swebench-general-info.md.agents/skills/launching-evals/references/benchmarks/terminal-bench-general-info.md.agents/skills/launching-evals/references/benchmarks/terminal-bench-trace-analysis.md.agents/skills/launching-evals/references/check-progress.md.agents/skills/launching-evals/references/debug-failed-runs.md.agents/skills/launching-evals/references/run-evaluation.md.agents/skills/launching-evals/tests.json.agents/skills/monitor/SKILL.md.agents/skills/ptq/SKILL.md.agents/skills/ptq/references/checkpoint-validation.md.agents/skills/ptq/references/launcher-guide.md.agents/skills/ptq/references/slurm-setup-ptq.md.agents/skills/ptq/references/unsupported-models.md.agents/skills/ptq/tests.json.agents/skills/release-cherry-pick/SKILL.md.claude/clusters.yaml.example.claude/clusters.yaml.example.claude/scripts.claude/skills.markdownlint-cli2.yamlCLAUDE.md
NVIDIA-internal references in
|
| # | File:line | Finding | Recommended action |
|---|---|---|---|
| H1 | .agents/skills/launching-evals/SKILL.md:65 |
Real internal Slurm account names coreai_dlalgo_compeval and coreai_dlalgo_llm shown as the example values for the "PPP → X" rename pattern. |
Replace with placeholders, e.g. <old_account> → <new_account>, or generic names like team_dl_eval → team_dl_llm. But see "vendored-skill caveat" below. |
| H2 | .agents/skills/launching-evals/SKILL.md:67 |
Hard-coded internal lustre layout: /lustre/fsw/portfolios/coreai/users/<username>/cache/huggingface. This path schema (/lustre/fsw/portfolios/<group>/users/<user>/) is the canonical NVIDIA cluster layout. |
Generalise to <your_hf_cache_path> or $HF_HOME and a parenthetical "(e.g. on lustre clusters: /lustre/...)". Vendored-skill caveat applies. |
| H3 | .agents/skills/launching-evals/references/debug-failed-runs.md:73 and :83 |
Drop ":5005" from GitLab container registry URLs. Port 5005 is the standard port for gitlab-master.nvidia.com:5005 (NV's internal GitLab container registry). The advice is meaningless outside that environment, so it's a strong tell. |
Either remove the lines, or rephrase to "Drop the registry port from GitLab container registry URLs (e.g. :5005 on some on-prem GitLab instances)". Vendored-skill caveat applies. |
| H4 | .agents/skills/common/slurm-setup.md:218 |
enroot/pyxis ... NVIDIA internal (DGX Cloud, EOS, Selene, GCP-NRT). Hits two of the keywords on your list (eos, selene) plus an internal codename (GCP-NRT), and explicitly labels them "NVIDIA internal". |
Replace with vendor-neutral wording, e.g. enroot/pyxis ... HPC clusters with container runtime support (DGX Cloud and similar) .... This file is local (not vendored), so safe to edit directly. |
|
Pinging for review — this is a docs / repo-housekeeping change moving @kaix-nv Wondering if you could take a look? |
cjluo-nv
left a comment
There was a problem hiding this comment.
Bot review — DM the bot to share feedback.
This is a clean repo-housekeeping PR that moves agent skill files from .claude/ to .agents/ as the canonical location, with symlinks for backward compatibility. The approach (canonical dir + symlinks) is the simplest possible solution — no new code or abstractions. Most of the 47 files are pure renames with no content changes; actual new/modified content is ~127 lines of documentation and path updates.
One minor issue found: a broken/misleading link in the README. Otherwise the changes look correct and well-documented.
| Edits via the symlink work, but the diff will look like changes to | ||
| `.agents/...` either way; editing the canonical path makes that explicit. | ||
| - Vendored-verbatim skills (`launching-evals`, `accessing-mlflow`) are managed | ||
| by `.agents/scripts/sync-upstream-skills.sh` — do not modify by hand. |
There was a problem hiding this comment.
Bot comment.
Nit: The link text references .cursor/skills-cursor/create-skill/SKILL.md (a path in this repo) but the URL points to https://docs.anthropic.com/ (Anthropic's docs site). These don't match — the link text suggests a local file while the URL goes to an external docs page. Either point to the actual repo path or update the link text to match the Anthropic docs URL.
There was a problem hiding this comment.
Updated! I've removed this link.
|
@shljessie please make sure to sign-off your commits so DCO check can pass. Take a look at https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/CONTRIBUTING.md#%EF%B8%8F-signing-your-work |
|
Should we also rename |
| - Vendored-verbatim skills (`launching-evals`, `accessing-mlflow`) are managed | ||
| by `.agents/scripts/sync-upstream-skills.sh` — do not modify by hand. | ||
| - New skills go in `.agents/skills/<skill-name>/SKILL.md` following the | ||
| conventions documented in [`.cursor/skills-cursor/create-skill/SKILL.md`](https://docs.anthropic.com/) (or your agent's equivalent). |
There was a problem hiding this comment.
Updated! I've removed this link.
| ## A note on Windows | ||
|
|
||
| Git stores symlinks portably, but Windows requires either Developer Mode or | ||
| `git config --global core.symlinks true` plus admin rights for them to | ||
| materialise correctly. If you're on Windows and skills aren't being picked | ||
| up under `.claude/skills/`, that's the most likely cause — `.agents/skills/` | ||
| will still work directly. |
There was a problem hiding this comment.
I agree the shared .agents/skills/ direction is good, but I’d prefer not to make committed symlinks since symlinks are fragile across Windows checkouts, Docker build contexts, CI tools, and some linters/tooling. Could we instead make .agents/skills/ the canonical path and have each agent’s guidance/config point directly to that path?
There was a problem hiding this comment.
That sounds good. So for this PR I will remove the symlinks and update .agents/README.md to drop the symlink guidance.
|
I’d prefer to install the skills via the marketplace. I have a PR for Claude Code: #1141. We need to test later whether the canonical agent skills can be installed in both Claude Code and Codex. |
Signed-off-by: Seonghee Lee <seongheel@nvidia.com>
Surfaced by an internal-keyword scan over .agents/skills/. All four findings replaced with vendor-neutral wording: - launching-evals/SKILL.md: replace concrete Slurm account names (coreai_dlalgo_compeval / coreai_dlalgo_llm) used as the "PPP -> X" rename example with placeholders <old_account> / <new_account>. - launching-evals/SKILL.md: generalise the HF cache path from /lustre/fsw/portfolios/coreai/users/<username>/cache/huggingface to HF_HOME=<your_hf_cache_path>, with a parenthetical note that lustre- style HPC clusters typically organise this under /lustre/.../<group>/users/<username>/... - launching-evals/references/debug-failed-runs.md: rephrase the "Drop ':5005' from GitLab container registry URLs" advice (port 5005 is the standard port for an on-prem GitLab container registry; the raw advice only made sense in that context) to a vendor-neutral "If the image is on an on-prem GitLab registry, drop the registry port suffix (e.g. ':5005') from the URL." Applied at both occurrences. - common/slurm-setup.md: change the enroot/pyxis "Typical clusters" cell from "NVIDIA internal (DGX Cloud, EOS, Selene, GCP-NRT)" to "HPC clusters with container runtime (e.g. DGX Cloud and similar Slurm + container setups)" -- removes internal cluster codenames (EOS, Selene, GCP-NRT) and the "NVIDIA internal" label. Caveat: the three launching-evals/* files are vendored verbatim from NVIDIA-NeMo/Evaluator (per the provenance header injected by .agents/scripts/sync-upstream-skills.sh). The next sync will overwrite them. Follow-ups: (1) upstream MR against NVIDIA-NeMo/Evaluator, and/or (2) add a redaction post-process to sync-upstream-skills.sh so the scrub survives re-syncs. Signed-off-by: Seonghee Lee <seongheel@nvidia.com> Made-with: Cursor Signed-off-by: Seonghee Lee <seongheel@nvidia.com>
Signed-off-by: Seonghee Lee <seongheel@nvidia.com>
Signed-off-by: Seonghee Lee <seongheel@nvidia.com>
Signed-off-by: Seonghee Lee <seongheel@nvidia.com>
665ff90 to
510f36d
Compare
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #1362 +/- ##
==========================================
- Coverage 76.93% 75.68% -1.25%
==========================================
Files 471 471
Lines 50404 52311 +1907
==========================================
+ Hits 38776 39591 +815
- Misses 11628 12720 +1092
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
I prefer to update the source in marketplace.json to point directly to .agents/skills. |
What does this PR do?
Type of change: documentation
Centralizing skills under .agents/ (with .claude/, .codex/, .cursor/, … as symlinks) gives us a single source of truth so the same SKILL.md works across every coding agent without maintaining N copies that drift out of sync.
repo-root/ ├── .agents/ ← canonical source of truth │ ├── README.md │ ├── clusters.yaml.example │ ├── scripts/ │ │ └── sync-upstream-skills.sh │ └── skills/ │ ├── accessing-mlflow/SKILL.md │ ├── common/ │ ├── debug/SKILL.md │ ├── deployment/SKILL.md │ ├── evaluation/SKILL.md │ ├── launching-evals/SKILL.md │ ├── monitor/SKILL.md │ ├── ptq/SKILL.md │ └── release-cherry-pick/SKILL.md │ ├── .claude/ ← back-compat (all symlinks) │ ├── clusters.yaml.example → ../.agents/clusters.yaml.example │ ├── scripts → ../.agents/scripts │ └── skills → ../.agents/skills │ └── (future agents — add a symlink, no copies) ├── .codex/skills → ../.agents/skills └── .cursor/skills → ../.agents/skillsUsage
No user-facing API change. For developers/agents working in this repo:
Testing
Before your PR is "Ready for review"
Make sure you read and follow Contributor guidelines and your commits are signed (
git commit -s -S).Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded
trust_remote_code=True,torch.load(..., weights_only=False),pickle, etc.)..claude/skills/, .claude/scripts/, and .claude/clusters.yaml.example continue to resolve to the same content via symlinks. Claude Code's auto-discovery of skills under .claude/skills/ is unaffected. remote_exec.sh still accepts .claude/clusters.yaml.
repo housekeeping only, no API/feature/bugfix change. Happy to add a Misc line if maintainers prefer.
Additional Information
Summary by CodeRabbit