39 changes: 39 additions & 0 deletions .agents/README.md
@@ -0,0 +1,39 @@
# `.agents/` — agent-agnostic source of truth

This directory is the canonical location for assets shared by AI coding agents
working in this repository (Claude Code, Codex, Cursor, …).

## Layout

```text
.agents/
├── skills/ # SKILL.md files (canonical)
│ └── <skill-name>/SKILL.md
├── scripts/ # shared helper scripts (sync-upstream-skills.sh, …)
└── clusters.yaml.example # remote-cluster config template
```

## Why this exists

Different agents look for skills/config in vendor-specific directories. Rather
than maintaining N copies that drift out of sync, **`.agents/` is the single
source of truth** — each agent's guidance or install mechanism points here
directly.
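
For example, wiring an agent-specific directory to the canonical copy is just a symlink (a minimal sketch; the CLAUDE.md change below confirms this is how `.claude/skills/` is set up, while other agents' paths may differ):

```bash
# From the repo root: point Claude Code's skill directory at the
# canonical copy instead of keeping a duplicate that can drift.
ln -s ../.agents/skills .claude/skills   # relative target resolves from .claude/
```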

## Editing rules

- **Always edit files under `.agents/`**.
- Vendored-verbatim skills (`launching-evals`, `accessing-mlflow`) are managed
by `.agents/scripts/sync-upstream-skills.sh` — do not modify by hand.
> **Collaborator (bot comment):** Nit: The link text references `.cursor/skills-cursor/create-skill/SKILL.md` (a path in this repo) but the URL points to https://docs.anthropic.com/ (Anthropic's docs site). These don't match — the link text suggests a local file while the URL goes to an external docs page. Either point to the actual repo path or update the link text to match the Anthropic docs URL.
>
> **Author:** Updated! I've removed this link.

- New skills go in `.agents/skills/<skill-name>/SKILL.md` following the
conventions of existing skills (e.g. `.agents/skills/monitor/SKILL.md`).

## Project-level cluster config

The remote-execution skills look for a `clusters.yaml` at, in order:

1. `~/.config/modelopt/clusters.yaml` (user-level, recommended)
2. `<repo-root>/.agents/clusters.yaml` (project-level, canonical)
3. `<repo-root>/.claude/clusters.yaml` (project-level, back-compat)

See `clusters.yaml.example` for the schema.
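
For example, a user-level config can be bootstrapped from the template (a minimal sketch, assuming you run it from the repo root):

```bash
mkdir -p ~/.config/modelopt
cp .agents/clusters.yaml.example ~/.config/modelopt/clusters.yaml
"${EDITOR:-vi}" ~/.config/modelopt/clusters.yaml   # fill in real cluster details
```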
@@ -1,6 +1,7 @@
# ModelOpt Remote Cluster Configuration
# Copy to ~/.config/modelopt/clusters.yaml (user-level, recommended)
# or .claude/clusters.yaml (project-level, can be committed).
# or .agents/clusters.yaml (project-level, can be committed).
# .claude/clusters.yaml is also accepted for back-compat.

clusters:
# GPU workstation or SLURM login node
@@ -21,15 +21,18 @@
# NOT managed by this script — update it manually when pulling upstream changes.
#
# Usage:
# .claude/scripts/sync-upstream-skills.sh # re-vendor at the pinned SHA
# UPSTREAM_SHA=<sha> .claude/scripts/sync-upstream-skills.sh # bump to a new SHA
# .agents/scripts/sync-upstream-skills.sh # re-vendor at the pinned SHA
# UPSTREAM_SHA=<sha> .agents/scripts/sync-upstream-skills.sh # bump to a new SHA
#
# Requires: gh, base64, awk. Run from the repo root.
#
# The script overwrites .claude/skills/<skill>/ with upstream contents and
# The script overwrites .agents/skills/<skill>/ with upstream contents and
# re-applies our provenance lines into each SKILL.md frontmatter. If you have
# local changes to a vendored skill, they will be lost — that is expected,
# since vendored-verbatim skills should not be modified locally.
#
# Note: .claude/skills/ (and other agent-specific skill dirs) are symlinks to
# .agents/skills/ — see .agents/README.md.

set -euo pipefail

@@ -40,7 +43,7 @@ SHORT_SHA="${SHA:0:7}"

UPSTREAM_REPO="NVIDIA-NeMo/Evaluator"
UPSTREAM_BASE="packages/nemo-evaluator-launcher/.claude/skills"
DEST_BASE=".claude/skills"
DEST_BASE=".agents/skills"

if [[ ! -d "$DEST_BASE" ]]; then
echo "error: run from the repo root (expected $DEST_BASE/ to exist)" >&2
@@ -116,7 +119,7 @@ inject_provenance() {
print "license: Apache-2.0"
print "# Vendored verbatim from NVIDIA NeMo Evaluator (commit " short ")"
print "# https://github.com/NVIDIA-NeMo/Evaluator/tree/" sha "/packages/nemo-evaluator-launcher/.claude/skills/" skill
print "# To re-sync: .claude/scripts/sync-upstream-skills.sh"
print "# To re-sync: .agents/scripts/sync-upstream-skills.sh"
if (extra != "") {
n = split(extra, lines, "\\|")
for (i = 1; i <= n; i++) print "# " lines[i]
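
For reference, the frontmatter lines this injects look roughly like the following (reconstructed from the `print` statements above; the commit hash, SHA, and skill name are illustrative):

```yaml
license: Apache-2.0
# Vendored verbatim from NVIDIA NeMo Evaluator (commit abc1234)
# https://github.com/NVIDIA-NeMo/Evaluator/tree/<sha>/packages/nemo-evaluator-launcher/.claude/skills/<skill>
# To re-sync: .agents/scripts/sync-upstream-skills.sh
```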
@@ -9,8 +9,9 @@ Read this when Claude Code runs on a different machine than the target GPU cluster
Config locations (checked in order, first found wins):

1. `~/.config/modelopt/clusters.yaml` — user-level (not committed, recommended)
2. `.claude/clusters.yaml` — project-level (can be committed for shared defaults)
3. Interactive input — if neither file exists, ask the user (see SKILL.md Step 0) and write `~/.config/modelopt/clusters.yaml` before proceeding
2. `.agents/clusters.yaml` — project-level, canonical (can be committed for shared defaults)
3. `.claude/clusters.yaml` — project-level, back-compat
4. Interactive input — if no file exists, ask the user (see SKILL.md Step 0) and write `~/.config/modelopt/clusters.yaml` before proceeding

```yaml
clusters:
  # … (rest of the example elided in this diff view)
```

@@ -38,7 +39,7 @@ rsync -av /path/to/local/checkpoint <cluster-login>:<cluster-workspace>/checkpoints/

Use the `workspace` path from your cluster config as the destination. Compute nodes on a given cluster share the same storage as its login node, so once staged, the path works everywhere on that cluster.

See `.claude/clusters.yaml.example` for a fully annotated example with multiple cluster types.
See `.agents/clusters.yaml.example` for a fully annotated example with multiple cluster types.

---

@@ -153,5 +154,5 @@ remote_sync_from <remote_output_subdir> /local/output/
## Reference Files

- **`skills/common/remote_exec.sh`** — Full utility library (session, run, sync, SLURM, Docker helpers)
- **`.claude/clusters.yaml`** — Active cluster configuration
- **`.claude/clusters.yaml.example`** — Annotated example config
- **`.agents/clusters.yaml`** — Active cluster configuration (canonical; `.claude/clusters.yaml` also accepted for back-compat)
- **`.agents/clusters.yaml.example`** — Annotated example config
@@ -41,12 +41,17 @@
# ── Helpers ──────────────────────────────────────────────────────────────────

_remote_config_file() {
# Find clusters.yaml: user-level > project-level
# Find clusters.yaml: user-level > project-level.
# Project-level is checked at .agents/clusters.yaml (canonical) and then
# .claude/clusters.yaml (back-compat).
local user_config="${HOME}/.config/modelopt/clusters.yaml"
local project_config
# Walk up from pwd looking for .agents/clusters.yaml, then .claude/clusters.yaml
local dir="$PWD"
while [[ "$dir" != "/" ]]; do
if [[ -f "$dir/.agents/clusters.yaml" ]]; then
project_config="$dir/.agents/clusters.yaml"
break
fi
if [[ -f "$dir/.claude/clusters.yaml" ]]; then
project_config="$dir/.claude/clusters.yaml"
break
@@ -196,7 +201,7 @@ remote_load_cluster() {
if [[ -z "$config_file" ]]; then
echo "ERROR: No clusters.yaml found. Provide cluster info interactively or create one." >&2
echo " User config: ~/.config/modelopt/clusters.yaml" >&2
echo " Project config: .claude/clusters.yaml" >&2
echo " Project config: .agents/clusters.yaml (or .claude/clusters.yaml)" >&2
return 1
fi
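
For orientation, a hypothetical check of which config file wins the lookup (this assumes `_remote_config_file` prints the chosen path, and that the library is sourced under the path named in this skill's reference list):

```bash
source skills/common/remote_exec.sh
cfg="$(_remote_config_file)" || true
echo "Active clusters.yaml: ${cfg:-<none found>}"
```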

@@ -215,7 +215,7 @@ which docker 2>/dev/null && echo "RUNTIME=docker"

| Runtime | Typical clusters | SLURM integration |
| --- | --- | --- |
| **enroot/pyxis** | NVIDIA internal (DGX Cloud, EOS, Selene, GCP-NRT) | `srun --container-image` |
| **enroot/pyxis** | HPC clusters with container runtime (e.g. DGX Cloud and similar Slurm + container setups) | `srun --container-image` |
| **Docker** | Bare-metal / on-prem with GPU | `docker run` inside job script |
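
Extending the `which docker` probe above, a hypothetical heuristic for telling the two runtimes apart (a sketch, not an official detection method; pyxis advertises `--container-image` in `srun --help`):

```bash
# Run on the cluster login node.
if command -v enroot >/dev/null 2>&1 || srun --help 2>&1 | grep -q -- --container-image; then
  echo "RUNTIME=enroot/pyxis"
elif command -v docker >/dev/null 2>&1; then
  echo "RUNTIME=docker"
else
  echo "RUNTIME=unknown"
fi
```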

### Step 2: Check credentials for the image's registry
File renamed without changes.
File renamed without changes.
File renamed without changes.
@@ -62,9 +62,9 @@ The complete evaluation workflow is divided into the following steps you should
# Key Facts

- Benchmark-specific info learned during launching/analyzing evals should be added to `references/benchmarks/`
- **PPP** = Slurm account (the `account` field in cluster_config.yaml). When the user says "change PPP to X", update the account value (e.g., `coreai_dlalgo_compeval` → `coreai_dlalgo_llm`).
- **PPP** = Slurm account / project portfolio code (the `account` field in cluster_config.yaml). When the user says "change PPP to X", update the account value (e.g., `<old_account>` → `<new_account>`).
- **Slurm job pairs**: NEL (nemo-evaluator-launcher) submits paired Slurm jobs — a RUNNING job + a PENDING restart job (for when the 4h walltime expires). Never cancel the pending restart jobs — they are expected and necessary.
- **HF cache requirement**: For configs with `HF_HUB_OFFLINE=1`, models must be pre-downloaded to the HF cache on each cluster before launching. **Before running a model on a new cluster, always ask the user if the model is already cached there.** If not, on the cluster login node: `python3 -m venv hf_cli && source hf_cli/bin/activate && pip install huggingface_hub` then `HF_HOME=/lustre/fsw/portfolios/coreai/users/<username>/cache/huggingface hf download <model>`. Without this, vLLM will fail with `LocalEntryNotFoundError`.
- **HF cache requirement**: For configs with `HF_HUB_OFFLINE=1`, models must be pre-downloaded to the HF cache on each cluster before launching. **Before running a model on a new cluster, always ask the user if the model is already cached there.** If not, on the cluster login node: `python3 -m venv hf_cli && source hf_cli/bin/activate && pip install huggingface_hub` then `HF_HOME=<your_hf_cache_path> hf download <model>` (on lustre-style HPC clusters this is typically under `/lustre/.../<group>/users/<username>/cache/huggingface`; the flow is sketched after this list). Without this, vLLM will fail with `LocalEntryNotFoundError`.
- **`data_parallel_size` is per node**: `dp_size=1` with `num_nodes=8` means 8 model instances total (one per node), load-balanced by haproxy. Do NOT interpret `dp_size` as the global replica count.
- **`payload_modifier` interceptor**: The `params_to_remove` list (e.g. `[max_tokens, max_completion_tokens]`) strips those fields from the outgoing payload, intentionally lifting output length limits so reasoning models can think as long as they need.
- **Auto-export git workaround**: The export container (`python:3.12-slim`) lacks `git`. When installing the launcher from a git URL, set `auto_export.launcher_install_cmd` to install git first (e.g., `apt-get update -qq && apt-get install -qq -y git && pip install "nemo-evaluator-launcher[all] @ git+...#subdirectory=packages/nemo-evaluator-launcher"`).
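
The HF pre-download flow from the cache bullet above, as a sketch (the cache path is illustrative; use your cluster's actual HF cache location):

```bash
# On the cluster login node, before launching the eval:
python3 -m venv hf_cli && source hf_cli/bin/activate
pip install huggingface_hub
# <model> is the HF handle; the cache path below is illustrative.
HF_HOME=/lustre/<site>/<group>/users/$USER/cache/huggingface hf download <model>
```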
@@ -70,7 +70,7 @@ tail -200 $LOGS/client-*.log
- **CUDA OOM**: Increase `deployment.tensor_parallel_size` to shard across more GPUs. For multi-node: increase `execution.num_nodes` and set `deployment.pipeline_parallel_size`. As last resort: add `--max-model-len <lower_value>` to `deployment.extra_args`. Do NOT quantize as a first fix — scale compute instead (see the config sketch after this list).
- **Missing model/checkpoint**: `FileNotFoundError` or `RepositoryNotFoundError` or `GatedRepoError: 403` — verify `deployment.checkpoint_path` or `deployment.hf_model_handle`. For gated models, set `HF_TOKEN` via `deployment.env_vars`.
- **Bad `extra_args`**: `unrecognized arguments` or `unexpected keyword argument` — check flags against deployment engine version. Some flags change between versions (e.g., `--rope-scaling` removed in vLLM > 0.11.0).
- **Image pull failure**: `manifest not found` or `pyxis: child 1 failed` — verify image tag exists. Drop `:5005` from GitLab container registry URLs.
- **Image pull failure**: `manifest not found` or `pyxis: child 1 failed` — verify image tag exists. If the image is on an on-prem GitLab registry, drop the registry port suffix (e.g. `:5005`) from the URL.
- **GPU driver mismatch**: `CUDA driver version is insufficient` — use an older container image matching the host CUDA driver.
- **Health check timeout / connection refused**: Server didn't start — check server logs first. Increase `execution.endpoint_readiness_timeout` (seconds). SLURM default: `null` (falls back to walltime).
- **Server crashed mid-eval**: `Connection reset by peer` — check server logs for OOM. Reduce `parallelism` (concurrent requests). Check SLURM logs for preemption or walltime exceeded.
@@ -80,7 +80,7 @@ tail -200 $LOGS/client-*.log
- **Config validation**: `MissingMandatoryValue` (unfilled `???`), `ValidationError` (type mismatch), `ScannerError` (invalid YAML) — run `--dry-run` to catch these upfront.
- **Walltime exceeded**: `CANCELLED DUE TO TIME LIMIT` — NEL submits paired restart jobs that automatically resume when walltime expires, so this is often expected behavior, not a failure. Only increase `execution.walltime` if the evaluation isn't making progress across restarts.
- **Preemption**: `CANCELLED DUE TO PREEMPTION` — the paired restart job should automatically resume. If it doesn't, use non-preemptible partition, or re-run.
- **Container not found**: Applies to both `deployment.image` and task-level eval container. Drop `:5005` from GitLab registry URLs.
- **Container not found**: Applies to both `deployment.image` and task-level eval container. For on-prem GitLab registries, drop the registry port suffix (e.g. `:5005`) from the URL.
- Troubleshooting docs: list files with WebFetch `https://api.github.com/repos/NVIDIA-NeMo/Evaluator/contents/docs/troubleshooting`, then fetch relevant ones from `https://raw.githubusercontent.com/NVIDIA-NeMo/Evaluator/main/docs/troubleshooting/<file>`
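
To show how the knobs mentioned above fit together, a hypothetical config fragment (key paths are taken from the bullets; the exact schema and sensible values depend on your deployment):

```yaml
deployment:
  tensor_parallel_size: 4            # CUDA OOM: shard across more GPUs
  pipeline_parallel_size: 2          # pair with execution.num_nodes for multi-node
  extra_args: "--max-model-len 8192" # last-resort OOM mitigation
  env_vars:
    HF_TOKEN: <your_token>           # gated models
execution:
  num_nodes: 2
  endpoint_readiness_timeout: 3600   # seconds; raise on health-check timeouts
  walltime: "04:00:00"               # only raise if evals stall across restarts
```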

**Fix Slurm invalid account/partition:**
File renamed without changes.
File renamed without changes.
File renamed without changes.
6 changes: 3 additions & 3 deletions .markdownlint-cli2.yaml
@@ -12,7 +12,7 @@ config:
MD059: false # no-hard-tabs

# Vendored upstream skills — kept byte-identical to upstream via
# .claude/scripts/sync-upstream-skills.sh; do not reformat.
# .agents/scripts/sync-upstream-skills.sh; do not reformat.
ignores:
- ".claude/skills/launching-evals/**"
- ".claude/skills/accessing-mlflow/**"
- ".agents/skills/launching-evals/**"
- ".agents/skills/accessing-mlflow/**"
5 changes: 5 additions & 0 deletions CLAUDE.md
@@ -9,6 +9,11 @@ Primarily Python codebase with optional C++/CUDA extensions supporting PyTorch,
> If a `CLAUDE.local.md` file exists alongside this file, read and respect it — it contains
> developer-specific overrides that supplement this shared guidance.

> **Skills live in `.agents/skills/`** — `.claude/skills/` is a symlink to
> `.agents/skills/` for back-compat. See `.agents/README.md` for the convention
> (used to share skills/scripts/cluster-config across Claude Code, Codex, Cursor,
> etc.). Always edit files under `.agents/`, not the symlink path.

## Rules (Read First)

**CRITICAL (YOU MUST):**