39 changes: 38 additions & 1 deletion .claude/skills/evaluation/SKILL.md
@@ -40,6 +40,14 @@ Test that `nel` is installed with `nel --version`. If not, instruct the user to

If the user already has a config file (e.g., "run this config", "evaluate with my-config.yaml"), skip to Step 8. Optionally review it for common issues (missing `???` values, quantization flags) before running.

**Shortcut: use pre-built task snippets.** If the user asks for a specific benchmark (e.g., "run MMLU-Pro", "evaluate with AIME"), check `recipes/tasks/` (relative to this skill's directory) for a matching task snippet. Available: mmlu_pro, gpqa, aime2025, livecodebench, ifbench, scicode. Task snippets contain only the task-specific config (name, params, repeats) — not the full NEL config. To use them:

1. Read the task snippet(s) the user wants
2. Use `recipes/examples/example_eval.yaml` as the base config template
3. Replace the `tasks:` section with the selected snippet(s)
4. Do Step 3 (auto-detect model settings from checkpoint) and Step 4 (fill in `???` values)
5. Proceed to Step 7.5/8
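As an illustrative sketch of the composition steps above (the helper names are hypothetical, not part of NEL), composing a config amounts to swapping the base template's `tasks:` list for the selected snippets and then listing any `???` placeholders still left to fill in:

```python
# Hypothetical sketch of the task-snippet composition (names are
# illustrative, not the actual NEL tooling). Configs are shown as plain
# dicts, as if parsed from the YAML files.

def compose_config(base: dict, task_snippets: list) -> dict:
    """Replace the base config's tasks: section with the selected snippets.

    Each snippet is a list of task dicts, as in recipes/tasks/*.yaml.
    """
    config = dict(base)  # shallow copy is enough for a top-level swap
    config["tasks"] = [task for snippet in task_snippets for task in snippet]
    return config

def unfilled_placeholders(config: dict, prefix: str = "") -> list:
    """Step 4 helper: list dotted paths still set to the '???' placeholder."""
    missing = []
    for key, value in config.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            missing += unfilled_placeholders(value, path + ".")
        elif value == "???":
            missing.append(path)
    return missing

base = {"execution": {"hostname": "???", "account": "???"}, "tasks": []}
gpqa = [{"name": "ns_gpqa"}]
aime = [{"name": "ns_aime2025"}]
cfg = compose_config(base, [gpqa, aime])
print([t["name"] for t in cfg["tasks"]])  # ['ns_gpqa', 'ns_aime2025']
print(unfilled_placeholders(cfg))  # ['execution.hostname', 'execution.account']
```

In practice the agent edits the YAML directly; the dict version just makes the swap-then-check order explicit.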

**Step 2: Build the base config file**

Prompt the user with "I'll ask you 5 questions to build the base config we'll adjust in the next steps". Guide the user through the 5 questions using AskUserQuestion:
@@ -123,6 +131,29 @@ If no `hf_quant_config.json`, also check `config.json` for a `quantization_confi

> **Note:** Some models require additional env vars for deployment (e.g., `VLLM_NVFP4_GEMM_BACKEND=marlin` for Nemotron Super). These are not in `hf_quant_config.json` — they are discovered during model card research below.

**Auto-detect deployment settings from checkpoint:**

Read `config.json` from the checkpoint (or HF model card) and build `deployment.extra_args` dynamically:

```bash
cat <checkpoint_path>/config.json 2>/dev/null
```

| Field in `config.json` | What to set | Example |
| --- | --- | --- |
| `max_position_embeddings` | `--max-model-len <value>` | `131072` → `--max-model-len 131072` |
| `auto_map` exists | `--trust-remote-code` | Only add if model has custom code |

Comment on lines +142 to +146 (Contributor):

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

In vLLM, what are the security implications of using --trust-remote-code, and is it recommended to enable it by default based only on Hugging Face config.json auto_map?

💡 Result:

Using --trust-remote-code in vLLM allows execution of untrusted Python code from Hugging Face model repositories during model loading and tokenizer initialization. This poses severe security risks, including remote code execution (RCE), as attackers can craft malicious models that execute arbitrary code on the host system when loaded. Multiple CVEs highlight this danger, even when the flag is intended to be False:

- CVE-2025-66448 and GHSA-8fr4-5q9j-m8gm: `auto_map` in `config.json` triggers `get_class_from_dynamic_module`, fetching and executing remote code bypassing `trust_remote_code=False` (fixed in v0.11.1+ with domain validation).
- CVE-2026-27893 and GHSA-7972-pg2x-xr59: Hardcoded `trust_remote_code=True` in model files (e.g., Nemotron-VL, Kimi-K25) overrides the user setting (fixed in v0.18.0+).
- GHSA-2pc9-4j83-qjmr: Unconditional `auto_map` resolution without a `trust_remote_code` check (fixed in v0.14.0).

Official vLLM docs confirm the flag defaults to False in both the LLM class and the `vllm serve` CLI, explicitly for trusting remote code from Hugging Face. It is not recommended to enable --trust-remote-code by default, especially based solely on `config.json` `auto_map`, as this field has been exploited for RCE. Only enable it explicitly for trusted models requiring custom code, after verifying the repository. Use the latest vLLM version (e.g., v0.20.0 as of 2026-04-27) with patches applied, and prefer models without remote code needs.

Don't auto-enable --trust-remote-code based on auto_map without explicit user confirmation.

Automatically enabling this flag when auto_map exists in config.json creates a remote code execution (RCE) risk. The vLLM security advisories (CVE-2025-66448, CVE-2026-27893, GHSA-8fr4-5q9j-m8gm) document multiple instances where attackers exploited auto_map to execute arbitrary code during model loading. Official vLLM documentation explicitly recommends keeping this flag disabled by default. Only enable after explicit user confirmation and verification that the model is from a trusted source.

Suggested wording adjustment
-| `auto_map` exists | `--trust-remote-code` | Only add if model has custom code |
+| `auto_map` exists | Ask user to explicitly confirm `--trust-remote-code` | Explain this allows execution of model-provided remote code; add only after user confirms and verifies model trustworthiness |
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
| Field in `config.json` | What to set | Example |
| --- | --- | --- |
| `max_position_embeddings` | `--max-model-len <value>` | `131072` → `--max-model-len 131072` |
| `auto_map` exists | `--trust-remote-code` | Only add if model has custom code |
| Field in `config.json` | What to set | Example |
| --- | --- | --- |
| `max_position_embeddings` | `--max-model-len <value>` | `131072` → `--max-model-len 131072` |
| `auto_map` exists | Ask user to explicitly confirm `--trust-remote-code` | Explain this allows execution of model-provided remote code; add only after user confirms and verifies model trustworthiness |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/evaluation/SKILL.md around lines 136-140: the doc currently
implies automatically enabling --trust-remote-code when auto_map exists; change
this to explicitly warn against auto-enabling and instruct readers to require
explicit user confirmation before setting --trust-remote-code. Update the table
row referencing `auto_map` and `--trust-remote-code` to state “Do not enable by
default; require explicit confirmation and verification of model provenance
(trusted source)”, and add a short advisory note referencing vLLM security best
practices and the RCE risks (auto_map) so maintainers/users must opt-in after
verification.

Then use WebSearch to check the model card (HuggingFace page) for deployment-specific settings:

| Model card signal | What to set |
| --- | --- |
| Reasoning model (thinking/CoT) | `--reasoning-parser` and `--reasoning-parser-plugin` if a custom parser is provided |
| Tool-calling support | `--enable-auto-tool-choice --tool-call-parser <parser>` |
| Custom vLLM flags documented | Add as specified (e.g., `--mamba_ssm_cache_dtype float32`) |

Combine all detected flags into a single `deployment.extra_args` override. The recipe's default `--max-model-len 32768` is a fallback — always prefer the value from `config.json`.
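A minimal sketch of this detection logic (the function name and the confirmation flag are illustrative, not an actual NEL API); it also reflects the safe default of never adding `--trust-remote-code` without explicit user confirmation:

```python
# Hypothetical helper: map fields parsed from config.json to vLLM
# deployment flags, per the table above. trust_remote_code_confirmed
# stands in for an explicit user opt-in; it defaults to False so that
# auto_map alone never enables remote code execution.

def detect_extra_args(config_json: dict, trust_remote_code_confirmed: bool = False) -> str:
    args = []
    max_len = config_json.get("max_position_embeddings")
    if max_len:
        # Prefer the checkpoint's context length over the recipe default.
        args.append(f"--max-model-len {max_len}")
    else:
        args.append("--max-model-len 32768")  # recipe fallback
    if "auto_map" in config_json and trust_remote_code_confirmed:
        args.append("--trust-remote-code")
    return " ".join(args)

cfg = {"max_position_embeddings": 131072, "auto_map": {"AutoModel": "custom.Model"}}
print(detect_extra_args(cfg))
# --max-model-len 131072   (no --trust-remote-code without confirmation)
```

The result string slots directly into a `-o 'deployment.extra_args=...'` override.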

**Quantization-aware benchmark defaults:**

When a quantized checkpoint is detected, read `references/quantization-benchmarks.md` for benchmark sensitivity rankings and recommended sets. Present recommendations to the user and ask which to include.
Expand Down Expand Up @@ -218,7 +249,13 @@ ssh <host> "grep -E '^\s*machine\s+' ~/.config/enroot/.credentials 2>/dev/null"

Print the following commands to the user. Propose to execute them in order to confirm the config works as expected before the full run.

-**Important**: Export required environment variables based on your config. If any tokens or keys are missing (e.g. `HF_TOKEN`, `NGC_API_KEY`, `api_key_name` from the config), ask the user to put them in a `.env` file in the project root so you can run `set -a && source .env && set +a` (or equivalent) before executing `nel run` commands.
+**Important**: Export required environment variables based on your config. If any tokens or keys are missing, point the user to `recipes/env.example` — it lists all possible keys with notes on which tasks need them. Ask the user to copy it, fill in their keys, and source it:

```bash
cp recipes/env.example .env
# Edit .env with your keys
set -a && source .env && set +a
```

```bash
# If using pre_cmd or post_cmd (review pre_cmd content before enabling — it runs arbitrary commands):
```
29 changes: 29 additions & 0 deletions .claude/skills/evaluation/recipes/env.example
@@ -0,0 +1,29 @@
# Evaluation API Keys
#
# Copy this file and fill in the keys you need:
# cp recipes/env.example .env
# # Edit .env with your keys
# set -a && source .env && set +a
#
# Not all keys are required — only fill in what your tasks need.

# Required for all tasks (model/dataset downloads)
HF_TOKEN=hf_...

# Required for nemo_skills.* tasks (dummy value, not a real key)
DUMMY_API_KEY=dummy

# Required for NEL pre_cmd execution
NEMO_EVALUATOR_TRUST_PRE_CMD=1

# --- Optional: task-specific keys ---

# AIME 2025 (simple_evals variant only, not ns_aime2025)
# JUDGE_API_KEY=

# tau2_bench_telecom (LLM judge)
# JUDGE_API_KEY_NVDEV_QWEN235B=

# terminal-bench-hard (AWS sandbox)
# AWS_ACCESS_KEY_ID=
# AWS_SECRET_ACCESS_KEY=
122 changes: 122 additions & 0 deletions .claude/skills/evaluation/recipes/examples/example_eval.yaml
@@ -0,0 +1,122 @@
# Example: Quantization Validation Suite
Collaborator:
NEL config yaml may change quite frequently. Is this yaml for demo purposes or for day-0 model evals (internal usage)?

Also, some of the evals require a pinned eval docker image and specific settings for apples-to-apples comparison.

Contributor Author:
These recipes are for demo purposes. The task snippets are building blocks that the agent composes into a working config. If NEL configs change and something breaks, the agent will diagnose and fix the incompatibility at runtime.

Contributor Author:
For evals that require pinned docker images and specific settings for apples-to-apples comparison, users can override the container.

#
# A balanced set of benchmarks for validating quantized model quality.
# Copy this file and customize for your needs.
# Task snippets in recipes/tasks/ define per-task configs — the agent
# composes them into a runnable config like this one.
#
# Includes:
# - MMLU-Pro (knowledge, completions)
# - GPQA Diamond (reasoning, chat, 5 repeats)
# - LiveCodeBench v6 (code, chat, 3 repeats)
# - IFBench (instruction following, chat, 8 repeats)
#
# Usage:
# nel run --config recipes/examples/example_eval.yaml \
# -o deployment.checkpoint_path=/path/to/quantized/checkpoint \
# -o deployment.served_model_name=my-model-nvfp4 \
# -o execution.hostname=<slurm_host> \
# -o execution.account=<slurm_account> \
# -o execution.output_dir=/path/to/output
#
# For quantized checkpoints, also add the quantization flag:
# -o 'deployment.extra_args=--max-model-len 32768 --trust-remote-code --quantization modelopt_fp4'
#
# Run a single task:
# nel run --config ... -t ns_gpqa
#
# Smoke test (2 samples):
# nel run --config ... -o ++evaluation.nemo_evaluator_config.config.params.limit_samples=2
defaults:
- execution: slurm/default
- deployment: vllm
- _self_
execution:
hostname: ???
username: ${oc.env:USER}
account: ???
output_dir: ???
walltime: "04:00:00"
mounts:
mount_home: false
deployment:
env_vars:
HF_TOKEN: host:HF_TOKEN
checkpoint_path: ???
hf_model_handle:
served_model_name: ???
tensor_parallel_size: 1
data_parallel_size: 1
# For models with custom code, add: --trust-remote-code
extra_args: --max-model-len 32768
Contributor:
Can we let the agent decide the extra_args required based on the model card/config of the checkpoint? e.g., model-len, tool-call-parser, reasoning-parser ...

Contributor Author:
I've added a section for auto-detecting deployment settings from checkpoint.

evaluation:
env_vars:
HF_TOKEN: host:HF_TOKEN
nemo_evaluator_config:
config:
params:
request_timeout: 3600
max_retries: 10
parallelism: 16
target:
api_endpoint:
api_key_name: DUMMY_API_KEY
tasks:
# Knowledge (chat endpoint, short)
- name: ns_mmlu_pro
nemo_evaluator_config:
config:
params:
extra:
num_repeats: 1
args: ++prompt_config=eval/aai/mcq-10choices-boxed ++inference.tokens_to_generate=null
target:
api_endpoint:
adapter_config:
params_to_remove:
- max_new_tokens
- max_completion_tokens

# Reasoning (chat endpoint, 5 repeats, short)
- name: ns_gpqa
nemo_evaluator_config:
config:
params:
extra:
args: ++prompt_config=eval/aai/mcq-4choices
num_repeats: 5
target:
api_endpoint:
adapter_config:
params_to_remove:
- max_new_tokens
- max_completion_tokens

# Code (chat endpoint, 3 repeats, medium)
- name: ns_livecodebench
nemo_evaluator_config:
config:
params:
extra:
dataset_split: test_v6_2408_2505
num_repeats: 3
target:
api_endpoint:
adapter_config:
params_to_remove:
- max_new_tokens
- max_completion_tokens

# Instruction following (chat endpoint, 8 repeats, super short)
- name: ns_ifbench
nemo_evaluator_config:
config:
params:
extra:
num_repeats: 8
target:
api_endpoint:
adapter_config:
params_to_remove:
- max_new_tokens
- max_completion_tokens
19 changes: 19 additions & 0 deletions .claude/skills/evaluation/recipes/tasks/aime2025.yaml
@@ -0,0 +1,19 @@
# AIME 2025 (NeMo Skills, chat)
# Primary metric: pass@1[avg-of-16] symbolic_correct
# Run time: Long (reasoning models generate lengthy thinking traces) | Repeats: 16
# Note: The AA variant (simple_evals.AIME_2025) requires JUDGE_API_KEY.
# This NeMo Skills variant uses symbolic scoring — no external API keys needed.
- name: ns_aime2025
nemo_evaluator_config:
config:
params:
request_timeout: 100000
max_retries: 10
extra:
num_repeats: 16
target:
api_endpoint:
adapter_config:
params_to_remove:
- max_new_tokens
- max_completion_tokens
16 changes: 16 additions & 0 deletions .claude/skills/evaluation/recipes/tasks/gpqa.yaml
@@ -0,0 +1,16 @@
# GPQA Diamond (NeMo Skills, chat)
# Primary metric: pass@1[avg-of-5] symbolic_correct
# Run time: Short | Repeats: 5
- name: ns_gpqa
nemo_evaluator_config:
config:
params:
extra:
args: ++prompt_config=eval/aai/mcq-4choices
num_repeats: 5
target:
api_endpoint:
adapter_config:
params_to_remove:
- max_new_tokens
- max_completion_tokens
15 changes: 15 additions & 0 deletions .claude/skills/evaluation/recipes/tasks/ifbench.yaml
@@ -0,0 +1,15 @@
# IFBench (NeMo Skills, chat)
# Primary metric: pass@1[avg-of-8] prompt_strict_accuracy
# Run time: Super Short | Repeats: 8
- name: ns_ifbench
nemo_evaluator_config:
config:
params:
extra:
num_repeats: 8
target:
api_endpoint:
adapter_config:
params_to_remove:
- max_new_tokens
- max_completion_tokens
17 changes: 17 additions & 0 deletions .claude/skills/evaluation/recipes/tasks/livecodebench.yaml
@@ -0,0 +1,17 @@
# LiveCodeBench v6 (NeMo Skills, chat)
# Primary metric: pass@1[avg-of-3] accuracy
# Run time: Medium | Repeats: 3
- name: ns_livecodebench
nemo_evaluator_config:
config:
params:
max_retries: 10
extra:
dataset_split: test_v6_2408_2505
num_repeats: 3
target:
api_endpoint:
adapter_config:
params_to_remove:
- max_new_tokens
- max_completion_tokens
16 changes: 16 additions & 0 deletions .claude/skills/evaluation/recipes/tasks/mmlu_pro.yaml
@@ -0,0 +1,16 @@
# MMLU-Pro (NeMo Skills, chat)
# Primary metric: symbolic_correct
# Run time: Short | Repeats: 1
- name: ns_mmlu_pro
nemo_evaluator_config:
config:
params:
extra:
num_repeats: 1
args: ++prompt_config=eval/aai/mcq-10choices-boxed ++inference.tokens_to_generate=null
target:
api_endpoint:
adapter_config:
params_to_remove:
- max_new_tokens
- max_completion_tokens
16 changes: 16 additions & 0 deletions .claude/skills/evaluation/recipes/tasks/scicode.yaml
@@ -0,0 +1,16 @@
# SciCode (NeMo Skills, chat)
# Primary metric: pass@1[avg-of-3] subtask_accuracy
# Run time: Long | Repeats: 3
- name: ns_scicode
nemo_evaluator_config:
config:
params:
max_retries: 10
extra:
num_repeats: 3
target:
api_endpoint:
adapter_config:
params_to_remove:
- max_new_tokens
- max_completion_tokens