diff --git a/.claude/skills/evaluation/SKILL.md b/.claude/skills/evaluation/SKILL.md
index 1dd8aa27067..69920814828 100644
--- a/.claude/skills/evaluation/SKILL.md
+++ b/.claude/skills/evaluation/SKILL.md
@@ -40,6 +40,14 @@ Test that `nel` is installed with `nel --version`. If not, instruct the user to
 
 If the user already has a config file (e.g., "run this config", "evaluate with my-config.yaml"), skip to Step 8. Optionally review it for common issues (missing `???` values, quantization flags) before running.
 
+**Shortcut: use pre-built task snippets.** If the user asks for a specific benchmark (e.g., "run MMLU-Pro", "evaluate with AIME"), check `recipes/tasks/` (relative to this skill's directory) for a matching task snippet. Available: mmlu_pro, gpqa, aime2025, livecodebench, ifbench, scicode. Task snippets contain only the task-specific config (name, params, repeats) — not the full NEL config. To use them:
+
+1. Read the task snippet(s) the user wants
+2. Use `recipes/examples/example_eval.yaml` as the base config template
+3. Replace the `tasks:` section with the selected snippet(s)
+4. Do Step 3 (auto-detect model settings from checkpoint) and Step 4 (fill in `???` values)
+5. Proceed to Step 7.5/8
+
 **Step 2: Build the base config file**
 
 Prompt the user with "I'll ask you 5 questions to build the base config we'll adjust in the next steps". Guide the user through the 5 questions using AskUserQuestion:
@@ -123,6 +131,29 @@ If no `hf_quant_config.json`, also check `config.json` for a `quantization_confi
 
 > **Note:** Some models require additional env vars for deployment (e.g., `VLLM_NVFP4_GEMM_BACKEND=marlin` for Nemotron Super). These are not in `hf_quant_config.json` — they are discovered during model card research below.
 
+**Auto-detect deployment settings from checkpoint:**
+
+Read `config.json` from the checkpoint (or HF model card) and build `deployment.extra_args` dynamically:
+
+```bash
+cat /config.json 2>/dev/null
+```
+
+| Field in `config.json` | What to set | Example |
+| --- | --- | --- |
+| `max_position_embeddings` | `--max-model-len ` | `131072` → `--max-model-len 131072` |
+| `auto_map` exists | `--trust-remote-code` | Only add if model has custom code |
+
+Then use WebSearch to check the model card (HuggingFace page) for deployment-specific settings:
+
+| Model card signal | What to set |
+| --- | --- |
+| Reasoning model (thinking/CoT) | `--reasoning-parser` and `--reasoning-parser-plugin` if a custom parser is provided |
+| Tool-calling support | `--enable-auto-tool-choice --tool-call-parser ` |
+| Custom vLLM flags documented | Add as specified (e.g., `--mamba_ssm_cache_dtype float32`) |
+
+Combine all detected flags into a single `deployment.extra_args` override. The recipe's default `--max-model-len 32768` is a fallback — always prefer the value from `config.json`.
+
 **Quantization-aware benchmark defaults:**
 
 When a quantized checkpoint is detected, read `references/quantization-benchmarks.md` for benchmark sensitivity rankings and recommended sets. Present recommendations to the user and ask which to include.
@@ -218,7 +249,13 @@ ssh "grep -E '^\s*machine\s+' ~/.config/enroot/.credentials 2>/dev/null"
 
 Print the following commands to the user. Propose to execute them in order to confirm the config works as expected before the full run.
 
-**Important**: Export required environment variables based on your config. If any tokens or keys are missing (e.g. `HF_TOKEN`, `NGC_API_KEY`, `api_key_name` from the config), ask the user to put them in a `.env` file in the project root so you can run `set -a && source .env && set +a` (or equivalent) before executing `nel run` commands.
+**Important**: Export required environment variables based on your config. If any tokens or keys are missing, point the user to `recipes/env.example` — it lists all possible keys with notes on which tasks need them. Ask the user to copy it, fill in their keys, and source it:
+
+```bash
+cp recipes/env.example .env
+# Edit .env with your keys
+set -a && source .env && set +a
+```
 
 ```bash
 # If using pre_cmd or post_cmd (review pre_cmd content before enabling — it runs arbitrary commands):
diff --git a/.claude/skills/evaluation/recipes/env.example b/.claude/skills/evaluation/recipes/env.example
new file mode 100644
index 00000000000..8d9b9bfa6d9
--- /dev/null
+++ b/.claude/skills/evaluation/recipes/env.example
@@ -0,0 +1,29 @@
+# Evaluation API Keys
+#
+# Copy this file and fill in the keys you need:
+#   cp recipes/env.example .env
+#   # Edit .env with your keys
+#   set -a && source .env && set +a
+#
+# Not all keys are required — only fill in what your tasks need.
+
+# Required for all tasks (model/dataset downloads)
+HF_TOKEN=hf_...
+
+# Required for nemo_skills.* tasks (dummy value, not a real key)
+DUMMY_API_KEY=dummy
+
+# Required for NEL pre_cmd execution
+NEMO_EVALUATOR_TRUST_PRE_CMD=1
+
+# --- Optional: task-specific keys ---
+
+# AIME 2025 (simple_evals variant only, not ns_aime2025)
+# JUDGE_API_KEY=
+
+# tau2_bench_telecom (LLM judge)
+# JUDGE_API_KEY_NVDEV_QWEN235B=
+
+# terminal-bench-hard (AWS sandbox)
+# AWS_ACCESS_KEY_ID=
+# AWS_SECRET_ACCESS_KEY=
diff --git a/.claude/skills/evaluation/recipes/examples/example_eval.yaml b/.claude/skills/evaluation/recipes/examples/example_eval.yaml
new file mode 100644
index 00000000000..77887b3f8c3
--- /dev/null
+++ b/.claude/skills/evaluation/recipes/examples/example_eval.yaml
@@ -0,0 +1,122 @@
+# Example: Quantization Validation Suite
+#
+# A balanced set of benchmarks for validating quantized model quality.
+# Copy this file and customize for your needs.
+# Task snippets in recipes/tasks/ define per-task configs — the agent
+# composes them into a runnable config like this one.
+#
+# Includes:
+#   - MMLU-Pro (knowledge, completions)
+#   - GPQA Diamond (reasoning, chat, 5 repeats)
+#   - LiveCodeBench v6 (code, chat, 3 repeats)
+#   - IFBench (instruction following, chat, 8 repeats)
+#
+# Usage:
+#   nel run --config recipes/examples/example_eval.yaml \
+#     -o deployment.checkpoint_path=/path/to/quantized/checkpoint \
+#     -o deployment.served_model_name=my-model-nvfp4 \
+#     -o execution.hostname= \
+#     -o execution.account= \
+#     -o execution.output_dir=/path/to/output
+#
+# For quantized checkpoints, also add the quantization flag:
+#   -o 'deployment.extra_args=--max-model-len 32768 --trust-remote-code --quantization modelopt_fp4'
+#
+# Run a single task:
+#   nel run --config ... -t ns_gpqa
+#
+# Smoke test (2 samples):
+#   nel run --config ... -o ++evaluation.nemo_evaluator_config.config.params.limit_samples=2
+defaults:
+  - execution: slurm/default
+  - deployment: vllm
+  - _self_
+execution:
+  hostname: ???
+  username: ${oc.env:USER}
+  account: ???
+  output_dir: ???
+  walltime: "04:00:00"
+  mounts:
+    mount_home: false
+deployment:
+  env_vars:
+    HF_TOKEN: host:HF_TOKEN
+  checkpoint_path: ???
+  hf_model_handle:
+  served_model_name: ???
+  tensor_parallel_size: 1
+  data_parallel_size: 1
+  # For models with custom code, add: --trust-remote-code
+  extra_args: --max-model-len 32768
+evaluation:
+  env_vars:
+    HF_TOKEN: host:HF_TOKEN
+  nemo_evaluator_config:
+    config:
+      params:
+        request_timeout: 3600
+        max_retries: 10
+        parallelism: 16
+    target:
+      api_endpoint:
+        api_key_name: DUMMY_API_KEY
+  tasks:
+    # Knowledge (chat endpoint, short)
+    - name: ns_mmlu_pro
+      nemo_evaluator_config:
+        config:
+          params:
+            extra:
+              num_repeats: 1
+              args: ++prompt_config=eval/aai/mcq-10choices-boxed ++inference.tokens_to_generate=null
+        target:
+          api_endpoint:
+            adapter_config:
+              params_to_remove:
+                - max_new_tokens
+                - max_completion_tokens
+
+    # Reasoning (chat endpoint, 5 repeats, short)
+    - name: ns_gpqa
+      nemo_evaluator_config:
+        config:
+          params:
+            extra:
+              args: ++prompt_config=eval/aai/mcq-4choices
+              num_repeats: 5
+        target:
+          api_endpoint:
+            adapter_config:
+              params_to_remove:
+                - max_new_tokens
+                - max_completion_tokens
+
+    # Code (chat endpoint, 3 repeats, medium)
+    - name: ns_livecodebench
+      nemo_evaluator_config:
+        config:
+          params:
+            extra:
+              dataset_split: test_v6_2408_2505
+              num_repeats: 3
+        target:
+          api_endpoint:
+            adapter_config:
+              params_to_remove:
+                - max_new_tokens
+                - max_completion_tokens
+
+    # Instruction following (chat endpoint, 8 repeats, super short)
+    - name: ns_ifbench
+      nemo_evaluator_config:
+        config:
+          params:
+            extra:
+              num_repeats: 8
+        target:
+          api_endpoint:
+            adapter_config:
+              params_to_remove:
+                - max_new_tokens
+                - max_completion_tokens
diff --git a/.claude/skills/evaluation/recipes/tasks/aime2025.yaml b/.claude/skills/evaluation/recipes/tasks/aime2025.yaml
new file mode 100644
index 00000000000..1cf5643f481
--- /dev/null
+++ b/.claude/skills/evaluation/recipes/tasks/aime2025.yaml
@@ -0,0 +1,19 @@
+# AIME 2025 (NeMo Skills, chat)
+# Primary metric: pass@1[avg-of-16] symbolic_correct
+# Run time: Long (reasoning models generate lengthy thinking traces) | Repeats: 16
+# Note: The AA variant (simple_evals.AIME_2025) requires JUDGE_API_KEY.
+# This NeMo Skills variant uses symbolic scoring — no external API keys needed.
+    - name: ns_aime2025
+      nemo_evaluator_config:
+        config:
+          params:
+            request_timeout: 100000
+            max_retries: 10
+            extra:
+              num_repeats: 16
+        target:
+          api_endpoint:
+            adapter_config:
+              params_to_remove:
+                - max_new_tokens
+                - max_completion_tokens
diff --git a/.claude/skills/evaluation/recipes/tasks/gpqa.yaml b/.claude/skills/evaluation/recipes/tasks/gpqa.yaml
new file mode 100644
index 00000000000..3692175d987
--- /dev/null
+++ b/.claude/skills/evaluation/recipes/tasks/gpqa.yaml
@@ -0,0 +1,16 @@
+# GPQA Diamond (NeMo Skills, chat)
+# Primary metric: pass@1[avg-of-5] symbolic_correct
+# Run time: Short | Repeats: 5
+    - name: ns_gpqa
+      nemo_evaluator_config:
+        config:
+          params:
+            extra:
+              args: ++prompt_config=eval/aai/mcq-4choices
+              num_repeats: 5
+        target:
+          api_endpoint:
+            adapter_config:
+              params_to_remove:
+                - max_new_tokens
+                - max_completion_tokens
diff --git a/.claude/skills/evaluation/recipes/tasks/ifbench.yaml b/.claude/skills/evaluation/recipes/tasks/ifbench.yaml
new file mode 100644
index 00000000000..46cbc2db085
--- /dev/null
+++ b/.claude/skills/evaluation/recipes/tasks/ifbench.yaml
@@ -0,0 +1,15 @@
+# IFBench (NeMo Skills, chat)
+# Primary metric: pass@1[avg-of-8] prompt_strict_accuracy
+# Run time: Super Short | Repeats: 8
+    - name: ns_ifbench
+      nemo_evaluator_config:
+        config:
+          params:
+            extra:
+              num_repeats: 8
+        target:
+          api_endpoint:
+            adapter_config:
+              params_to_remove:
+                - max_new_tokens
+                - max_completion_tokens
diff --git a/.claude/skills/evaluation/recipes/tasks/livecodebench.yaml b/.claude/skills/evaluation/recipes/tasks/livecodebench.yaml
new file mode 100644
index 00000000000..202387a1eb6
--- /dev/null
+++ b/.claude/skills/evaluation/recipes/tasks/livecodebench.yaml
@@ -0,0 +1,17 @@
+# LiveCodeBench v6 (NeMo Skills, chat)
+# Primary metric: pass@1[avg-of-3] accuracy
+# Run time: Medium | Repeats: 3
+    - name: ns_livecodebench
+      nemo_evaluator_config:
+        config:
+          params:
+            max_retries: 10
+            extra:
+              dataset_split: test_v6_2408_2505
+              num_repeats: 3
+        target:
+          api_endpoint:
+            adapter_config:
+              params_to_remove:
+                - max_new_tokens
+                - max_completion_tokens
diff --git a/.claude/skills/evaluation/recipes/tasks/mmlu_pro.yaml b/.claude/skills/evaluation/recipes/tasks/mmlu_pro.yaml
new file mode 100644
index 00000000000..be16a546a39
--- /dev/null
+++ b/.claude/skills/evaluation/recipes/tasks/mmlu_pro.yaml
@@ -0,0 +1,16 @@
+# MMLU-Pro (NeMo Skills, chat)
+# Primary metric: symbolic_correct
+# Run time: Short | Repeats: 1
+    - name: ns_mmlu_pro
+      nemo_evaluator_config:
+        config:
+          params:
+            extra:
+              num_repeats: 1
+              args: ++prompt_config=eval/aai/mcq-10choices-boxed ++inference.tokens_to_generate=null
+        target:
+          api_endpoint:
+            adapter_config:
+              params_to_remove:
+                - max_new_tokens
+                - max_completion_tokens
diff --git a/.claude/skills/evaluation/recipes/tasks/scicode.yaml b/.claude/skills/evaluation/recipes/tasks/scicode.yaml
new file mode 100644
index 00000000000..724b6935759
--- /dev/null
+++ b/.claude/skills/evaluation/recipes/tasks/scicode.yaml
@@ -0,0 +1,16 @@
+# SciCode (NeMo Skills, chat)
+# Primary metric: pass@1[avg-of-3] subtask_accuracy
+# Run time: Long | Repeats: 3
+    - name: ns_scicode
+      nemo_evaluator_config:
+        config:
+          params:
+            max_retries: 10
+            extra:
+              num_repeats: 3
+        target:
+          api_endpoint:
+            adapter_config:
+              params_to_remove:
+                - max_new_tokens
+                - max_completion_tokens
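
Illustration for the "Auto-detect deployment settings from checkpoint" text that this diff adds to SKILL.md: the `config.json` table there maps two fields to vLLM flags, with `--max-model-len 32768` as the fallback. The sketch below shows one way that mapping could look. It is not part of the change, and the names `build_extra_args`, `load_checkpoint_config` and `model_card_flags` are hypothetical, chosen only for this example.

```python
import json
from pathlib import Path

# Fallback named in SKILL.md; used only when config.json has no value.
FALLBACK_MAX_MODEL_LEN = 32768


def load_checkpoint_config(checkpoint_dir: str) -> dict:
    """Return the parsed config.json, or an empty dict if it is missing."""
    path = Path(checkpoint_dir) / "config.json"
    return json.loads(path.read_text()) if path.is_file() else {}


def build_extra_args(config: dict, model_card_flags: tuple = ()) -> str:
    """Hypothetical helper: apply the config.json table from SKILL.md.

    model_card_flags carries anything found on the model card (reasoning
    parser, tool-call parser, custom vLLM flags); it is appended unchanged.
    """
    # Prefer the checkpoint's own context length over the recipe default.
    max_len = config.get("max_position_embeddings", FALLBACK_MAX_MODEL_LEN)
    args = ["--max-model-len", str(max_len)]
    # The presence of auto_map signals custom modeling code in the checkpoint.
    if "auto_map" in config:
        args.append("--trust-remote-code")
    args.extend(model_card_flags)
    return " ".join(args)
```

The resulting string would be passed as a single override, in the same form as the usage comment in `example_eval.yaml`: `-o 'deployment.extra_args=...'`.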