
Add pre-built evaluation recipes for common benchmarks #1357

Open

kaix-nv wants to merge 4 commits into main from kaix/eval-recipes

Conversation

@kaix-nv
Contributor

@kaix-nv kaix-nv commented Apr 27, 2026

What does this PR do?

Type of change: ?

  • Add pre-built NEL evaluation recipes for 7 common benchmarks, so users can evaluate checkpoints without building configs from scratch.
  • 7 task configs: MMLU, MMLU-Pro, GPQA Diamond, AIME 2025, LiveCodeBench v6, IFBench, SciCode

Usage

```text
# Natural-language request to the evaluation skill:
evaluate my checkpoint at <model_path> with MMLU-Pro on cluster <cluster_name>
```
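
Under the hood this resolves to a `nel run` invocation against the matching recipe. A sketch of the equivalent command, assuming the MMLU-Pro recipe takes the same overrides as the other task recipes in this PR (all values are placeholders):

```bash
# Illustrative only; the agent fills these overrides from the conversation.
nel run --config recipes/tasks/mmlu_pro.yaml \
  -o deployment.checkpoint_path=<model_path> \
  -o execution.hostname=<cluster_name> \
  -o execution.account=<slurm_account> \
  -o execution.output_dir=/path/to/output \
  -o deployment.served_model_name=<model_name>
```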

Testing

 I'll follow the workflow. Since you asked for MMLU-Pro specifically, let me check for a pre-built recipe first, and
  also check workspace + nel.

● Bash(echo "MODELOPT_WORKSPACE_ROOT=${MODELOPT_WORKSPACE_ROOT:-(unset)}"; nel --version 2>&1 | head -5; ls
      /home/scratch.kaix_coreai/workspace/trt_model_optimizer_dev…)
  ⎿  MODELOPT_WORKSPACE_ROOT=(unset)
     [I 2026-04-27T19:49:51.534] Centralized logging configured (console only) console_level=INFO log_dir=none (NEMO_EVALUATOR_LOG_DIR not set) logger=nemo_evaluator.logging.utils
     … +8 lines (ctrl+o to expand)

● Found mmlu_pro.yaml recipe. Let me read it and check the checkpoint quantization.

  Read 3 files (ctrl+o to expand)

● Let me check the checkpoint contents.

  Listed 1 directory (ctrl+o to expand)

● NVFP4 quantization detected (use --quantization modelopt_fp4). Let me check the model config and the existing YAML in
  this repo, then check cluster auth.

● Bash(ssh gcp-nrt-cs-001-login-002.nvidia.com "cat /home/kaix/experiments/cicd/cicd_1776029569/Nemotron3_Nano_30B_NVFP4_
      0/code/modules/Model-Optimizer/examples/llm_pt…)
  ⎿  {
         "architectures": [
     "NemotronHForCausalLM"
     … +42 lines (ctrl+o to expand)

  Read 1 file (ctrl+o to expand)

● Bash(ssh gcp-nrt-cs-001-login-002.nvidia.com "grep -E '^\s*machine\s+' ~/.config/enroot/.credentials 2>/dev/null; echo
      '---HOME---'; echo \$HOME" 2>&1)
  ⎿  machine gitlab-master.nvidia.com login kaix password glpat-<redacted>
     machine nvcr.io login $oauthtoken password <redacted>
     … +6 lines (ctrl+o to expand)

● There's already a complete config slurm_vllm_base_general_knowledge.yaml matching your request exactly. Let me check
  env setup, then run.

● HF_TOKEN is set. Existing config is at slurm_vllm_base_general_knowledge.yaml and matches your request exactly:
  - cluster gcp-nrt-cs-001-login-002.nvidia.com, account coreai_dlalgo_modelopt
  - NVFP4 quantization flag (--quantization modelopt_fp4)
  - task adlr_mmlu_pro_5_shot_base
  - NGC vLLM image (DockerHub creds aren't set on the cluster, but NGC is)

  Running dry-run, then submitting with limit_samples=2.

● Bash(nel run --config slurm_vllm_base_general_knowledge.yaml --dry-run 2>&1 | tail -60)
  ⎿  #   api_endpoint:
     #     model_id: NVIDIA-Nemotron-3-Nano-30B-A3B-BF16-NVFP4
     #     type: completions
     … +83 lines (ctrl+o to expand)

● Dry-run looks clean. Submitting the smoke test with limit_samples=2.

● Bash(nel run --config slurm_vllm_base_general_knowledge.yaml -o
      ++evaluation.nemo_evaluator_config.config.params.limit_samples=2 2>&1 | tail -40)
  ⎿  [I 2026-04-27T19:51:12.897] Using merged IRs total_tasks=421 internal_tasks=0 external_tasks=421 mapping_verifi
     ed=True
     [I 2026-04-27T19:51:12.897] Loaded tasks from packaged IRs num_tasks=421 mapping_verified=True
     … +42 lines (ctrl+o to expand)

● Smoke test submitted. SLURM job 307748, invocation 589584b3a4ee503c.

  - Config: slurm_vllm_base_general_knowledge.yaml (existing in workspace, already wired for this checkpoint)
  - Task: adlr_mmlu_pro_5_shot_base, limit_samples=2
  - vLLM image: nvcr.io/nvidia/vllm:26.03-py3, NVFP4 via --quantization modelopt_fp4 --trust-remote-code
  - Output: /home/kaix/experiments/nel/20260427_195112-589584b3a4ee503c/ on the cluster
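
For context, the serving side of that job corresponds to a vLLM invocation along these lines; a sketch assembled from the flags in the summary above, since the exact entrypoint inside the container isn't shown in the transcript:

```bash
# Illustrative reconstruction of the deployment command (paths are placeholders).
vllm serve /path/to/checkpoint \
  --quantization modelopt_fp4 \
  --trust-remote-code \
  --served-model-name NVIDIA-Nemotron-3-Nano-30B-A3B-BF16-NVFP4
```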

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅ / ❌ / N/A
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: ✅ / ❌ / N/A
  • Did you write any new necessary tests?: ✅ / ❌ / N/A
  • Did you update Changelog?: ✅ / ❌ / N/A

Additional Information

Summary by CodeRabbit

  • New Features

    • Added a “pre-built recipe” shortcut that runs benchmarks with fewer interactive prompts while keeping model auto-detection and still prompting for required fields.
    • Automatic deployment settings detection (sequence-length and remote-code trust) augmented by model-card signals and merged with recipe defaults.
    • Added seven pre-configured benchmark recipes and an example quantization evaluation recipe.
  • Documentation

    • Updated environment/token setup guidance to point to the provided example env file.

@copy-pr-bot

copy-pr-bot Bot commented Apr 27, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai Bot commented Apr 27, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Adds a benchmark-aware pre-built recipe path that, when a matching recipe exists, skips interactive base-config, auto-detects deployment args from checkpoint config.json and model-card signals, prompts for any required ??? fields, then proceeds to registry/auth and evaluation. Updates env guidance to recipes/env.example.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Skill instruction<br>`.claude/skills/evaluation/SKILL.md` | Implements a pre-built recipe shortcut: detect recipe in `recipes/tasks/`, bypass interactive base-config/task confirmation, run checkpoint config.json and model-card WebSearch to construct/augment deployment.extra_args, prompt user to fill remaining `???` fields, then perform registry/auth and evaluation. Updates Step 8 to reference recipes/env.example. |
| Example recipe<br>`.claude/skills/evaluation/recipes/examples/example_eval.yaml` | Adds an example Slurm + vLLM evaluation recipe with HF token passthrough, NeMo evaluator settings, and multiple registered benchmark tasks with per-task overrides. |
| Benchmark task recipes<br>`.claude/skills/evaluation/recipes/tasks/*.yaml`<br>(aime2025, gpqa, ifbench, livecodebench, mmlu, mmlu_pro, scicode) | Adds seven new benchmark recipes. Each defines Slurm execution defaults, vLLM deployment placeholders (including checkpoint_path and served_model_name), tensor/data-parallel settings, a default `--max-model-len 32768` in deployment.extra_args, NeMo evaluator parameters (timeouts/retries/parallelism), and task-specific overrides (dataset splits, repeats, adapter param removals). |
| Env example<br>`.claude/skills/evaluation/recipes/env.example` | Introduces an .env example documenting required tokens/keys (e.g., HF_TOKEN, DUMMY_API_KEY) and optional task-specific placeholders; instructs copying to .env or exporting for recipe runs (see the sketch after this table). |
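
A minimal sketch of what that env file plausibly looks like. Only HF_TOKEN and DUMMY_API_KEY are named in this PR; any other variable here is a hypothetical task-specific placeholder:

```bash
# Sketch of recipes/env.example. Only HF_TOKEN and DUMMY_API_KEY are named in
# this PR; JUDGE_API_KEY is a hypothetical task-specific placeholder.
# Copy to .env (or export directly) before running a recipe:
#   cp recipes/env.example .env && set -a && source .env && set +a
HF_TOKEN=hf_xxxxxxxxxxxx     # required: Hugging Face access for gated models/datasets
DUMMY_API_KEY=dummy          # required: placeholder key for the locally served endpoint
# JUDGE_API_KEY=...          # optional: uncomment for tasks that call an external judge
```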

Sequence Diagram(s)

```mermaid
sequenceDiagram
  participant User as User
  participant Skill as Evaluation Skill
  participant Recipes as Recipes Store
  participant Checkpoint as Checkpoint Config Reader
  participant ModelCard as Model-card WebSearch
  participant ConfigGen as Config Generator
  participant Registry as Registry/Auth
  participant Runner as Evaluation Runner

  User->>Skill: Request benchmark run (e.g., "MMLU-Pro")
  Skill->>Recipes: Check for matching recipe in `recipes/tasks/`
  alt recipe found
    Recipes-->>Skill: Return recipe
    Skill->>User: Prompt to fill any required `???` fields
    Skill->>Checkpoint: Read checkpoint `config.json` (if provided)
    Checkpoint-->>Skill: Return detected settings (e.g., max_position_embeddings)
    Skill->>ModelCard: Query model-card WebSearch for flags/signals
    ModelCard-->>Skill: Return inferred flags (reasoning/tool-calling, vLLM flags)
    Skill->>ConfigGen: Merge recipe + checkpoint + model-card + user inputs -> final config
    ConfigGen-->>Skill: Return final config
    Skill->>Registry: Perform registry/auth checks
    Registry-->>Skill: Auth ok
    Skill->>Runner: Start evaluation run with final config
    Runner-->>User: Run started / results (async)
  else no recipe
    Skill->>User: Start interactive base-config build and confirmations
    User-->>Skill: Provide answers
    Skill->>ConfigGen: Build config from interactive inputs
    ConfigGen-->>Skill: Return final config
    Skill->>Registry: Perform registry/auth checks
    Registry-->>Skill: Auth ok
    Skill->>Runner: Start evaluation run with final config
    Runner-->>User: Run started / results (async)
  end
```

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks: ✅ 6 passed

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title Check | ✅ Passed | The title accurately summarizes the main change: adding pre-built evaluation recipes for common benchmarks, which is the primary purpose of this PR. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues Check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes Check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Security Anti-Patterns | ✅ Passed | PR contains only YAML configuration, markdown documentation, and environment files with zero Python code changes, so security anti-patterns cannot manifest. |


Comment @coderabbitai help to get the list of available commands and usage tips.

@kaix-nv force-pushed the kaix/eval-recipes branch 2 times, most recently from edd67e1 to 00f37de on April 27, 2026 22:11
@codecov

codecov Bot commented Apr 27, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.48%. Comparing base (c07ac21) to head (4e5db92).

Additional details and impacted files
```diff
@@            Coverage Diff             @@
##             main    #1357      +/-   ##
==========================================
- Coverage   76.49%   76.48%   -0.01%
==========================================
  Files         471      471
  Lines       50487    50487
==========================================
- Hits        38622    38617       -5
- Misses      11865    11870       +5
```

| Flag | Coverage Δ |
| --- | --- |
| unit | 52.78% <ø> (ø) |

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

@kaix-nv marked this pull request as ready for review on April 28, 2026 03:22
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (2)
.claude/skills/evaluation/recipes/tasks/aime2025.yaml (1)

40-40: Consider bounding request timeout to SLURM walltime.

On Line 40, request_timeout: 100000 (≈27.8h) is far above the 4h walltime and can delay failure signals for stuck calls. Aligning timeout with walltime (or slightly below) improves reliability diagnostics.

Proposed adjustment:

```diff
-        request_timeout: 100000
+        request_timeout: 14400
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/evaluation/recipes/tasks/aime2025.yaml at line 40, The
request_timeout currently set to 100000 seconds greatly exceeds the SLURM
walltime; change the YAML key request_timeout to a value at or slightly below
the SLURM walltime (e.g., set request_timeout: 14000 to align with a 4h walltime
minus a buffer) so stuck calls fail before the job walltime expires and
diagnostic signals are timely.
.claude/skills/evaluation/recipes/tasks/gpqa.yaml (1)

13-60: Extract shared recipe scaffold to reduce drift.

This file repeats the same execution/deployment/evaluation baseline used across multiple new task recipes. Consider a shared base YAML (e.g., chat-base/completions-base) and keep only task-specific overrides here.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/evaluation/recipes/tasks/gpqa.yaml around lines 13 - 60, This
recipe duplicates the common scaffolding under keys defaults, execution,
deployment, and evaluation; extract that baseline into a shared base YAML (e.g.,
chat-base or completions-base) and make this file only contain task-specific
overrides (keep the tasks list with name: ns_gpqa and its nemo_evaluator_config)
by replacing the duplicated sections with a reference to the shared base (via
YAML include/anchor or your repo's recipe inheritance mechanism) and only
override checkpoint_path, hf_model_handle, served_model_name, extra_args,
tensor_parallel_size/data_parallel_size, and the ns_gpqa-specific
nemo_evaluator_config extras; ensure the merged result preserves env_vars
HF_TOKEN and the adapter_config params_to_remove.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.claude/skills/evaluation/recipes/examples/example_eval.yaml:
- Line 13: The commented usage command references the wrong filename; update the
comment line that currently calls
"recipes/examples/quantization_validation.yaml" so it instead references
"recipes/examples/example_eval.yaml" (the current file name) to avoid
file-not-found errors when users copy/paste the command; ensure the rest of the
command string remains unchanged.

In @.claude/skills/evaluation/recipes/tasks/livecodebench.yaml:
- Around line 8-12: Update the usage example to include the two required
overrides missing from the nel run command: add -o
execution.output_dir=/path/to/output (or a placeholder) and -o
deployment.served_model_name=<model_name> so the command supplies values for
execution.output_dir and deployment.served_model_name referenced in the YAML;
ensure the placeholders match the pattern used for other overrides (e.g.,
/path/to/output and <model_name>).
- Line 33: Remove --trust-remote-code from the default extra_args entry (the
line setting extra_args: --max-model-len 32768 --trust-remote-code) and instead
add an explicit, optional configuration flag (e.g., trust_remote_code: false or
enable_trust_remote_code: false) that users can opt into; update
livecodebench.yaml to use only --max-model-len in extra_args and document/read
the new trust_remote_code option before appending --trust-remote-code at runtime
when true. Apply the same change pattern to the other recipe files mentioned
(aime2025, gpqa, ifbench, mmlu, mmlu_pro, scicode) so none enable
--trust-remote-code by default and the code path that builds command-line args
checks the new trust_remote_code flag to conditionally append
--trust-remote-code.

In @.claude/skills/evaluation/SKILL.md:
- Around line 43-44: The shortcut wording currently says to "skip Steps 2-5"
which incorrectly bypasses Step 3 (quantization/model handling) and Step 4
(filling required `???` fields); update the text around the pre-built recipe
shortcut so it only skips base-config/task-confirmation steps but explicitly
retains execution of model/quantization handling and placeholder/`???`
completion before jumping to Step 7.5/8; reference the recipe lookup
(recipes/tasks/) and explicitly call out running the model handling flow (Step
3) and placeholder filling (Step 4) when a recipe is used.

---

Nitpick comments:
In @.claude/skills/evaluation/recipes/tasks/aime2025.yaml:
- Line 40: The request_timeout currently set to 100000 seconds greatly exceeds
the SLURM walltime; change the YAML key request_timeout to a value at or
slightly below the SLURM walltime (e.g., set request_timeout: 14000 to align
with a 4h walltime minus a buffer) so stuck calls fail before the job walltime
expires and diagnostic signals are timely.

In @.claude/skills/evaluation/recipes/tasks/gpqa.yaml:
- Around line 13-60: This recipe duplicates the common scaffolding under keys
defaults, execution, deployment, and evaluation; extract that baseline into a
shared base YAML (e.g., chat-base or completions-base) and make this file only
contain task-specific overrides (keep the tasks list with name: ns_gpqa and its
nemo_evaluator_config) by replacing the duplicated sections with a reference to
the shared base (via YAML include/anchor or your repo's recipe inheritance
mechanism) and only override checkpoint_path, hf_model_handle,
served_model_name, extra_args, tensor_parallel_size/data_parallel_size, and the
ns_gpqa-specific nemo_evaluator_config extras; ensure the merged result
preserves env_vars HF_TOKEN and the adapter_config params_to_remove.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ea292879-464b-4698-9c32-abacd70a6689

📥 Commits

Reviewing files that changed from the base of the PR and between 6e08b13 and 00f37de.

📒 Files selected for processing (9)
  • .claude/skills/evaluation/SKILL.md
  • .claude/skills/evaluation/recipes/examples/example_eval.yaml
  • .claude/skills/evaluation/recipes/tasks/aime2025.yaml
  • .claude/skills/evaluation/recipes/tasks/gpqa.yaml
  • .claude/skills/evaluation/recipes/tasks/ifbench.yaml
  • .claude/skills/evaluation/recipes/tasks/livecodebench.yaml
  • .claude/skills/evaluation/recipes/tasks/mmlu.yaml
  • .claude/skills/evaluation/recipes/tasks/mmlu_pro.yaml
  • .claude/skills/evaluation/recipes/tasks/scicode.yaml

@kaix-nv force-pushed the kaix/eval-recipes branch from 00f37de to 3d6039a on April 28, 2026 06:26
@kaix-nv requested review from cjluo-nv, meenchen, and mxinO on April 28, 2026 06:26
@kaix-nv force-pushed the kaix/eval-recipes branch from 3d6039a to b805c9a on April 28, 2026 06:31
Contributor

@coderabbitai coderabbitai Bot left a comment


♻️ Duplicate comments (1)
.claude/skills/evaluation/SKILL.md (1)

43-44: ⚠️ Potential issue | 🟠 Major

Shortcut instruction still skips mandatory completion steps.

Line 43 still says to skip Steps 2–5, which bypasses Step 3 (model/quantization handling) and Step 4 (filling required ??? values), so recipe-based runs can remain incomplete.

Suggested wording update:

```diff
-**Shortcut: use a pre-built recipe.** If the user asks for a specific benchmark (e.g., "run MMLU-Pro", "evaluate with AIME"), check `recipes/tasks/` (relative to this skill's directory) for a matching recipe. Available: mmlu, mmlu_pro, gpqa, aime2025, livecodebench, ifbench, scicode. If found, skip Steps 2-5 — go directly to the recipe, fill in deployment overrides, and proceed to Step 7.5/8.
+**Shortcut: use a pre-built recipe.** If the user asks for a specific benchmark (e.g., "run MMLU-Pro", "evaluate with AIME"), check `recipes/tasks/` (relative to this skill's directory) for a matching recipe. Available: mmlu, mmlu_pro, gpqa, aime2025, livecodebench, ifbench, scicode. If found, skip Step 2 (base config build) and Step 5 (task confirmation), start from the matched recipe, then run Step 3 (model path + quantization detection) and Step 4 (fill remaining `???` values), and proceed to Step 7.5/8.
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/evaluation/SKILL.md around lines 43 - 44, The current
shortcut sentence ("If the user asks... skip Steps 2-5") incorrectly allows
recipe-based runs to bypass mandatory Steps 3 (model/quantization handling) and
4 (filling required `???` values); update the wording so that when a matching
recipe in recipes/tasks/ (e.g., mmlu, mmlu_pro, gpqa, ...) is used you still
require completion of Steps 3 and 4 (and any other mandatory steps) before
proceeding to Step 7.5/8 — keep the shortcut to locate and use the recipe but
explicitly state that model/quantization selection (Step 3) and filling required
placeholders (Step 4) must still be performed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In @.claude/skills/evaluation/SKILL.md:
- Around line 43-44: The current shortcut sentence ("If the user asks... skip
Steps 2-5") incorrectly allows recipe-based runs to bypass mandatory Steps 3
(model/quantization handling) and 4 (filling required `???` values); update the
wording so that when a matching recipe in recipes/tasks/ (e.g., mmlu, mmlu_pro,
gpqa, ...) is used you still require completion of Steps 3 and 4 (and any other
mandatory steps) before proceeding to Step 7.5/8 — keep the shortcut to locate
and use the recipe but explicitly state that model/quantization selection (Step
3) and filling required placeholders (Step 4) must still be performed.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: d1ca2c71-a047-4623-8b07-6feb01c655be

📥 Commits

Reviewing files that changed from the base of the PR and between 00f37de and 3d6039a.

📒 Files selected for processing (9)
  • .claude/skills/evaluation/SKILL.md
  • .claude/skills/evaluation/recipes/examples/example_eval.yaml
  • .claude/skills/evaluation/recipes/tasks/aime2025.yaml
  • .claude/skills/evaluation/recipes/tasks/gpqa.yaml
  • .claude/skills/evaluation/recipes/tasks/ifbench.yaml
  • .claude/skills/evaluation/recipes/tasks/livecodebench.yaml
  • .claude/skills/evaluation/recipes/tasks/mmlu.yaml
  • .claude/skills/evaluation/recipes/tasks/mmlu_pro.yaml
  • .claude/skills/evaluation/recipes/tasks/scicode.yaml
✅ Files skipped from review due to trivial changes (3)
  • .claude/skills/evaluation/recipes/tasks/mmlu_pro.yaml
  • .claude/skills/evaluation/recipes/tasks/aime2025.yaml
  • .claude/skills/evaluation/recipes/examples/example_eval.yaml
🚧 Files skipped from review as they are similar to previous changes (4)
  • .claude/skills/evaluation/recipes/tasks/ifbench.yaml
  • .claude/skills/evaluation/recipes/tasks/livecodebench.yaml
  • .claude/skills/evaluation/recipes/tasks/mmlu.yaml
  • .claude/skills/evaluation/recipes/tasks/scicode.yaml

Contributor

@meenchen meenchen left a comment


Thanks @kaix-nv, could you share how to handle tasks like AA_LCR that require an API key?

```yaml
tensor_parallel_size: 1
data_parallel_size: 1
# For models with custom code, add: --trust-remote-code
extra_args: --max-model-len 32768
```
Contributor


Can we let the agent decide the extra_args required based on the model card/config of the checkpoint? e.g., model-len, tool-call-parser, reasoning-parser ...
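
For instance, a model card could translate into deployment flags like these (illustrative; the parser names are real vLLM options, but whether they apply depends on the model family):

```yaml
# Hypothetical agent-derived args; values depend on the checkpoint/model card.
extra_args: >-
  --max-model-len 131072
  --reasoning-parser deepseek_r1
  --tool-call-parser hermes
```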

Contributor Author


I've added a section for auto-detecting deployment settings from the checkpoint.

kaix-nv added 2 commits April 29, 2026 14:53
Signed-off-by: Kai Xu <kaix@nvidia.com>
Signed-off-by: Kai Xu <kaix@nvidia.com>
@kaix-nv force-pushed the kaix/eval-recipes branch from b805c9a to a702423 on April 29, 2026 21:53
@kaix-nv
Contributor Author

kaix-nv commented Apr 29, 2026

> Thanks @kaix-nv, could you share how to handle tasks like AA_LCR that require an API key?

I've added an env.example with all possible API keys.

Signed-off-by: Kai Xu <kaix@nvidia.com>
@kaix-nv force-pushed the kaix/eval-recipes branch from a702423 to 0f608a8 on April 29, 2026 21:57
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.claude/skills/evaluation/recipes/tasks/ifbench.yaml:
- Around line 8-14: The usage snippet in ifbench.yaml omits required auth
environment variables (HF_TOKEN and DUMMY_API_KEY) referenced elsewhere (the
recipe's deployment/serving steps around the blocks that read HF_TOKEN at lines
~29, DUMMY_API_KEY at ~39 and ~48); update the Usage example to show setting
these before running (either export HF_TOKEN=... and export DUMMY_API_KEY=... or
prefix the nel run command with HF_TOKEN=... DUMMY_API_KEY=...), so users have
the required credentials available when invoking the recipe.

In @.claude/skills/evaluation/SKILL.md:
- Around line 136-140: The doc currently implies automatically enabling
--trust-remote-code when auto_map exists; change this to explicitly warn against
auto-enabling and instruct readers to require explicit user confirmation before
setting --trust-remote-code. Update the table row referencing `auto_map` and
`--trust-remote-code` to state “Do not enable by default; require explicit
confirmation and verification of model provenance (trusted source)”, and add a
short advisory note referencing vLLM security best practices and the RCE risks
(auto_map) so maintainers/users must opt-in after verification.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e8bee44d-b8a1-4aed-b7b5-83455132f183

📥 Commits

Reviewing files that changed from the base of the PR and between b805c9a and a702423.

📒 Files selected for processing (10)
  • .claude/skills/evaluation/SKILL.md
  • .claude/skills/evaluation/recipes/env.example
  • .claude/skills/evaluation/recipes/examples/example_eval.yaml
  • .claude/skills/evaluation/recipes/tasks/aime2025.yaml
  • .claude/skills/evaluation/recipes/tasks/gpqa.yaml
  • .claude/skills/evaluation/recipes/tasks/ifbench.yaml
  • .claude/skills/evaluation/recipes/tasks/livecodebench.yaml
  • .claude/skills/evaluation/recipes/tasks/mmlu.yaml
  • .claude/skills/evaluation/recipes/tasks/mmlu_pro.yaml
  • .claude/skills/evaluation/recipes/tasks/scicode.yaml
✅ Files skipped from review due to trivial changes (8)
  • .claude/skills/evaluation/recipes/env.example
  • .claude/skills/evaluation/recipes/examples/example_eval.yaml
  • .claude/skills/evaluation/recipes/tasks/livecodebench.yaml
  • .claude/skills/evaluation/recipes/tasks/scicode.yaml
  • .claude/skills/evaluation/recipes/tasks/aime2025.yaml
  • .claude/skills/evaluation/recipes/tasks/gpqa.yaml
  • .claude/skills/evaluation/recipes/tasks/mmlu.yaml
  • .claude/skills/evaluation/recipes/tasks/mmlu_pro.yaml

Comment on lines +8 to +14
```yaml
# Usage:
#   nel run --config recipes/tasks/ifbench.yaml \
#     -o deployment.checkpoint_path=/path/to/checkpoint \
#     -o execution.hostname=<slurm_host> \
#     -o execution.account=<slurm_account> \
#     -o execution.output_dir=/path/to/output \
#     -o deployment.served_model_name=<model_name>
```
Contributor


⚠️ Potential issue | 🟡 Minor

Usage example misses required environment setup.

The command snippet omits HF_TOKEN/DUMMY_API_KEY setup even though this recipe depends on them (Lines 29, 39, 48). A direct copy-paste run can fail on auth.

Suggested usage block update:

```diff
 # Usage:
+#   cp recipes/env.example .env
+#   # Edit .env with required keys (HF_TOKEN, DUMMY_API_KEY, ...)
+#   set -a && source .env && set +a
 #   nel run --config recipes/tasks/ifbench.yaml \
 #     -o deployment.checkpoint_path=/path/to/checkpoint \
 #     -o execution.hostname=<slurm_host> \
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change:

```yaml
# Usage:
#   cp recipes/env.example .env
#   # Edit .env with required keys (HF_TOKEN, DUMMY_API_KEY, ...)
#   set -a && source .env && set +a
#   nel run --config recipes/tasks/ifbench.yaml \
#     -o deployment.checkpoint_path=/path/to/checkpoint \
#     -o execution.hostname=<slurm_host> \
#     -o execution.account=<slurm_account> \
#     -o execution.output_dir=/path/to/output \
#     -o deployment.served_model_name=<model_name>
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/evaluation/recipes/tasks/ifbench.yaml around lines 8 - 14,
The usage snippet in ifbench.yaml omits required auth environment variables
(HF_TOKEN and DUMMY_API_KEY) referenced elsewhere (the recipe's
deployment/serving steps around the blocks that read HF_TOKEN at lines ~29,
DUMMY_API_KEY at ~39 and ~48); update the Usage example to show setting these
before running (either export HF_TOKEN=... and export DUMMY_API_KEY=... or
prefix the nel run command with HF_TOKEN=... DUMMY_API_KEY=...), so users have
the required credentials available when invoking the recipe.

Comment on lines +136 to +140
| Field in `config.json` | What to set | Example |
| --- | --- | --- |
| `max_position_embeddings` | `--max-model-len <value>` | `131072` → `--max-model-len 131072` |
| `auto_map` exists | `--trust-remote-code` | Only add if model has custom code |
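
A sketch of the detection step this table describes, in shell form (illustrative; assumes `jq` is available and the checkpoint directory is known):

```bash
# Derive vLLM extra args from a checkpoint's config.json (sketch).
CKPT=/path/to/checkpoint
max_len=$(jq -r '.max_position_embeddings // empty' "$CKPT/config.json")
[ -n "$max_len" ] && EXTRA_ARGS="--max-model-len $max_len"
# Custom-code models: surface the fact, but do not auto-enable the flag
# (see the security discussion below).
if jq -e '.auto_map' "$CKPT/config.json" >/dev/null 2>&1; then
  echo "config.json contains auto_map; confirm before adding --trust-remote-code"
fi
```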

Contributor


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

In vLLM, what are the security implications of using --trust-remote-code, and is it recommended to enable it by default based only on Hugging Face config.json auto_map?

💡 Result:

Using --trust-remote-code in vLLM allows execution of untrusted Python code from Hugging Face model repositories during model loading and tokenizer initialization. This poses severe security risks, including remote code execution (RCE), as attackers can craft malicious models that execute arbitrary code on the host system when loaded. Multiple CVEs highlight this danger, even when the flag is intended to be False:

  • CVE-2025-66448 and GHSA-8fr4-5q9j-m8gm: auto_map in config.json triggers get_class_from_dynamic_module, fetching and executing remote code bypassing trust_remote_code=False (fixed in v0.11.1+ with domain validation).
  • CVE-2026-27893 and GHSA-7972-pg2x-xr59: Hardcoded trust_remote_code=True in model files (e.g., Nemotron-VL, Kimi-K25) overrides user setting (fixed in v0.18.0+).
  • GHSA-2pc9-4j83-qjmr: Unconditional auto_map resolution without trust_remote_code check (fixed in v0.14.0).

Official vLLM docs confirm the flag defaults to False in both the LLM class and the vllm serve CLI, explicitly for trusting remote code from Hugging Face. It is not recommended to enable --trust-remote-code by default, especially based solely on config.json auto_map, as this field has been exploited for RCE. Only enable explicitly for trusted models requiring custom code, after verifying the repository. Use the latest vLLM version (e.g., v0.20.0 as of 2026-04-27) with patches applied, and prefer models without remote code needs.

Don't auto-enable --trust-remote-code based on auto_map without explicit user confirmation.

Automatically enabling this flag when auto_map exists in config.json creates a remote code execution (RCE) risk. The vLLM security advisories (CVE-2025-66448, CVE-2026-27893, GHSA-8fr4-5q9j-m8gm) document multiple instances where attackers exploited auto_map to execute arbitrary code during model loading. Official vLLM documentation explicitly recommends keeping this flag disabled by default. Only enable after explicit user confirmation and verification that the model is from a trusted source.

Suggested wording adjustment:

```diff
-| `auto_map` exists | `--trust-remote-code` | Only add if model has custom code |
+| `auto_map` exists | Ask user to explicitly confirm `--trust-remote-code` | Explain this allows execution of model-provided remote code; add only after user confirms and verifies model trustworthiness |
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change:

| Field in `config.json` | What to set | Example |
| --- | --- | --- |
| `max_position_embeddings` | `--max-model-len <value>` | `131072` → `--max-model-len 131072` |
| `auto_map` exists | Ask user to explicitly confirm `--trust-remote-code` | Explain this allows execution of model-provided remote code; add only after user confirms and verifies model trustworthiness |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/evaluation/SKILL.md around lines 136 - 140, The doc currently
implies automatically enabling --trust-remote-code when auto_map exists; change
this to explicitly warn against auto-enabling and instruct readers to require
explicit user confirmation before setting --trust-remote-code. Update the table
row referencing `auto_map` and `--trust-remote-code` to state “Do not enable by
default; require explicit confirmation and verification of model provenance
(trusted source)”, and add a short advisory note referencing vLLM security best
practices and the RCE risks (auto_map) so maintainers/users must opt-in after
verification.
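
A minimal sketch of the opt-in gating both findings ask for, assuming a recipe-level trust_remote_code key (the key name is hypothetical; this PR does not define it):

```bash
# Build deployment args; append --trust-remote-code only on explicit opt-in.
EXTRA_ARGS="--max-model-len 32768"
if grep -qE '^[[:space:]]*trust_remote_code:[[:space:]]*true' recipe.yaml; then
  # User-confirmed opt-in only: loading model-provided code is an RCE risk.
  EXTRA_ARGS="$EXTRA_ARGS --trust-remote-code"
fi
echo "$EXTRA_ARGS"
```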

```yaml
target:
  api_endpoint:
    api_key_name: DUMMY_API_KEY
tasks:
```
Contributor


What's the reason for keeping one YAML file for each task? Can we put them together and let the AI agent compose the target set of benchmarks? Also, what about other benchmarks like tau2? Can the AI agent compose a working config without an example?

Contributor Author


I agree we need to let the agent compose the target set of benchmarks. Keeping one recipe per task makes it flexible to run a single task directly or to compose tasks into a suite by copying recipes/examples/example_eval.yaml. Keeping one monolithic working config would be less flexible, since some of its settings aren't needed by every user.

Contributor


What do you think about keeping only the tasks section in these tasks/<benchmark>.yaml files, since the other config should be the same across benchmarks? It would reduce token usage and keep the rest of the setup consistent across benchmarks.

Contributor Author


This is a good suggestion. I've stripped each task file down to just the task config and created one shared base config.

@kaix-nv force-pushed the kaix/eval-recipes branch from b3219f0 to 98f4a97 on April 30, 2026 00:21
…ase config

Signed-off-by: Kai Xu <kaix@nvidia.com>
@kaix-nv force-pushed the kaix/eval-recipes branch from 98f4a97 to 4e5db92 on April 30, 2026 00:30
@mxinO
Contributor

mxinO commented Apr 30, 2026

The current default number of repeats is not enough; it will produce mostly noise. Suggestion: AIME -> 64, GPQA -> 10, LCB -> 10.
Can it follow a command like "Eval GPQA with 10 repeats"?

```diff
@@ -0,0 +1,122 @@
# Example: Quantization Validation Suite
```
Collaborator


NEL config YAML may change quite frequently. Is this YAML for demo purposes or for day-0 model evals (internal usage)?

Also, some of the evals require a pinned eval Docker image and specific settings for an apples-to-apples comparison.

Contributor Author


These recipes are for demo purposes. They are task snippets that the agent composes into a working config. If NEL configs change and something breaks, the agent will diagnose and fix the incompatibility at runtime.

Contributor Author


For evals that require pinned Docker images and specific settings for an apples-to-apples comparison, users can override the container.

@kaix-nv
Contributor Author

kaix-nv commented Apr 30, 2026

> The current default number of repeats is not enough; it will produce mostly noise. Suggestion: AIME -> 64, GPQA -> 10, LCB -> 10. Can it follow a command like "Eval GPQA with 10 repeats"?

I followed the NV_eval internal example recipe and agree that increasing repetitions improves robustness to noise. It might be useful to let users adjust the number of repetitions themselves if the current setting feels too noisy, similar to how Claude Code allows changing repetitions via chat.

@mxinO
Contributor

mxinO commented May 1, 2026

> The current default number of repeats is not enough; it will produce mostly noise. Suggestion: AIME -> 64, GPQA -> 10, LCB -> 10. Can it follow a command like "Eval GPQA with 10 repeats"?
>
> I followed the NV_eval internal example recipe and agree that increasing repetitions improves robustness to noise. It might be useful to let users adjust the number of repetitions themselves if the current setting feels too noisy, similar to how Claude Code allows changing repetitions via chat.

Yeah, the current value is not useful for our dashboard; if we want to track PTQ loss, it will be covered by noise. So we should at least add a note for the agent to warn users, saying something like: "The default number of repeats gives benchmark results with large variation and is meant for quick verification of the model's ability; if you want precise benchmark values, use a much larger number of repeats, but that will take much longer to eval."
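
If the recipes expose repeats the same way they expose other evaluator params, a per-run override could look like this (the exact parameter path is an assumption; only limit_samples is demonstrated earlier in this PR's testing log):

```bash
# Hypothetical: raise GPQA repeats for a lower-variance run. The override
# syntax mirrors the limit_samples example above; the `repeats` key is assumed.
nel run --config recipes/tasks/gpqa.yaml \
  -o ++evaluation.nemo_evaluator_config.config.params.repeats=10
```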
