Add pre-built evaluation recipes for common benchmarks#1357
Conversation
> **Note:** Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
> **Note:** Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the CodeRabbit settings.
📝 **Walkthrough**

Adds a benchmark-aware pre-built recipe path that, when a matching recipe exists, skips the interactive base-config flow and auto-detects deployment args from the checkpoint.
**Sequence Diagram(s)**

```mermaid
sequenceDiagram
    participant User as User
    participant Skill as Evaluation Skill
    participant Recipes as Recipes Store
    participant Checkpoint as Checkpoint Config Reader
    participant ModelCard as Model-card WebSearch
    participant ConfigGen as Config Generator
    participant Registry as Registry/Auth
    participant Runner as Evaluation Runner
    User->>Skill: Request benchmark run (e.g., "MMLU-Pro")
    Skill->>Recipes: Check for matching recipe in recipes/tasks/
    alt recipe found
        Recipes-->>Skill: Return recipe
        Skill->>User: Prompt to fill any required ??? fields
        Skill->>Checkpoint: Read checkpoint config.json (if provided)
        Checkpoint-->>Skill: Return detected settings (e.g., max_position_embeddings)
        Skill->>ModelCard: Query model-card WebSearch for flags/signals
        ModelCard-->>Skill: Return inferred flags (reasoning/tool-calling, vLLM flags)
        Skill->>ConfigGen: Merge recipe + checkpoint + model-card + user inputs -> final config
        ConfigGen-->>Skill: Return final config
        Skill->>Registry: Perform registry/auth checks
        Registry-->>Skill: Auth ok
        Skill->>Runner: Start evaluation run with final config
        Runner-->>User: Run started / results (async)
    else no recipe
        Skill->>User: Start interactive base-config build and confirmations
        User-->>Skill: Provide answers
        Skill->>ConfigGen: Build config from interactive inputs
        ConfigGen-->>Skill: Return final config
        Skill->>Registry: Perform registry/auth checks
        Registry-->>Skill: Auth ok
        Skill->>Runner: Start evaluation run with final config
        Runner-->>User: Run started / results (async)
    end
```
**Estimated code review effort:** 🎯 3 (Moderate) | ⏱️ ~20 minutes

**Pre-merge checks:** ✅ 6 passed checks (6 passed)
Force-pushed: edd67e1 → 00f37de
**Codecov Report:** ✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #1357      +/-   ##
==========================================
- Coverage   76.49%   76.48%   -0.01%
==========================================
  Files         471      471
  Lines       50487    50487
==========================================
- Hits        38622    38617       -5
- Misses      11865    11870       +5
```

Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
Actionable comments posted: 4
🧹 Nitpick comments (2)
.claude/skills/evaluation/recipes/tasks/aime2025.yaml (1)
**40-40:** Consider bounding request timeout to SLURM walltime.

On line 40, `request_timeout: 100000` (≈27.8 h) is far above the 4 h walltime and can delay failure signals for stuck calls. Aligning the timeout with the walltime (or slightly below) improves reliability diagnostics.

Proposed adjustment:

```diff
- request_timeout: 100000
+ request_timeout: 14400
```

🤖 Prompt for AI Agents
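As a back-of-envelope check of the suggested values, a SLURM walltime string can be converted into a bounded request timeout (a generic sketch, not NEL code; the 400 s safety buffer is an assumption):

```python
def bounded_timeout(walltime: str, buffer_s: int = 400) -> int:
    """Convert a SLURM HH:MM:SS walltime into a request timeout just below it."""
    hours, minutes, seconds = (int(part) for part in walltime.split(":"))
    return hours * 3600 + minutes * 60 + seconds - buffer_s

print(bounded_timeout("04:00:00"))  # 4 h walltime minus the 400 s buffer -> 14000
```

With a 4 h walltime this yields 14000, matching the value the AI-agent prompt below suggests.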
Verify each finding against the current code and only fix it if needed. In @.claude/skills/evaluation/recipes/tasks/aime2025.yaml at line 40, The request_timeout currently set to 100000 seconds greatly exceeds the SLURM walltime; change the YAML key request_timeout to a value at or slightly below the SLURM walltime (e.g., set request_timeout: 14000 to align with a 4h walltime minus a buffer) so stuck calls fail before the job walltime expires and diagnostic signals are timely.

.claude/skills/evaluation/recipes/tasks/gpqa.yaml (1)
**13-60:** Extract shared recipe scaffold to reduce drift.

This file repeats the same execution/deployment/evaluation baseline used across multiple new task recipes. Consider a shared base YAML (e.g., chat-base/completions-base) and keep only task-specific overrides here.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.claude/skills/evaluation/recipes/tasks/gpqa.yaml around lines 13 - 60, This recipe duplicates the common scaffolding under keys defaults, execution, deployment, and evaluation; extract that baseline into a shared base YAML (e.g., chat-base or completions-base) and make this file only contain task-specific overrides (keep the tasks list with name: ns_gpqa and its nemo_evaluator_config) by replacing the duplicated sections with a reference to the shared base (via YAML include/anchor or your repo's recipe inheritance mechanism) and only override checkpoint_path, hf_model_handle, served_model_name, extra_args, tensor_parallel_size/data_parallel_size, and the ns_gpqa-specific nemo_evaluator_config extras; ensure the merged result preserves env_vars HF_TOKEN and the adapter_config params_to_remove.
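A minimal sketch of what this could look like within a single YAML document (plain YAML anchors/merge keys only work inside one file, so a true shared base file would need loader support such as an include or recipe-inheritance mechanism; the key names here mirror the recipe fields quoted in this thread):

```yaml
deployment: &base_deployment        # shared scaffold, anchored for reuse
  tensor_parallel_size: 1
  data_parallel_size: 1
  extra_args: --max-model-len 32768

# task-specific section: merge the base, override only what differs
gpqa_deployment:
  <<: *base_deployment
  served_model_name: ???            # required field, filled per run
tasks:
  - name: ns_gpqa
```

The `<<:` merge key is YAML 1.1 behavior; whether NEL's recipe loader honors it is worth verifying before relying on this pattern.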
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.claude/skills/evaluation/recipes/examples/example_eval.yaml:
- Line 13: The commented usage command references the wrong filename; update the
comment line that currently calls
"recipes/examples/quantization_validation.yaml" so it instead references
"recipes/examples/example_eval.yaml" (the current file name) to avoid
file-not-found errors when users copy/paste the command; ensure the rest of the
command string remains unchanged.
In @.claude/skills/evaluation/recipes/tasks/livecodebench.yaml:
- Around line 8-12: Update the usage example to include the two required
overrides missing from the nel run command: add -o
execution.output_dir=/path/to/output (or a placeholder) and -o
deployment.served_model_name=<model_name> so the command supplies values for
execution.output_dir and deployment.served_model_name referenced in the YAML;
ensure the placeholders match the pattern used for other overrides (e.g.,
/path/to/output and <model_name>).
- Line 33: Remove --trust-remote-code from the default extra_args entry (the
line setting extra_args: --max-model-len 32768 --trust-remote-code) and instead
add an explicit, optional configuration flag (e.g., trust_remote_code: false or
enable_trust_remote_code: false) that users can opt into; update
livecodebench.yaml to use only --max-model-len in extra_args and document/read
the new trust_remote_code option before appending --trust-remote-code at runtime
when true. Apply the same change pattern to the other recipe files mentioned
(aime2025, gpqa, ifbench, mmlu, mmlu_pro, scicode) so none enable
--trust-remote-code by default and the code path that builds command-line args
checks the new trust_remote_code flag to conditionally append
--trust-remote-code.
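The conditional append described above could be sketched as follows (a hypothetical helper; `trust_remote_code` is the reviewer's proposed recipe key, not an existing NEL option):

```python
def build_extra_args(recipe: dict) -> str:
    """Append --trust-remote-code only when the recipe explicitly opts in."""
    args = recipe.get("extra_args", "")
    if recipe.get("trust_remote_code", False):  # disabled by default
        args = f"{args} --trust-remote-code".strip()
    return args

print(build_extra_args({"extra_args": "--max-model-len 32768"}))
# --max-model-len 32768
print(build_extra_args({"extra_args": "--max-model-len 32768", "trust_remote_code": True}))
# --max-model-len 32768 --trust-remote-code
```

The default of `False` keeps the flag off unless a user deliberately sets it, which is the behavior the review asks for.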
In @.claude/skills/evaluation/SKILL.md:
- Around line 43-44: The shortcut wording currently says to "skip Steps 2-5"
which incorrectly bypasses Step 3 (quantization/model handling) and Step 4
(filling required `???` fields); update the text around the pre-built recipe
shortcut so it only skips base-config/task-confirmation steps but explicitly
retains execution of model/quantization handling and placeholder/`???`
completion before jumping to Step 7.5/8; reference the recipe lookup
(recipes/tasks/) and explicitly call out running the model handling flow (Step
3) and placeholder filling (Step 4) when a recipe is used.
---
Nitpick comments:
In @.claude/skills/evaluation/recipes/tasks/aime2025.yaml:
- Line 40: The request_timeout currently set to 100000 seconds greatly exceeds
the SLURM walltime; change the YAML key request_timeout to a value at or
slightly below the SLURM walltime (e.g., set request_timeout: 14000 to align
with a 4h walltime minus a buffer) so stuck calls fail before the job walltime
expires and diagnostic signals are timely.
In @.claude/skills/evaluation/recipes/tasks/gpqa.yaml:
- Around line 13-60: This recipe duplicates the common scaffolding under keys
defaults, execution, deployment, and evaluation; extract that baseline into a
shared base YAML (e.g., chat-base or completions-base) and make this file only
contain task-specific overrides (keep the tasks list with name: ns_gpqa and its
nemo_evaluator_config) by replacing the duplicated sections with a reference to
the shared base (via YAML include/anchor or your repo's recipe inheritance
mechanism) and only override checkpoint_path, hf_model_handle,
served_model_name, extra_args, tensor_parallel_size/data_parallel_size, and the
ns_gpqa-specific nemo_evaluator_config extras; ensure the merged result
preserves env_vars HF_TOKEN and the adapter_config params_to_remove.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: ea292879-464b-4698-9c32-abacd70a6689
📒 Files selected for processing (9)
- .claude/skills/evaluation/SKILL.md
- .claude/skills/evaluation/recipes/examples/example_eval.yaml
- .claude/skills/evaluation/recipes/tasks/aime2025.yaml
- .claude/skills/evaluation/recipes/tasks/gpqa.yaml
- .claude/skills/evaluation/recipes/tasks/ifbench.yaml
- .claude/skills/evaluation/recipes/tasks/livecodebench.yaml
- .claude/skills/evaluation/recipes/tasks/mmlu.yaml
- .claude/skills/evaluation/recipes/tasks/mmlu_pro.yaml
- .claude/skills/evaluation/recipes/tasks/scicode.yaml
Force-pushed: 00f37de → 3d6039a → b805c9a
♻️ Duplicate comments (1)
.claude/skills/evaluation/SKILL.md (1)
**43-44:** ⚠️ Potential issue | 🟠 Major: Shortcut instruction still skips mandatory completion steps.

Line 43 still says to skip Steps 2–5, which bypasses Step 3 (model/quantization handling) and Step 4 (filling required `???` values), so recipe-based runs can remain incomplete.

Suggested wording update:

```diff
-**Shortcut: use a pre-built recipe.** If the user asks for a specific benchmark (e.g., "run MMLU-Pro", "evaluate with AIME"), check `recipes/tasks/` (relative to this skill's directory) for a matching recipe. Available: mmlu, mmlu_pro, gpqa, aime2025, livecodebench, ifbench, scicode. If found, skip Steps 2-5 — go directly to the recipe, fill in deployment overrides, and proceed to Step 7.5/8.
+**Shortcut: use a pre-built recipe.** If the user asks for a specific benchmark (e.g., "run MMLU-Pro", "evaluate with AIME"), check `recipes/tasks/` (relative to this skill's directory) for a matching recipe. Available: mmlu, mmlu_pro, gpqa, aime2025, livecodebench, ifbench, scicode. If found, skip Step 2 (base config build) and Step 5 (task confirmation), start from the matched recipe, then run Step 3 (model path + quantization detection) and Step 4 (fill remaining `???` values), and proceed to Step 7.5/8.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.claude/skills/evaluation/SKILL.md around lines 43 - 44, The current shortcut sentence ("If the user asks... skip Steps 2-5") incorrectly allows recipe-based runs to bypass mandatory Steps 3 (model/quantization handling) and 4 (filling required `???` values); update the wording so that when a matching recipe in recipes/tasks/ (e.g., mmlu, mmlu_pro, gpqa, ...) is used you still require completion of Steps 3 and 4 (and any other mandatory steps) before proceeding to Step 7.5/8 — keep the shortcut to locate and use the recipe but explicitly state that model/quantization selection (Step 3) and filling required placeholders (Step 4) must still be performed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In @.claude/skills/evaluation/SKILL.md:
- Around line 43-44: The current shortcut sentence ("If the user asks... skip
Steps 2-5") incorrectly allows recipe-based runs to bypass mandatory Steps 3
(model/quantization handling) and 4 (filling required `???` values); update the
wording so that when a matching recipe in recipes/tasks/ (e.g., mmlu, mmlu_pro,
gpqa, ...) is used you still require completion of Steps 3 and 4 (and any other
mandatory steps) before proceeding to Step 7.5/8 — keep the shortcut to locate
and use the recipe but explicitly state that model/quantization selection (Step
3) and filling required placeholders (Step 4) must still be performed.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: d1ca2c71-a047-4623-8b07-6feb01c655be
📒 Files selected for processing (9)
- .claude/skills/evaluation/SKILL.md
- .claude/skills/evaluation/recipes/examples/example_eval.yaml
- .claude/skills/evaluation/recipes/tasks/aime2025.yaml
- .claude/skills/evaluation/recipes/tasks/gpqa.yaml
- .claude/skills/evaluation/recipes/tasks/ifbench.yaml
- .claude/skills/evaluation/recipes/tasks/livecodebench.yaml
- .claude/skills/evaluation/recipes/tasks/mmlu.yaml
- .claude/skills/evaluation/recipes/tasks/mmlu_pro.yaml
- .claude/skills/evaluation/recipes/tasks/scicode.yaml
✅ Files skipped from review due to trivial changes (3)
- .claude/skills/evaluation/recipes/tasks/mmlu_pro.yaml
- .claude/skills/evaluation/recipes/tasks/aime2025.yaml
- .claude/skills/evaluation/recipes/examples/example_eval.yaml
🚧 Files skipped from review as they are similar to previous changes (4)
- .claude/skills/evaluation/recipes/tasks/ifbench.yaml
- .claude/skills/evaluation/recipes/tasks/livecodebench.yaml
- .claude/skills/evaluation/recipes/tasks/mmlu.yaml
- .claude/skills/evaluation/recipes/tasks/scicode.yaml
```yaml
tensor_parallel_size: 1
data_parallel_size: 1
# For models with custom code, add: --trust-remote-code
extra_args: --max-model-len 32768
```
Can we let the agent decide the extra_args required based on the model card/config of the checkpoint? e.g., model-len, tool-call-parser, reasoning-parser ...
I've added a section for auto-detecting deployment settings from checkpoint.
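The auto-detection could be sketched roughly like this (a hypothetical helper, not the skill's actual implementation; the field-to-flag mapping follows the `config.json` table discussed later in this thread, and `--trust-remote-code` is deliberately surfaced as a warning rather than enabled):

```python
import json
from pathlib import Path

def detect_deployment_args(checkpoint_dir: str) -> tuple[list[str], list[str]]:
    """Derive vLLM extra_args from an HF-style config.json.

    Never auto-enables --trust-remote-code; auto_map only produces a warning
    so the user can confirm model provenance first.
    """
    cfg = json.loads((Path(checkpoint_dir) / "config.json").read_text())
    args, warnings = [], []
    if "max_position_embeddings" in cfg:
        args += ["--max-model-len", str(cfg["max_position_embeddings"])]
    if "auto_map" in cfg:
        # auto_map implies custom model code; require explicit user confirmation.
        warnings.append("auto_map present: confirm provenance before adding --trust-remote-code")
    return args, warnings
```

Other signals mentioned in the thread (tool-call parser, reasoning parser) would come from the model card rather than `config.json`, so they are left out of this sketch.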
Signed-off-by: Kai Xu <kaix@nvidia.com>
Force-pushed: b805c9a → a702423
I've added an env.example with all possible API keys.
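For reference, a `.env` scaffold along those lines might look like this (only `HF_TOKEN` and `DUMMY_API_KEY` are named in this thread; any other keys in the actual `env.example` are not shown here):

```shell
# recipes/env.example — copy to .env, fill in values, then load with:
#   set -a && source .env && set +a
HF_TOKEN=          # Hugging Face token for gated checkpoints/datasets
DUMMY_API_KEY=     # placeholder key expected by the locally served endpoint
```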
Signed-off-by: Kai Xu <kaix@nvidia.com>
Force-pushed: a702423 → 0f608a8
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.claude/skills/evaluation/recipes/tasks/ifbench.yaml:
- Around line 8-14: The usage snippet in ifbench.yaml omits required auth
environment variables (HF_TOKEN and DUMMY_API_KEY) referenced elsewhere (the
recipe's deployment/serving steps around the blocks that read HF_TOKEN at lines
~29, DUMMY_API_KEY at ~39 and ~48); update the Usage example to show setting
these before running (either export HF_TOKEN=... and export DUMMY_API_KEY=... or
prefix the nel run command with HF_TOKEN=... DUMMY_API_KEY=...), so users have
the required credentials available when invoking the recipe.
In @.claude/skills/evaluation/SKILL.md:
- Around line 136-140: The doc currently implies automatically enabling
--trust-remote-code when auto_map exists; change this to explicitly warn against
auto-enabling and instruct readers to require explicit user confirmation before
setting --trust-remote-code. Update the table row referencing `auto_map` and
`--trust-remote-code` to state “Do not enable by default; require explicit
confirmation and verification of model provenance (trusted source)”, and add a
short advisory note referencing vLLM security best practices and the RCE risks
(auto_map) so maintainers/users must opt-in after verification.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: e8bee44d-b8a1-4aed-b7b5-83455132f183
📒 Files selected for processing (10)
- .claude/skills/evaluation/SKILL.md
- .claude/skills/evaluation/recipes/env.example
- .claude/skills/evaluation/recipes/examples/example_eval.yaml
- .claude/skills/evaluation/recipes/tasks/aime2025.yaml
- .claude/skills/evaluation/recipes/tasks/gpqa.yaml
- .claude/skills/evaluation/recipes/tasks/ifbench.yaml
- .claude/skills/evaluation/recipes/tasks/livecodebench.yaml
- .claude/skills/evaluation/recipes/tasks/mmlu.yaml
- .claude/skills/evaluation/recipes/tasks/mmlu_pro.yaml
- .claude/skills/evaluation/recipes/tasks/scicode.yaml
✅ Files skipped from review due to trivial changes (8)
- .claude/skills/evaluation/recipes/env.example
- .claude/skills/evaluation/recipes/examples/example_eval.yaml
- .claude/skills/evaluation/recipes/tasks/livecodebench.yaml
- .claude/skills/evaluation/recipes/tasks/scicode.yaml
- .claude/skills/evaluation/recipes/tasks/aime2025.yaml
- .claude/skills/evaluation/recipes/tasks/gpqa.yaml
- .claude/skills/evaluation/recipes/tasks/mmlu.yaml
- .claude/skills/evaluation/recipes/tasks/mmlu_pro.yaml
```yaml
# Usage:
# nel run --config recipes/tasks/ifbench.yaml \
#   -o deployment.checkpoint_path=/path/to/checkpoint \
#   -o execution.hostname=<slurm_host> \
#   -o execution.account=<slurm_account> \
#   -o execution.output_dir=/path/to/output \
#   -o deployment.served_model_name=<model_name>
```
Usage example misses required environment setup.
The command snippet omits HF_TOKEN/DUMMY_API_KEY setup even though this recipe depends on them (Lines 29, 39, 48). A direct copy-paste run can fail on auth.
Suggested usage block update:

```diff
 # Usage:
+# cp recipes/env.example .env
+# # Edit .env with required keys (HF_TOKEN, DUMMY_API_KEY, ...)
+# set -a && source .env && set +a
 # nel run --config recipes/tasks/ifbench.yaml \
 #   -o deployment.checkpoint_path=/path/to/checkpoint \
 #   -o execution.hostname=<slurm_host> \
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```yaml
# Usage:
# cp recipes/env.example .env
# # Edit .env with required keys (HF_TOKEN, DUMMY_API_KEY, ...)
# set -a && source .env && set +a
# nel run --config recipes/tasks/ifbench.yaml \
#   -o deployment.checkpoint_path=/path/to/checkpoint \
#   -o execution.hostname=<slurm_host> \
#   -o execution.account=<slurm_account> \
#   -o execution.output_dir=/path/to/output \
#   -o deployment.served_model_name=<model_name>
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.claude/skills/evaluation/recipes/tasks/ifbench.yaml around lines 8 - 14,
The usage snippet in ifbench.yaml omits required auth environment variables
(HF_TOKEN and DUMMY_API_KEY) referenced elsewhere (the recipe's
deployment/serving steps around the blocks that read HF_TOKEN at lines ~29,
DUMMY_API_KEY at ~39 and ~48); update the Usage example to show setting these
before running (either export HF_TOKEN=... and export DUMMY_API_KEY=... or
prefix the nel run command with HF_TOKEN=... DUMMY_API_KEY=...), so users have
the required credentials available when invoking the recipe.
| Field in `config.json` | What to set | Example |
| --- | --- | --- |
| `max_position_embeddings` | `--max-model-len <value>` | `131072` → `--max-model-len 131072` |
| `auto_map` exists | `--trust-remote-code` | Only add if model has custom code |
🧩 Analysis chain
🌐 Web query:
In vLLM, what are the security implications of using --trust-remote-code, and is it recommended to enable it by default based only on Hugging Face config.json auto_map?
💡 Result:
Using --trust-remote-code in vLLM allows execution of untrusted Python code from Hugging Face model repositories during model loading and tokenizer initialization. This poses severe security risks, including remote code execution (RCE), as attackers can craft malicious models that execute arbitrary code on the host system when loaded. Multiple CVEs highlight this danger, even when the flag is intended to be False:

- CVE-2025-66448 and GHSA-8fr4-5q9j-m8gm: auto_map in config.json triggers get_class_from_dynamic_module, fetching and executing remote code bypassing trust_remote_code=False (fixed in v0.11.1+ with domain validation).
- CVE-2026-27893 and GHSA-7972-pg2x-xr59: Hardcoded trust_remote_code=True in model files (e.g., Nemotron-VL, Kimi-K25) overrides the user setting (fixed in v0.18.0+).
- GHSA-2pc9-4j83-qjmr: Unconditional auto_map resolution without a trust_remote_code check (fixed in v0.14.0).

Official vLLM docs confirm the flag defaults to False in both the LLM class and the vllm serve CLI, explicitly for trusting remote code from Hugging Face. It is not recommended to enable --trust-remote-code by default, especially based solely on config.json auto_map, as this field has been exploited for RCE. Only enable it explicitly for trusted models requiring custom code, after verifying the repository. Use the latest vLLM version (e.g., v0.20.0 as of 2026-04-27) with patches applied, and prefer models without remote code needs.
Citations:
- 1: https://advisories.gitlab.com/pkg/pypi/vllm/CVE-2026-27893/
- 2: https://zeropath.com/blog/cve-2025-66448-vllm-rce-automap
- 3: https://www.thehackerwire.com/vllm-rce-via-trust-remote-code-bypass-cve-2026-27893/
- 4: GHSA-7972-pg2x-xr59
- 5: https://www.reddit.com/r/LocalLLaMA/comments/1s72zog/vllm_cve202627893_trustremotecodefalse_is/
- 6: https://docs.vllm.ai/en/v0.4.3/dev/offline_inference/llm.html
- 7: GHSA-2pc9-4j83-qjmr
- 8: GHSA-8fr4-5q9j-m8gm
- 9: https://www.github.com/vllm-project/vllm
- 10: https://docs.vllm.ai/en/stable/cli/serve.html
- 11: https://docs.vllm.ai/en/latest/configuration/serve_args/
Don't auto-enable --trust-remote-code based on auto_map without explicit user confirmation.
Automatically enabling this flag when auto_map exists in config.json creates a remote code execution (RCE) risk. The vLLM security advisories (CVE-2025-66448, CVE-2026-27893, GHSA-8fr4-5q9j-m8gm) document multiple instances where attackers exploited auto_map to execute arbitrary code during model loading. Official vLLM documentation explicitly recommends keeping this flag disabled by default. Only enable after explicit user confirmation and verification that the model is from a trusted source.
Suggested wording adjustment:

```diff
-| `auto_map` exists | `--trust-remote-code` | Only add if model has custom code |
+| `auto_map` exists | Ask user to explicitly confirm `--trust-remote-code` | Explain this allows execution of model-provided remote code; add only after user confirms and verifies model trustworthiness |
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| Field in `config.json` | What to set | Example |
| --- | --- | --- |
| `max_position_embeddings` | `--max-model-len <value>` | `131072` → `--max-model-len 131072` |
| `auto_map` exists | Ask user to explicitly confirm `--trust-remote-code` | Explain this allows execution of model-provided remote code; add only after user confirms and verifies model trustworthiness |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.claude/skills/evaluation/SKILL.md around lines 136 - 140, The doc currently
implies automatically enabling --trust-remote-code when auto_map exists; change
this to explicitly warn against auto-enabling and instruct readers to require
explicit user confirmation before setting --trust-remote-code. Update the table
row referencing `auto_map` and `--trust-remote-code` to state “Do not enable by
default; require explicit confirmation and verification of model provenance
(trusted source)”, and add a short advisory note referencing vLLM security best
practices and the RCE risks (auto_map) so maintainers/users must opt-in after
verification.
```yaml
target:
  api_endpoint:
    api_key_name: DUMMY_API_KEY
tasks:
```
What's the reason for keeping one yaml file for each task? Can we put them together and let AI agent compose the target set of benchmarks? Also, for other benchmarks like tau2? Can AI agent compose a working config without an example?
I agree we need to let the agent compose the target set of benchmarks. It's more flexible to run a single task directly or compose tasks into a suite by copying recipes/examples/example_eval.yaml. Keeping one monolithic working config may not be flexible, since some of its settings are not needed by all users.
What do you think if we only keep the tasks part in these tasks/<benchmark>.yaml files, since the other config should be the same across benchmarks? It will reduce token usage and keep the rest of the setup consistent across benchmarks.
This is a good suggestion. I've stripped each task file down to just the task config and created one shared base config.
Force-pushed: b3219f0 → 98f4a97
…ase config Signed-off-by: Kai Xu <kaix@nvidia.com>
Force-pushed: 98f4a97 → 4e5db92
The current default repeats are not enough; this will produce mostly noise. Suggestion: AIME → 64, GPQA → 10, LCB → 10.
```diff
@@ -0,0 +1,122 @@
+# Example: Quantization Validation Suite
```
NEL config yaml may change quite frequently. Is this yaml for demo purposes or for day-0 model evals (internal usage)?
Also, some of the evals require a pinned eval docker image and specific settings for apples-to-apples comparison.
These recipes are for demo purposes. They are task snippets that the agent composes into a working config. If NEL configs change and something breaks, the agent will diagnose and fix the incompatibility at runtime.
For evals that require pinned docker images and specific settings for apples-to-apples comparison, users can override the container.
I followed the NV_eval internal example recipe and agree that increasing repetitions improves robustness to noise. It might be useful to let users adjust the number of repetitions themselves if the current setting feels too noisy, similar to how Claude Code allows changing settings via chat.
Yeah, the current value is not useful for our dashboard; if we want to track PTQ loss, it will be covered by noise. So we should at least add a note for the agent to warn users, saying something like: "The default repeats will give benchmarks with large variation; it is used for quick verification of the model's ability. If you want precise benchmark values, I suggest using a much larger number of repeats, but that will take much longer to eval."
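The noise argument above is just the usual 1/√n scaling of the standard error of a mean (generic statistics, not tied to NEL internals):

```python
import math

def stderr_scale(repeats: int) -> float:
    """Relative standard error of the mean vs. a single repeat."""
    return 1.0 / math.sqrt(repeats)

print(round(stderr_scale(64), 3))  # 64 repeats (the AIME suggestion) shrink noise 8x -> 0.125
```

This is why 64 repeats for AIME (a small, high-variance benchmark) is a much stronger signal than the default.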
What does this PR do?
Type of change: ?
Usage
Testing
Before your PR is "Ready for review"
- Make sure you read and follow Contributor guidelines and your commits are signed (`git commit -s -S`).
- Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).
- CONTRIBUTING.md: ✅ / ❌ / N/A

Additional Information
Summary by CodeRabbit
New Features
Documentation