Add pre-built evaluation recipes for common benchmarks #1357
**`recipes/env.example`** (new file)

```bash
# Evaluation API Keys
#
# Copy this file and fill in the keys you need:
#   cp recipes/env.example .env
#   # Edit .env with your keys
#   set -a && source .env && set +a
#
# Not all keys are required — only fill in what your tasks need.

# Required for all tasks (model/dataset downloads)
HF_TOKEN=hf_...

# Required for nemo_skills.* tasks (dummy value, not a real key)
DUMMY_API_KEY=dummy

# Required for NEL pre_cmd execution
NEMO_EVALUATOR_TRUST_PRE_CMD=1

# --- Optional: task-specific keys ---

# AIME 2025 (simple_evals variant only, not ns_aime2025)
# JUDGE_API_KEY=

# tau2_bench_telecom (LLM judge)
# JUDGE_API_KEY_NVDEV_QWEN235B=

# terminal-bench-hard (AWS sandbox)
# AWS_ACCESS_KEY_ID=
# AWS_SECRET_ACCESS_KEY=
```
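Assuming the keys above are filled in, a typical session exports them and then runs a cheap smoke test. This sketch only combines commands documented elsewhere in this PR; every path, model name, and Slurm value is a placeholder.

```bash
# Load keys from .env into the environment, then smoke-test the example
# suite with 2 samples per task (cheap end-to-end sanity check).
set -a && source .env && set +a

nel run --config recipes/examples/example_eval.yaml \
  -o deployment.checkpoint_path=/path/to/checkpoint \
  -o deployment.served_model_name=my-model \
  -o execution.hostname=<slurm_host> \
  -o execution.account=<slurm_account> \
  -o execution.output_dir=/path/to/output \
  -o ++evaluation.nemo_evaluator_config.config.params.limit_samples=2
```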
**`recipes/examples/example_eval.yaml`** (new file)

```yaml
# Example: Quantization Validation Suite
```
> **Collaborator:** NEL config YAML may change quite frequently. Is this YAML for demo purposes or for day-0 model evals (internal usage)? Also, some of the evals require a pinned eval Docker image and specific settings for an apples-to-apples comparison.

> **Author (contributor):** These recipes are for demo purposes. They are the task snippets that the agent composes into a working config; if NEL configs change and something breaks, the agent will diagnose and fix the incompatibility at runtime.

> **Author (contributor):** For evals that require pinned Docker images and specific settings for apples-to-apples comparison, users can override the container (see the sketch below).
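The diff itself does not show a container override, so the following is a hypothetical sketch only: `evaluation.container` and the image reference are assumed names used to illustrate the idea, not a verified NEL schema path. Consult the NEL config reference for the real key.

```bash
# Hypothetical sketch: pin the eval container for apples-to-apples runs.
# 'evaluation.container' and the image tag are placeholder names.
nel run --config recipes/examples/example_eval.yaml \
  -o evaluation.container=<pinned_eval_image:tag>
```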
The file continues:

```yaml
#
# A balanced set of benchmarks for validating quantized model quality.
# Copy this file and customize for your needs.
# Task snippets in recipes/tasks/ define per-task configs — the agent
# composes them into a runnable config like this one.
#
# Includes:
#   - MMLU-Pro (knowledge, completions)
#   - GPQA Diamond (reasoning, chat, 5 repeats)
#   - LiveCodeBench v6 (code, chat, 3 repeats)
#   - IFBench (instruction following, chat, 8 repeats)
#
# Usage:
#   nel run --config recipes/examples/example_eval.yaml \
#     -o deployment.checkpoint_path=/path/to/quantized/checkpoint \
#     -o deployment.served_model_name=my-model-nvfp4 \
#     -o execution.hostname=<slurm_host> \
#     -o execution.account=<slurm_account> \
#     -o execution.output_dir=/path/to/output
#
# For quantized checkpoints, also add the quantization flag:
#   -o 'deployment.extra_args=--max-model-len 32768 --trust-remote-code --quantization modelopt_fp4'
#
# Run a single task:
#   nel run --config ... -t ns_gpqa
#
# Smoke test (2 samples):
#   nel run --config ... -o ++evaluation.nemo_evaluator_config.config.params.limit_samples=2

defaults:
  - execution: slurm/default
  - deployment: vllm
  - _self_

execution:
  hostname: ???
  username: ${oc.env:USER}
  account: ???
  output_dir: ???
  walltime: "04:00:00"
  mounts:
    mount_home: false

deployment:
  env_vars:
    HF_TOKEN: host:HF_TOKEN
  checkpoint_path: ???
  hf_model_handle:
  served_model_name: ???
  tensor_parallel_size: 1
  data_parallel_size: 1
  # For models with custom code, add: --trust-remote-code
  extra_args: --max-model-len 32768
```
> **Contributor:** Can we let the agent decide the required `extra_args` based on the model card/config of the checkpoint? E.g., max-model-len, tool-call-parser, reasoning-parser, ...

> **Author (contributor):** I've added a section for auto-detecting deployment settings from the checkpoint.
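The added section is not shown in this diff. As a rough illustration only, auto-detection could start from the checkpoint's `config.json`; the field names below follow common Hugging Face conventions and are assumptions, not guaranteed for every model.

```bash
# Illustrative sketch only: derive candidate vLLM extra_args from a
# checkpoint's config.json. max_position_embeddings is a common HF
# field for context length; fall back to 32768 if absent.
CKPT=/path/to/checkpoint

MAX_LEN=$(jq -r '.max_position_embeddings // 32768' "${CKPT}/config.json")
EXTRA_ARGS="--max-model-len ${MAX_LEN}"

# An auto_map entry means the model ships custom code. Flag it for the
# operator instead of silently trusting it (see the security note at
# the end of this review).
if jq -e '.auto_map' "${CKPT}/config.json" >/dev/null; then
  echo "NOTE: checkpoint declares auto_map; --trust-remote-code would be needed, verify the repo first" >&2
fi

echo "suggested extra_args: ${EXTRA_ARGS}"
```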
The example config continues:

```yaml
evaluation:
  env_vars:
    HF_TOKEN: host:HF_TOKEN
  nemo_evaluator_config:
    config:
      params:
        request_timeout: 3600
        max_retries: 10
        parallelism: 16
    target:
      api_endpoint:
        api_key_name: DUMMY_API_KEY
  tasks:
    # Knowledge (chat endpoint, short)
    - name: ns_mmlu_pro
      nemo_evaluator_config:
        config:
          params:
            extra:
              num_repeats: 1
              args: ++prompt_config=eval/aai/mcq-10choices-boxed ++inference.tokens_to_generate=null
        target:
          api_endpoint:
            adapter_config:
              params_to_remove:
                - max_new_tokens
                - max_completion_tokens

    # Reasoning (chat endpoint, 5 repeats, short)
    - name: ns_gpqa
      nemo_evaluator_config:
        config:
          params:
            extra:
              args: ++prompt_config=eval/aai/mcq-4choices
              num_repeats: 5
        target:
          api_endpoint:
            adapter_config:
              params_to_remove:
                - max_new_tokens
                - max_completion_tokens

    # Code (chat endpoint, 3 repeats, medium)
    - name: ns_livecodebench
      nemo_evaluator_config:
        config:
          params:
            extra:
              dataset_split: test_v6_2408_2505
              num_repeats: 3
        target:
          api_endpoint:
            adapter_config:
              params_to_remove:
                - max_new_tokens
                - max_completion_tokens

    # Instruction following (chat endpoint, 8 repeats, super short)
    - name: ns_ifbench
      nemo_evaluator_config:
        config:
          params:
            extra:
              num_repeats: 8
        target:
          api_endpoint:
            adapter_config:
              params_to_remove:
                - max_new_tokens
                - max_completion_tokens
```
**`recipes/tasks/` snippet: AIME 2025 (new file)**

```yaml
# AIME 2025 (NeMo Skills, chat)
# Primary metric: pass@1[avg-of-16] symbolic_correct
# Run time: Long (reasoning models generate lengthy thinking traces) | Repeats: 16
# Note: The AA variant (simple_evals.AIME_2025) requires JUDGE_API_KEY.
# This NeMo Skills variant uses symbolic scoring — no external API keys needed.
- name: ns_aime2025
  nemo_evaluator_config:
    config:
      params:
        request_timeout: 100000
        max_retries: 10
        extra:
          num_repeats: 16
    target:
      api_endpoint:
        adapter_config:
          params_to_remove:
            - max_new_tokens
            - max_completion_tokens
```
**`recipes/tasks/` snippet: GPQA Diamond (new file)**

```yaml
# GPQA Diamond (NeMo Skills, chat)
# Primary metric: pass@1[avg-of-5] symbolic_correct
# Run time: Short | Repeats: 5
- name: ns_gpqa
  nemo_evaluator_config:
    config:
      params:
        extra:
          args: ++prompt_config=eval/aai/mcq-4choices
          num_repeats: 5
    target:
      api_endpoint:
        adapter_config:
          params_to_remove:
            - max_new_tokens
            - max_completion_tokens
```
**`recipes/tasks/` snippet: IFBench (new file)**

```yaml
# IFBench (NeMo Skills, chat)
# Primary metric: pass@1[avg-of-8] prompt_strict_accuracy
# Run time: Super Short | Repeats: 8
- name: ns_ifbench
  nemo_evaluator_config:
    config:
      params:
        extra:
          num_repeats: 8
    target:
      api_endpoint:
        adapter_config:
          params_to_remove:
            - max_new_tokens
            - max_completion_tokens
```
**`recipes/tasks/` snippet: LiveCodeBench v6 (new file)**

```yaml
# LiveCodeBench v6 (NeMo Skills, chat)
# Primary metric: pass@1[avg-of-3] accuracy
# Run time: Medium | Repeats: 3
- name: ns_livecodebench
  nemo_evaluator_config:
    config:
      params:
        max_retries: 10
        extra:
          dataset_split: test_v6_2408_2505
          num_repeats: 3
    target:
      api_endpoint:
        adapter_config:
          params_to_remove:
            - max_new_tokens
            - max_completion_tokens
```
**`recipes/tasks/` snippet: MMLU-Pro (new file)**

```yaml
# MMLU-Pro (NeMo Skills, chat)
# Primary metric: symbolic_correct
# Run time: Short | Repeats: 1
- name: ns_mmlu_pro
  nemo_evaluator_config:
    config:
      params:
        extra:
          num_repeats: 1
          args: ++prompt_config=eval/aai/mcq-10choices-boxed ++inference.tokens_to_generate=null
    target:
      api_endpoint:
        adapter_config:
          params_to_remove:
            - max_new_tokens
            - max_completion_tokens
```
**`recipes/tasks/` snippet: SciCode (new file)**

```yaml
# SciCode (NeMo Skills, chat)
# Primary metric: pass@1[avg-of-3] subtask_accuracy
# Run time: Long | Repeats: 3
- name: ns_scicode
  nemo_evaluator_config:
    config:
      params:
        max_retries: 10
        extra:
          num_repeats: 3
    target:
      api_endpoint:
        adapter_config:
          params_to_remove:
            - max_new_tokens
            - max_completion_tokens
```
> **Review bot:** Don't auto-enable `--trust-remote-code` based on `auto_map` without explicit user confirmation.
>
> In vLLM, `--trust-remote-code` allows execution of untrusted Python code from Hugging Face model repositories during model loading and tokenizer initialization. This poses severe security risks, including remote code execution (RCE): an attacker can craft a malicious model that runs arbitrary code on the host when loaded. Multiple advisories document the danger even when the flag is intended to be `False`:
>
> - CVE-2025-66448 / GHSA-8fr4-5q9j-m8gm: `auto_map` in `config.json` triggers `get_class_from_dynamic_module`, fetching and executing remote code despite `trust_remote_code=False` (fixed in v0.11.1+ with domain validation).
> - CVE-2026-27893 / GHSA-7972-pg2x-xr59: hardcoded `trust_remote_code=True` in model files (e.g., Nemotron-VL, Kimi-K25) overrides the user setting (fixed in v0.18.0+).
> - GHSA-2pc9-4j83-qjmr: unconditional `auto_map` resolution without a `trust_remote_code` check (fixed in v0.14.0).
>
> The flag defaults to `False` in both the `LLM` class and the `vllm serve` CLI, and official vLLM documentation recommends keeping it disabled by default. Automatically enabling it whenever `auto_map` exists in `config.json` has been exploited for RCE. Only enable it explicitly for trusted models that require custom code, after verifying the repository; run a patched, up-to-date vLLM and prefer models that need no remote code.
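A minimal sketch of the opt-in behavior the reviewer asks for, built on the `deployment.extra_args` override documented in this PR. The `ALLOW_REMOTE_CODE` variable is a hypothetical convention for illustration, not a NEL or vLLM setting.

```bash
# Hypothetical guard: pass --trust-remote-code only on explicit operator
# opt-in, never automatically because config.json contains auto_map.
EXTRA_ARGS="--max-model-len 32768"
if [ "${ALLOW_REMOTE_CODE:-0}" = "1" ]; then  # set deliberately, after vetting the repo
  EXTRA_ARGS="${EXTRA_ARGS} --trust-remote-code"
fi

nel run --config recipes/examples/example_eval.yaml \
  -o "deployment.extra_args=${EXTRA_ARGS}"
```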