
[DISCLAIMER - EXPERIMENTAL][First iteration of MVP v0.1, shipping second iteration next week] [WIP] feat: agentic benchmark#1103

Closed
cquil11 wants to merge 196 commits into main from chore/agentx-integration

Conversation

cquil11 (Collaborator) commented Apr 20, 2026

DISCLAIMER : experimental

github-actions (Contributor)

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook.

If they are not, please create a PR there first so that we can merge your PR into the master branch. Let's ensure the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that all GitHub Action jobs fully pass after merging. Often, failures are just flakes, and re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors remain responsible for ensuring the jobs pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

If additional help is needed, PR authors can reach out to core maintainers over Slack.

- Add benchmarks/single_node/agentic/ with trace replay scripts for
  B200 FP4, H200 FP8, MI355X FP4/FP8
- Add utils/agentic-benchmark/ with metrics collector, analysis scripts,
  and Pareto frontier plotting
- Scripts reference utils/trace-replay (submodule) and
  utils/agentic-benchmark (support utilities)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
cquil11 and others added 18 commits April 20, 2026 16:13
Add scenario-type input to benchmark-tmpl.yml (default: fixed-seq-len).
When scenario-type is agentic-coding, SCENARIO_SUBDIR routes to
benchmarks/single_node/agentic/ instead of benchmarks/single_node/.
All 12 runner scripts updated to use ${SCENARIO_SUBDIR} in script paths.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
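The routing described in this commit can be sketched as a small shell helper. This is a hedged illustration, not the actual template logic: `scenario_subdir_for` is a hypothetical function name, and the real dispatch lives in benchmark-tmpl.yml.

```shell
# Sketch of the SCENARIO_SUBDIR routing: agentic-coding scenarios go to
# the agentic subdirectory, everything else to the default path.
# scenario_subdir_for is a hypothetical name for illustration only.
scenario_subdir_for() {
  local scenario_type="${1:-fixed-seq-len}"
  if [ "$scenario_type" = "agentic-coding" ]; then
    echo "benchmarks/single_node/agentic"
  else
    echo "benchmarks/single_node"
  fi
}
```

The runner scripts then interpolate the result into their script paths as `${SCENARIO_SUBDIR}`.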
Add dsr1-fp4-b200-vllm-agentic (nvidia) and dsr1-fp4-mi355x-vllm-agentic
(amd) with agentic-coding scenarios. Remove trace-source from
AgenticCodingConfig model (handled by scripts).

Ported from experimental multiturn-agentic-trace.yaml B200/MI355X DSR1
configs with cpu-offloading on/off variants.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Rename multiturn_* to dsr1_* in benchmarks/single_node/agentic/ and
update model-prefix from 'multiturn' to 'dsr1' in master configs so
runner script path construction works correctly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Extract common agentic functions to benchmark_lib.sh (resolve_trace_source,
  install_agentic_deps, start/stop metrics collector, build_replay_cmd,
  trim_idle_metrics)
- Refactor all 4 agentic scripts to use shared helpers
- Remove --max-ttft and --max-new-tokens-per-period from replay command
- Remove vLLM version check and commented-out config blocks
- Rename model-prefix from 'multiturn' to 'dsr1' in master configs
- Rename config keys from *-vllm-agentic to *-vllm
- Switch submodule branch from neon-trace-support to agentx

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add AgenticMatrixEntry Pydantic model with users, offload-mode,
  scenario-type fields
- Implement agentic-coding matrix generation in generate_full_sweep()
  and generate_test_config_sweep()
- Skip agentic entries in mark_eval_entries() (no eval support)
- Generates 46 entries for B200 + 22 for MI355X from the agentic configs

Matrix entries include scenario-type: agentic-coding which the benchmark
template uses to route to benchmarks/single_node/agentic/ scripts via
SCENARIO_SUBDIR.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add sweep-agentic job group in run-sweep.yml that dispatches agentic
  matrix entries to benchmark-tmpl.yml with scenario-type: agentic-coding
- Add offload-mode and total-cpu-dram-gb inputs to benchmark-tmpl.yml
- Add USERS, OFFLOAD_MODE, TOTAL_CPU_DRAM_GB env vars to template
- Route agentic entries to single_node['agentic'] in process_changelog.py
- Update ChangelogMatrixEntry to accept AgenticMatrixEntry in single_node

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Skip process_result.py and fixed-seq-len result file check for agentic
- Check status.txt for agentic scenario success/failure
- Add dedicated artifact upload step for agentic results (metrics CSV,
  detailed_results, debug_trace, workload distributions, etc.)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
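The status.txt check described above might look like the following. This is a minimal sketch under stated assumptions: the helper name, the file location, and the literal SUCCESS marker are guesses based on the commit message, not the workflow's actual code.

```shell
# Hedged sketch: agentic jobs are judged by a status.txt marker file
# instead of the fixed-seq-len result file. Path and marker string
# are assumptions for illustration.
agentic_job_passed() {
  local result_dir="$1"
  [ -f "$result_dir/status.txt" ] && grep -q SUCCESS "$result_dir/status.txt"
}
```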
Produces bmk_agentic_*.json artifacts matching the naming convention
of fixed-seq-len results. Includes:
- QPS statistics (mean, median, p90, p99, p99.9)
- Latency statistics (TTFT, TTLT, ITL with percentiles)
- Workload distribution (input/output token stats)
- KV cache hit rates (server-reported GPU/CPU and theoretical infinite)
- Throughput (total, per-GPU, input/output split)
- Request success counts

Wired into benchmark-tmpl.yml as "Process agentic result" step.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…en unavailable)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add agentic-config output to get-jobs step
- Filter agentic entries out of single-node bucket
- Add test-sweep-agentic job group with scenario-type routing

Enables running agentic benchmarks via:
  gh workflow run e2e-tests.yml --ref chore/agentx-integration \
    -f generate-cli-command='test-config --config-keys dsr1-fp4-b200-vllm --config-files .github/configs/nvidia-master.yaml --scenario-type agentic-coding'

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
vllm/vllm-openai:v0.19.1 is CUDA-only. MI355X needs
vllm/vllm-openai-rocm:v0.19.0.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use ${!var_name:-} to avoid 'unbound variable' error when scripts
use set -u (set -euo pipefail).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
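The `${!var_name:-}` fix is standard bash indirect expansion with a default. A minimal sketch (the `lookup_env` helper name is hypothetical; only the expansion syntax is from the commit):

```shell
# Under `set -u`, expanding an unset variable aborts the script, so the
# `:-` default is required when doing indirect lookup.
set -euo pipefail

lookup_env() {
  local var_name="$1"
  # ${!var_name:-} expands the variable *named by* var_name, falling
  # back to an empty string instead of tripping `set -u`.
  echo "${!var_name:-}"
}
```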
Agentic scripts expect RESULT_DIR but it wasn't set at the workflow
level. Fixed-seq-len scripts set it internally via the runner, but
agentic scripts need it from the environment.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The trace-replay submodule wasn't being checked out, causing
'requirements.txt not found' errors on all agentic jobs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add check_agentic_success() to benchmark_lib.sh — verifies
  detailed_results.csv has >0 successful requests before writing SUCCESS
- Add --max-consecutive-errors 10 to replay command
- Skip agentic entries in summarize.py to avoid KeyError on 'isl'
- Update all 4 agentic scripts to use check_agentic_success

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
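The success check could be sketched as below. This is an assumption-laden illustration: the real `check_agentic_success` lives in benchmark_lib.sh, and the CSV layout (a per-row "success" status column) is a guess from the commit message.

```shell
# Hedged sketch of check_agentic_success: require at least one
# successful request row in detailed_results.csv before declaring
# SUCCESS. The "success" marker in rows is an assumption.
check_agentic_success() {
  local csv="$1"
  local ok
  # Count data rows (header excluded) that mention "success".
  ok=$(awk -F, 'NR > 1 && $0 ~ /success/' "$csv" | wc -l)
  if [ "$ok" -gt 0 ]; then
    echo SUCCESS
  else
    echo FAILURE
  fi
}
```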
cquil11 and others added 6 commits April 20, 2026 19:23
Include kv_offload_bytes/time (gpu_to_cpu, cpu_to_gpu) and
cpu_kv_cache_usage_pct when available. All fields default to null
when offloading is not active.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove cpu-offloading: false for MI355X (vLLM engine crashes on
  DSR1 FP4 without offloading on ROCm)
- Add always() condition to agentic raw results upload so artifacts
  are captured even on failure

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This is an internal attribute of TestOrchestrator, not a CLI argument.
The trace replayer crashes with 'unrecognized arguments'.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
B200 DGXC-SLURM runners set MODEL to a local path like
/scratch/fsw/models/DeepSeek-R1-0528-FP4. hf download fails on
local paths with 'Invalid value. Repo id must be in the form...'

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…SLURM

The B200 DGXC-SLURM runner rewrites MODEL from HF repo ID to a local
path (e.g., nvidia/DeepSeek-R1-0528-FP4 -> /scratch/fsw/models/DeepSeek-R1-0528-FP4).
vLLM rejects this as an invalid HF repo ID. Skip the rewrite for
agentic-coding scenarios so vLLM can resolve the model from HF cache.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Prevents DeepSeek R1's <think> block from consuming the entire
max_tokens budget and producing invisible output tokens.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
cquil11 and others added 25 commits April 23, 2026 22:01
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Also revert temp mi355x 10-min duration.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ture

Adds a --debug-trace boolean input to e2e-tests.yml (workflow_dispatch +
workflow_call), forwards it to the agentic / multi-node-agentic child
jobs, and exposes it on benchmark-tmpl.yml + benchmark-multinode-tmpl.yml
as DEBUG_TRACE in the runner env. benchmark_lib.sh already converts
DEBUG_TRACE=true into the replayer's --debug-trace flag, and the
debug_trace.jsonl path is already in the agentic_* artifact upload list,
so no shell or upload-step changes are needed.

Bumps utils/trace-replay to 07a1d556 which extends --debug-trace to also
record prompt_token_ids (via apply_chat_template) and completion_token_ids
(via streaming logprobs) per request.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lets workflow_dispatch supply a duration in seconds that overrides
matrix.config.duration for both agentic child workflows. Empty default
preserves existing behavior.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bumps trace-replay to 2c84ea4a which removes the flag and the
Colors.disable() helper. No callers remain.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Submodule 3cd4c2f removes the Colors / ColoredFormatter machinery
(driven by the now-gone --no-color flag) and the qwen3-coder-only
MODEL_DEFAULTS table that no current sweep matches. Net -87 lines.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Submodule 943aaa6 removes users_added, rate_limit_events, and
admission_blocked_events from PeriodMetrics — no consumer reads them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Submodule d21f18d replaces the cherry-picked debug_chunks dict with
chunk.model_dump() so every SDK-exposed field of each streaming chunk
lands in debug_trace.jsonl. Needed to diagnose gpt-oss requests where
content/reasoning_content are None throughout but tokens are clearly
being generated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Submodule cef04ad removes the request-rate-limiting path entirely and
makes print_assessment emit one log record per line so each line gets
stamped with [time] LEVEL prefix consistently.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Submodule 29f353c adds delta_field_names_for(model) so streaming-chunk
parsing uses the right field per model. gpt-oss now resolves to
delta.reasoning instead of delta.reasoning_content, fixing the silent
0-output-tokens bug for reasoning-only responses.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Submodule 6efd07e fixes the long-standing ISL=2 metric for gpt-oss by
sourcing input_tokens from the server's usage.prompt_tokens instead of
the local apply_chat_template count, which gpt-oss's harmony chat
template breaks. DSR1/Qwen/etc. are functionally unchanged (their
templates render correctly anyway).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds an agentic-coding search-space to the existing
kimik2.5-fp4-b200-vllm config (TP=4 and TP=8, conc 1..128 at TP=8) and
a corresponding benchmarks/single_node/agentic/kimik2.5_fp4_b200.sh
launcher modeled on the kimi fixed-seq-len launcher (kimi_k2 reasoning
+ tool parsers, fp8 KV, allreduce-rms fusion, max-cudagraph-capture-size
2048) plus the agentic harness from gptoss_fp4_h100.sh (resolve traces,
install deps, OFFLOADING none/cpu switch, build_replay_cmd,
write_agentic_result_json, analyze_benchmark_distributions).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Matches the rename done on main in PR #1192 (commit 100a5ec). The
actual GitHub runner.name reports as 'b200-dgxc_NN', so
\${RUNNER_NAME%%_*} produces 'b200-dgxc' — the launcher must be named
launch_b200-dgxc.sh for the workflow's launcher-selection step to find
it. Without this rename, every job scheduled onto a b200-dgxc runner
fails immediately with "No such file or directory".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
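The launcher-selection behavior described in this commit comes down to one bash parameter expansion. A sketch (the `launcher_for` wrapper is hypothetical; the `%%_*` expansion is from the commit message):

```shell
# %% removes the longest matching suffix, so everything from the first
# underscore onward is dropped: 'b200-dgxc_17' -> 'b200-dgxc'.
launcher_for() {
  local runner_name="$1"
  echo "launch_${runner_name%%_*}.sh"
}
```

This is why the file must be named launch_b200-dgxc.sh: the expansion result is used verbatim to build the launcher path.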
Drops this PR's test-file diffs (-378 lines):
- utils/test_process_agentic_result.py (new in branch) — deleted
- utils/matrix_logic/test_generate_sweep_configs.py — reverted to main
- utils/matrix_logic/test_validation.py — reverted to main
- utils/test_process_result.py — reverted to main

Also drops 'branch = agentx-minimized' from .gitmodules so the submodule
pointer (the recorded commit SHA in the parent tree) is the only source
of truth — the parent doesn't follow the floating branch HEAD anymore.
The submodule still points at 3600d641, which is locked.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reverts the .gitmodules portion of 57bf46d. The recorded gitlink SHA
already pins to a specific commit; the branch hint is fine to keep so
git submodule update --remote works for intentional bumps.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Each gpt-oss agentic config previously enumerated [1,2,4,8,12,16,24,32,48,64]
under both offloading=none and offloading=cpu. At conc<32 the active KV
fits in HBM so cpu-offload only adds bookkeeping overhead — same answer
as none, double GPU time. Restructure to:

- offloading=none: low+mid concurrency [1, 2, 4, 8, 16, 32, 64]
- offloading=cpu : mid+high            [64, 96, 128, 192, 256]

Overlap at conc=64 captures the crossover. Cuts ~50% of jobs per sweep
across:
- amd-master.yaml: gptoss-fp4-mi300x-vllm, gptoss-fp4-mi325x-vllm
- nvidia-master.yaml: gptoss-fp4-h100-vllm, gptoss-fp4-h200-vllm

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
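The restructured search space can be written down as two concurrency lists that deliberately share one point. A sketch of the overlap check (the `overlap` helper is illustrative; the lists are taken from the commit message):

```shell
# Restructured gpt-oss concurrency lists from the commit message.
none_conc=(1 2 4 8 16 32 64)
cpu_conc=(64 96 128 192 256)

# The ranges overlap at conc=64 so the sweep still captures the
# none -> cpu crossover point.
overlap() {
  local a b
  for a in "${none_conc[@]}"; do
    for b in "${cpu_conc[@]}"; do
      if [ "$a" -eq "$b" ]; then echo "$a"; fi
    done
  done
}
```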
cquil11 and others added 2 commits April 27, 2026 16:54
Submodule 7ad6a9e adds the 'kimi' substring to _MODEL_DELTA_FIELDS so
Kimi-K2.5 reasoning tokens are recognized via delta.reasoning instead
of falling back to the (empty) reasoning_content default.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>