
[disclaimer: MVP/experimental] feat: agentic trace replay benchmark MVP v0.1#1201

Draft
cquil11 wants to merge 1 commit into main from chore/agentx-v0.1

Conversation


cquil11 (Collaborator) commented Apr 27, 2026

DISCLAIMER: Since AgentX is currently in its MVP v0.1 phase, everything here is subject to change at any time.

AgentX MVP v0.1, with a new and improved MVP coming every week until we hit v1.0, i.e., v0.2 in one week, v0.3 in two weeks, etc.

Note: there are approximately 4k lines of actual diff. The rest come from indentation refactors, backfill, etc.

Foreword: as a human reading this code, here are the files that are actually important to read and verify line by line:

  • some of the plumbing under .github/workflows
  • examples in benchmarks/
    • Please note these are experimental and may not be the optimal recipes for these workloads! They are just meant to give an example of what the recipes are supposed to look like in the context of these changes.
  • metrics collector stuff in utils/agentic-benchmark/bench
  • everything in utils/trace-replay (the submodule) -- this is the most important part, as it contains the core logic for trace replaying!!!

Everything else is largely not important and can be thought of as new age assembly.

This is not intended to be a final implementation; rather, it is an MVP. The goal is to merge this in as the foundation for the agentic trace replay benchmark and iterate fast as a community to converge on a final solution.

The dataset itself is expected to change as SemiAnalysis collects more representative production traces. Currently the traces are out of date and may inflate KV cache hit rate due to fewer subagent branches. Nevertheless, the general workload shapes are very similar (avg ISL / avg OSL, inter-turn latency, etc.), so this dataset can be used for testing in the meantime.

The benchmark has so far been publicly verified on:

  • MI300X GPT-OSS
  • MI325X GPT-OSS
  • H100 GPT-OSS
  • H200 GPT-OSS
  • MI355X DeepSeek R1
  • B200 DeepSeek R1

The actual trace replaying logic is contained in the trace replayer submodule. This points to a minimized branch of Callan Fox's (WEKA) https://github.com/callanjfox/kv-cache-tester -- specifically, it is this branch. It is undecided whether this will remain a submodule or be pulled directly into upstream. The code is relatively simple: it expects a set of trace JSON objects and replays them independently, subject to some level of concurrency. This code is subject to change as we iterate from this v0.1 version.
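For illustration, a single trace object might look roughly like the sketch below. These field names are hypothetical, not the submodule's actual schema:

```python
# Hypothetical shape for illustration only -- NOT the replayer's actual schema.
# A trace is one captured multi-turn conversation: prompt text replaced by
# hashed 64-token block IDs, plus the timing and lengths needed to replay it.
trace = {
    "trace_id": "3f9c0a…",
    "turns": [
        {
            "prompt_blocks": ["a1b2…", "c3d4…"],  # hashed 64-token blocks
            "output_tokens": 512,                 # original completion length
            "inter_turn_delay_s": 1.8,            # gap before this turn
        },
        # …more turns; context grows as turns accumulate, and subagent
        # turns may branch off with a fresh context window
    ],
}
```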

Full RFC

DRAFT: InferenceX AgentX RFC [MVP v0.1]

Introduction

InferenceX currently only has single-turn, "fixed sequence length" scenarios (8k1k, 1k1k). These scenarios all use random data and are isolated single-turn requests, so they do not benefit from prefix caching. Prefix caching is the default in production systems, as it significantly increases input token throughput and decreases TTFT in multi-turn scenarios by caching the computed key and value vectors for each request. Current InferenceX results can be thought of as a baseline, i.e., how efficiently chip X + engine Y can serve model Z (possibly with some extra optimizations like speculative decoding) with some fixed input/output tensor shapes (with some slight variance).

Relevant code: #1103

Agentic Coding Workloads

The current "baseline" benchmark is still useful, as results are a strong indication of the raw end-to-end capability of the chips + SW stacks. However, InferenceX will benefit from additional scenarios that benchmark specific real-world workloads such as agentic coding, which is the topic of this RFC. With the rise of coding harnesses such as Claude Code, Codex, Cursor, OpenCode, etc., agentic workflows have dominated. These workflows are characterized by:

  • Multi-turn, with significantly higher turn count per conversation when compared to basic multiturn chat (ChatGPT/Claude chatbot apps)
    • This is due to the agent harness acting as both the user and the assistant to "prompt itself" without user interaction
  • Long context that grows linearly with respect to turns
    • Direct consequence of multi-turn with high turn count – the user+assistant messages of each turn get appended to the end of the conversation and remain in the context window
    • Pulling large files into context, pasting stack traces, etc.
    • Long system/tool/MCP context prompts
    • Often 1M token context window limit
  • Extremely high KV cache hit rate
  • Tool use
    • The harness runs tools locally such as grep, searching the web, etc. that collect context to be sent to the LLM on subsequent turns
  • Many short gaps between turns
    • Given that many turns do not require human interaction and only wait for simple tool uses to complete, gaps are often quite short between turns
    • There are longer gaps where the human using the coding agent pauses between prompting
  • Subagent spawning
    • When one-off tasks are needed that do not require the context of the entire conversation, subagents are often spawned with a fresh context window to perform them
    • E.g., exploring a codebase, searching the web for a specific thing, etc.
    • Many of these often run in parallel
    • Context never gets pulled into main conversation lineage

Some (but not all) of the allowed optimizations/restrictions:

  • CPU KV cache offloading (vLLM connector, LMCache, SGLang HiCache, Dynamo KVBM, CPU DRAM P2P pooling, etc.); vendors can choose to enable or disable it at their discretion if disabling does better than enabling it. Other tiers of offloading, such as NVMe KV cache offloading, may follow in future iterations after AgentX v1
  • Any type of parallelism, wideEP
  • Native MTP
  • Any precision that is used in real world
  • For the agentic coding scenario, we will not allow decreasing the max model length – it must be set at the default max model length

As always, the goal of InferenceX is to highlight the most optimal inference configurations so long as they are actually employed in the real world. This agentic benchmark will shift closer to an end-to-end system benchmark rather than just a chip+kernel benchmark.

InferenceX Integration

To add this scenario to InferenceX, the main challenge is not the integration, but rather finding a dataset that accurately reflects production agentic workflows. These datasets are few and far between because they are extremely valuable, and not many are made publicly available.

Over the past month, in collaboration with AMD (and others in the community whose core engineers talk with us), we have hosted a proxy that captures internal Claude Code traces. So far, we have collected 36B tokens' worth of anonymized traces. To respect users' privacy, none of the raw conversation text is stored. Rather, prompts are tokenized by the proxy and then hashed into blocks of 64 tokens. The blocks are unique to each session/conversation. This allows traces to be replayed in a way that emulates KV cache reuse patterns without exposing the original conversation.
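A minimal sketch of this anonymization scheme, assuming a per-session salt and SHA-256; the proxy's actual tokenizer and hash function are not specified here:

```python
import hashlib

BLOCK_SIZE = 64  # tokens per block, per the proxy design above

def anonymize(prompt: str, session_salt: bytes, tokenizer) -> list[str]:
    """Tokenize a prompt and hash it into 64-token blocks.

    Salting per session makes blocks unique to each conversation, while
    identical prefixes within a session still hash to identical leading
    blocks -- so replay preserves KV-cache reuse patterns without storing
    any raw text.
    """
    ids = tokenizer.encode(prompt)
    return [
        hashlib.sha256(session_salt + repr(ids[i : i + BLOCK_SIZE]).encode()).hexdigest()
        for i in range(0, len(ids), BLOCK_SIZE)
    ]
```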

Although our sample size is still relatively small, this is the best way to collect and replay realistic production agentic coding traces.

There are many trace replayer implementations that could be used and, to be frank, which one is chosen matters little. At its core, a trace replayer just spawns clients, fills in anonymized token blocks with synthetic data, reconstructs multi-turn conversations as they were originally captured, sends those conversations to an API server, and records basic telemetry such as TTFT, TPOT, QPS, etc.
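To make that concrete, here is a minimal sketch of such a replayer against an OpenAI-compatible endpoint. This is illustrative only, not Callan Fox's implementation; the trace shape and the `fill_blocks` helper (which expands hashed blocks into text) are assumptions:

```python
import asyncio
import time
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="unused")

async def replay_trace(trace: dict, model: str, fill_blocks) -> list[dict]:
    """Replay one multi-turn conversation, recording TTFT and latency per turn."""
    messages, metrics = [], []
    for turn in trace["turns"]:
        messages.append({"role": "user", "content": fill_blocks(turn["prompt_blocks"])})
        start, ttft, parts = time.monotonic(), None, []
        stream = await client.chat.completions.create(
            model=model, messages=messages, stream=True)
        async for chunk in stream:
            if ttft is None:
                ttft = time.monotonic() - start
            if chunk.choices and chunk.choices[0].delta.content:
                parts.append(chunk.choices[0].delta.content)
        # Append the assistant reply so the next turn reuses the grown prefix.
        messages.append({"role": "assistant", "content": "".join(parts)})
        metrics.append({"ttft": ttft, "latency": time.monotonic() - start})
    return metrics

async def run(traces: list, model: str, fill_blocks, concurrency: int):
    """Spawn `concurrency` clients that drain a shared queue of traces."""
    queue: asyncio.Queue = asyncio.Queue()
    for t in traces:
        queue.put_nowait(t)

    async def worker() -> list[dict]:
        results: list[dict] = []
        while not queue.empty():
            results += await replay_trace(queue.get_nowait(), model, fill_blocks)
        return results

    return await asyncio.gather(*(worker() for _ in range(concurrency)))
```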

For InferenceX integration, we chose to use Callan Fox's (WEKA) replayer, which can be found here. We chose it because Callan has done a lot of work on this topic over the past year and has been using the replayer himself, and because it is simple. Beyond this, there is no legitimate reason this replayer should necessarily be chosen over another, or vice versa. Trace replayers are quite simple, especially now that the cost of coding custom solutions has gone to zero. This can be changed at any time.

To make the scope manageable, we will prioritize running this benchmark on the following models:

  • DeepSeek v4 (R1 until v4 comes out)
  • GLM
  • Kimi
  • MiniMax

Methodology

In this section we explain how we will run the agentic benchmark and collect results.

To obtain a Pareto curve of configurations, we will sweep the number of concurrent conversations. Each client replays one trace at a time; when a client finishes a trace, the trace is recycled into the queue of available traces. Traces are replayed at their original speed, including the time between turns but excluding the time the original response took to generate. We cap the maximum time between turns at 60 seconds, as sketched below.
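A minimal sketch of this client loop, assuming each turn record carries its original inter-turn delay (the `inter_turn_delay_s` field name is illustrative):

```python
import asyncio

MAX_INTER_TURN_GAP_S = 60.0  # cap on replayed inter-turn gaps, per the text above

async def client_loop(trace_queue: asyncio.Queue, replay_turn):
    """One client: replay whole traces back-to-back, recycling finished ones."""
    while True:
        trace = await trace_queue.get()
        for turn in trace["turns"]:
            # Honor the original gap between turns, capped at 60 s. The time the
            # original server took to respond is NOT replayed; we wait on the
            # benchmarked server's real response instead.
            await asyncio.sleep(min(turn["inter_turn_delay_s"], MAX_INTER_TURN_GAP_S))
            await replay_turn(trace, turn)
        trace_queue.put_nowait(trace)  # recycle into the pool of available traces
```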

The traces may (and often will) include subagents. Many of these can be launched in parallel, briefly pushing the total number of in-flight requests above the specified concurrency. The assumption for the benchmark is that all subagents are routed to the same model, regardless of which model the subagent was routed to in the original trace.

Some of the traces will have context length longer than the model's max_model_len. When this occurs, the conversations are truncated accordingly, e.g. as sketched below.
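One plausible truncation rule (an assumption; the replayer's actual policy may differ) is to stop replaying a conversation once its running context would exceed the limit:

```python
def truncate_to_context(turns: list, max_model_len: int, count_tokens) -> list:
    """Keep leading turns while the running context fits in max_model_len.

    `count_tokens` is an assumed helper returning how many tokens a turn
    adds to the context (prompt blocks + completion).
    """
    kept, total = [], 0
    for turn in turns:
        total += count_tokens(turn)
        if total > max_model_len:
            break
        kept.append(turn)
    return kept
```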

In order to simulate a steady-state start, the initial herd of conversations is started at a random point between 0% and 70% of each conversation's length. For warmup, we prefill all of these requests before beginning metrics collection. We collect metrics from the end of warmup (when the steady-state benchmark begins) to the end of the duration; we do not capture metrics during cooldown (while in-flight requests finish).
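A sketch of picking that initial start point, assuming turn-level granularity; the 0-70% range is from the methodology above:

```python
import random

def initial_start_turn(trace: dict, rng: random.Random) -> int:
    """Pick where in the conversation this client begins replaying.

    Starting the initial herd at random points between 0% and 70% of each
    conversation's length avoids every client prefilling turn 1 at once,
    which would not resemble a steady state.
    """
    return int(rng.uniform(0.0, 0.7) * len(trace["turns"]))
```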

For this MVP, the benchmarks have been run for 30 minutes, which has been showing steady-state results. We could potentially run for a shorter duration, but likely 15 minutes at the very minimum in order to run deep enough to reach steady state and trigger KV offloading (if enabled). Developers can also test with a shorter duration to get a proxy for how configurations perform, then run longer for the official submission. We will need to decide on official submission guidelines so that all submissions are on an even playing field. The main consideration here is that lower-concurrency runs see fewer requests by construction.

We will collect P50, P75, P90, and P99 metrics for TTFT, TPOT, etc. We can decide later what makes sense to advertise statistically on inferencex.com based on the sample sizes we observe.
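Computing those percentiles from the collected samples is straightforward; a sketch with NumPy:

```python
import numpy as np

def summarize(samples_ms: list[float]) -> dict[str, float]:
    """P50/P75/P90/P99 for a latency metric such as TTFT or TPOT (in ms)."""
    arr = np.asarray(samples_ms)
    return {f"p{p}": float(np.percentile(arr, p)) for p in (50, 75, 90, 99)}
```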

For this MVP, we are not using traces captured by our internal Claude Code proxy. This is because the proxy is still immature; we must iterate on it to collect a sufficient amount of quality traces. Luckily, WEKA has already provided some great traces, which can be found here. The HuggingFace dataset card shows the metrics for the traces in this dataset; the defining statistics are roughly the same as those we collected internally. The main "issue" with this dataset is that it is slightly out of date and has far fewer subagents spawned in the conversations. This ultimately causes higher cache hit rates because of fewer context branches (i.e., the conversations only grow and never split, leading to fewer prefill passes).

To be clear, we plan on using our internal traces soon. We are using the existing traces for this MVP so that we can iterate fast.

Outstanding Considerations

There are some outstanding things to consider when replaying traces:

How to fill in anonymized token blocks. There are a few options:

  • Use synthetic (random) data
    • Pros: easy and straightforward; will not affect MTP acceptance rates, since we plan on using the synthetic acceptance-rate features being implemented by vLLM, TRT-LLM, SGLang, etc.
      • vLLM maintainers have already implemented this in vLLM, and SGLang has an implementation too
      • add acceptance_threshold argument vllm-project/vllm#35355
      • [Model Runner V2] Enable forcing a specific acceptance rate during rejection sampling vllm-project/vllm#38045
    • Cons: models will produce gibberish output; unclear how this will affect EPLB, especially in wide-EP disagg scenarios where expert routing is quite important and can affect performance significantly
  • (What we chose for this MVP) Use synthetic data that models coherent English. The trace replayer currently has a vocabulary.py with pools of words related to different topics (coding, etc.) and sentence structures, and attempts to produce semi-coherent English sentences for the prompts (see the sketch after this list)
    • Pros: distribution of model outputs should be closer to the real world
    • Cons: slightly more difficult; still not clear that the model won't produce gibberish output; still unclear how this will affect EPLB
  • Use real data from an open-source dataset such as this one or something like SWE-Bench
    • Pros: almost certainly would balance expert activation more; closer to real Claude traces; model output will be coherent
    • Cons: quite difficult to implement; unclear what the benefit actually is
  • Just capture non-anonymized traces
    • Pros: none of the EPLB problems described above; completely accurate output distribution
    • Cons: capturing non-anonymized traces potentially jeopardizes the privacy of those who participate in trace collection efforts
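As a rough sketch of the vocabulary.py approach chosen for this MVP (the actual word pools and templates in the replayer differ):

```python
import random

# Illustrative pools -- the real vocabulary.py ships its own words/templates.
CODING_WORDS = ["function", "buffer", "tensor", "cache", "kernel", "thread"]
TEMPLATES = [
    "The {a} writes to the {b} before the {c} is flushed.",
    "Refactor the {a} so that the {b} reuses the {c}.",
]

def synthetic_sentence(rng: random.Random) -> str:
    """Fill a sentence template with topic words: semi-coherent English."""
    a, b, c = rng.sample(CODING_WORDS, 3)
    return rng.choice(TEMPLATES).format(a=a, b=b, c=c)

def fill_blocks(block_ids: list[str], tokens_per_block: int = 64) -> str:
    """Expand hashed block IDs into stable semi-coherent text.

    Seeding the RNG with the block ID makes the mapping deterministic, so
    identical blocks always expand to identical text and the original
    KV-cache reuse pattern is preserved.
    """
    parts = []
    for block_id in block_ids:
        rng = random.Random(block_id)
        words: list[str] = []
        while len(words) < tokens_per_block:  # crude word-as-token proxy
            words.extend(synthetic_sentence(rng).split())
        parts.append(" ".join(words[:tokens_per_block]))
    return " ".join(parts)
```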

Whether to follow the original traces' output token lengths exactly or let the model generate naturally

  • If the original output length is enforced, we must run the server with --ignore-eos, which causes gibberish output
  • If it is not enforced, we may see drastically different distributions depending on the model. We may also see different distributions run to run if temperature is not set to 0. We do not plan on running this benchmark at temperature 0, as this has been shown to artificially inflate throughput slightly

How do we ensure some level of determinism without encouraging gamification of the workload shapes?

  • Currently, the trace replayer replays conversations in a random order subject to a seed. To ensure fairness across different runs, this seeding is beneficial
    • Pros: determinism between runs; traces are replayed in the same order each time
    • Cons: potential for engineers to gamify the benchmark by analyzing the specific request order; scenarios with low concurrency will always see the same subset of requests – is that subset representative enough for low-concurrency runs?
  • Another option is to introduce some randomness between runs so that there is natural variability (both options are sketched below)
    • Pros: reduces the opportunity for gaming; more representative of the dataset as a whole, especially when we grow the dataset from ~750 traces to O(thousands) of traces
    • Cons: less determinism between runs and more variability between configs; opportunity to have a "good run" and a "bad run"; this will be more of a problem for low-concurrency scenarios where fewer overall requests are seen, while for higher-concurrency scenarios the law of large numbers should reduce the extent of the variance
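A sketch of the two seeding options, assuming the replayer shuffles a list of traces; logging the seed keeps even the random variant reproducible:

```python
import random
import time

def order_traces(traces: list, fixed_seed: int | None = None):
    """Shuffle replay order. A fixed seed gives run-to-run determinism;
    seeding from the clock gives natural variability between runs."""
    seed = fixed_seed if fixed_seed is not None else time.time_ns()
    order = list(traces)
    random.Random(seed).shuffle(order)
    return order, seed  # log the seed either way so any run can be replayed
```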

Is running for a fixed duration actually the correct way to do things?

  • This means that lower concurrency will see far fewer requests in total than higher concurrency scenarios by nature
  • It may be better to require a variable minimum number of requests based on concurrency, chosen to reach a specific confidence interval for a given SLA
  • How can we mathematically determine how many prompts to send at a given concurrency to reach a certain confidence level without introducing bias? (One standard starting point is sketched below.)
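One standard starting point (an assumption, not settled methodology for this benchmark) is the normal-approximation sample size for estimating a mean: n ≈ (z·σ/E)², where σ is the metric's standard deviation and E the acceptable margin of error. Note that tail percentiles like P99 need substantially more samples than this suggests:

```python
import math

def required_samples(sigma: float, margin: float, z: float = 1.96) -> int:
    """Normal-approximation sample size: n = (z * sigma / margin)^2.

    sigma:  estimated std dev of the metric (e.g. TTFT in ms)
    margin: acceptable half-width of the confidence interval, same units
    z:      z-score for the desired confidence level (1.96 ~= 95%)
    """
    return math.ceil((z * sigma / margin) ** 2)

# Example: sigma = 120 ms of TTFT spread, +/-10 ms margin at 95% confidence
# -> required_samples(120, 10) == 554 requests, independent of concurrency.
```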

Adds end-to-end agentic-coding benchmark infrastructure on top of the
existing fixed-seq-len harness. New components:

Trace replayer
- New utils/trace-replay submodule (kv-cache-tester @ agentx-minimized)
  driving multi-turn HF-dataset traces against any OpenAI-compatible
  endpoint at fixed concurrency.
- --debug-trace captures full per-request prompt/response, every
  streamed chunk via chunk.model_dump(), and integer token IDs
  (apply_chat_template prompt + logprobs.content completion) into
  debug_trace.jsonl.
- Per-model delta-field abstraction (gpt-oss → delta.reasoning, default
  → delta.reasoning_content) so reasoning-heavy responses are counted
  and appended to conversation history correctly.
- Input-token metric reads server's usage.prompt_tokens (authoritative)
  rather than the local apply_chat_template estimate which breaks for
  gpt-oss harmony's chat template.
- Per-user 8-token salt prefix on conversation[0] so two in-flight
  users replaying the same trace_id don't accidentally share KV-cache
  blocks.
- Period summary: counts up elapsed instead of down remaining; replaces
  the dispatch-jitter "Wait time" with the trace's true "Inter-turn
  time" sourced from RequestMetrics.delay_expected.
- 5s quiesce between warmup completion and metrics-collector start so
  warmup-tail prefill doesn't bleed into period 1.

Workflow plumbing
- e2e-tests.yml: workflow_dispatch + workflow_call inputs for
  debug-trace (boolean) and duration-override (string seconds), forwarded
  to test-sweep-agentic and test-sweep-multi-node-agentic jobs.
- benchmark-tmpl.yml + benchmark-multinode-tmpl.yml: debug-trace input
  mapped to DEBUG_TRACE env var; duration override threads through to
  matrix.config.duration.
- benchmark_lib.sh: build_replay_cmd / resolve_trace_source /
  install_agentic_deps / write_agentic_result_json helpers; consumes
  DEBUG_TRACE → --debug-trace.
- runners/launch_*.sh: shared agentic mode dispatch + scenario routing.
- runners/launch_b200-dgxc-slurm.sh → launch_b200-dgxc.sh rename to
  match the actual runner.name observed by the workflow.

Result aggregation
- utils/agentic-benchmark/{bench,analysis,scripts}: metrics collector
  (vllm/sglang Prometheus parsers), pareto plotter, per-config
  distribution analyzer, sweep aggregator.
- utils/process_agentic_result.py: per-job results.json builder.
- utils/matrix_logic: agentic-coding scenario plumbing in
  generate_sweep_configs.py + validation.py.

Examples (one per vendor)
- benchmarks/single_node/agentic/dsr1_fp4_b200.sh — NVIDIA.
- benchmarks/single_node/agentic/dsr1_fp4_mi355x.sh — AMD.
- Matching agentic-coding sections in nvidia-master.yaml
  (dsr1-fp4-b200-sglang) and amd-master.yaml (dsr1-fp4-mi355x-sglang).

All other model-specific launchers and matrix entries are deliberately
left out of this PR; downstream PRs add them on a per-model basis.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions (Contributor) commented:

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook.

If they are not, please create a PR there first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that all GitHub Actions jobs fully pass after merging. Often, failures are just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.
