From 4796980b1808cfb4e90dc93e202e5c9c7972fc7f Mon Sep 17 00:00:00 2001 From: Andrew Xia Date: Tue, 10 Feb 2026 16:16:30 -0800 Subject: [PATCH 1/7] initial commit Signed-off-by: Andrew Xia --- _posts/2026-02-15-responses-api.md | 229 +++++++++++++++++++++++++++++ 1 file changed, 229 insertions(+) create mode 100644 _posts/2026-02-15-responses-api.md diff --git a/_posts/2026-02-15-responses-api.md b/_posts/2026-02-15-responses-api.md new file mode 100644 index 00000000..8641ce20 --- /dev/null +++ b/_posts/2026-02-15-responses-api.md @@ -0,0 +1,229 @@ +--- +layout: post +title: "Enabling ResponsesAPI and MCP on vLLM" +author: "Meta" +image: /assets/figures/2026-02-03-dsr1-gb200/topline_comparison.png +redirect_from: + - /2026/02/03/dsr1-gb200.html +--- + +# TODO: edit this file + +# Introduction + +Building on our [previous work](https://blog.vllm.ai/2025/12/17/large-scale-serving.html) achieving 2.2k tok/s/H200 decode throughput with wide-EP, the vLLM team has continued performance optimization efforts targeting NVIDIA's GB200 platform. This blog details the key optimizations that enable vLLM to achieve **26.2K prefill TPGS (tokens per GPU second)** and **10.1K decode TPGS on GB200** using workload of **2K input tokens** and **2K output tokens** for DeepSeek-style MoE models including DeepSeek R1/V3/V3.1. And the above numbers are collected through a deployment with 4 prefill instances (each with 2 GB200) and 1 decode instance (with 8 GB200), all utilizing a combination of data-parallelism (DP) and expert-parallelism (EP). + +These gains are driven by a combination of new optimizations: + +**New Optimizations:** + +* Lower-precision operations ([NVFP4](https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/) GEMM, FP8 GEMM, NVFP4 MoE Dispatch) +* Kernel fusion (RoPE+Quant+Q write, RoPE+Quant, Concat K) +* Scaling down prefill via weight offloading +* Minimized chunking overheads + +**Previously Discussed Features:** + +* Async scheduling +* Prefill/decode disaggregated serving + +The combination of GB200's increased compute capability and these targeted optimizations results in a significant throughput improvement over H200 deployments. + +# Results + +The following benchmarks compare vLLM performance on GB200 versus H200 for DeepSeek-V3/R1 workloads using a fixed workload of 2K input tokens and 2K output tokens. Detailed deployment setup can be found in the following table. + +*![][topline_comparison]* + +| Deployment setup | H200 | GB200 | +| :---- | :---- | :---- | +| Prefill | 16 GPUs | 8 GPUs (4 instances x 2 GPUs) | +| Decode | 32 GPUs | 8 GPUs (1 instance x 8 GPUs) | + +The GB200's increased memory bandwidth (8 TB/s vs 4.8 TB/s), higher compute throughput through FP4, and NVLink-C2C interconnect between CPU and GPU all contribute to these gains. We maximized this potential by applying the optimizations detailed below. + +We also benchmarked the DeepSeek-V3/R1 decode throughput on GB200 for a range of standard workloads, maintaining the same parallelism setup while varying the decode batch size that fully utilizates GPU memory. + +Instructions for reproducing all benchmark results can be found [here](https://github.com/vllm-project/vllm/issues/33583). + +![][decode_throughput_various] + +# Key Optimizations + +## Lower-Precision Operations + +GB200 introduces significantly higher throughput for FP4 and FP8 operations compared to H200. vLLM leverages these capabilities through several precision optimizations. 
+ +### NVFP4 GEMM (MoE GEMMs, O-proj) + +DeepSeek-V3/R1 models can be quantized to FP4 precision for the MoE expert weights and output projection layers. vLLM integrates FlashInfer's TRTLLM-Gen GEMM kernels, which are specifically optimized for GB200's FP4 tensor cores. + +The FP4 checkpoint format stores weights in a packed 4-bit representation with per-group scaling factors. At runtime, the TRTLLM-Gen kernels dequantize on-the-fly within the tensor cores, achieving near-native FP4 throughput while maintaining model quality. + +Key implementation details: + +* FP4 weights with FP8 or FP16 scales stored in a packed format +* FlashInfer TRTLLM-Gen kernels optimized for GB200 tensor core scheduling +* Applied to MoE expert GEMMs and attention output projection (O-proj) + +### FP8 GEMM for MLA + +For DeepSeek's Multi-head Latent Attention (MLA), the query up-projection (from latent space to full query dimensions) benefits from FP8 quantization. Unlike the MoE layers where FP4 provides the best throughput/accuracy tradeoff, the attention projections are more sensitive to quantization and the accuracy benefits from FP8's higher precision. + +vLLM uses optimized FP8 GEMM kernels for these projections, achieving significant speedup over FP16 while maintaining attention quality. + +### NVFP4 MoE Dispatch + +Beyond the expert GEMMs themselves, the MoE dispatch operation—which routes tokens to their assigned experts—can also benefit from lower precision. vLLM implements NVFP4 dispatch, quantizing token activations to FP4 before the all-to-all communication. + +This reduces the all-to-all communication volume by 4x compared to FP16 dispatch, significantly decreasing inter-GPU communication latency in EP deployments. The quantization overhead is amortized across the communication savings, resulting in net throughput gains. + +## Kernel Fusion + +There are several kernel fusion strategies that reduce memory bandwidth consumption and kernel launch overhead by combining multiple operations into single GPU kernels. + +### RoPE \+ Quant \+ Q Write (Decode) + +During decode, the query projection requires: + +1. RoPE (Rotary Position Embedding) application +2. Quantization for the subsequent GEMM +3. Writing to the query buffer + +vLLM fuses these three operations into a single kernel, eliminating two intermediate memory round-trips. + +

+ +
![][rope_quant_fusion_timeline]
*RoPE+Quant+Q Write Fusion in Decode*

+ +### RoPE \+ Quant (Prefill) + +Similarly for prefill, RoPE application and quantization are fused. The prefill path handles larger token batches, making the memory bandwidth savings from fusion even more impactful. + +### Concat K Optimization + +For MLA key projections, vLLM implements an optimized concatenation operation using FlashInfer's `concat_mla_k` kernel. In DeepSeek's MLA architecture, the key tensor is composed of two parts: the non-positional embedding part (k\_nope, per-head) and the rotary positional embedding part (k\_rope, shared across all heads). These must be concatenated to form the full key tensor. + +The naive approach requires copying k\_nope and broadcasting k\_rope across all 128 heads, resulting in significant memory bandwidth consumption. FlashInfer's `concat_mla_k` kernel implements several optimizations: + +* **Warp-based processing**: Each warp handles one (token, head\_chunk) pair, processing 16 heads at a time +* **Vectorized memory access**: Uses 8-byte vector loads for nope data and 4-byte loads for rope data, maximizing memory throughput +* **Software pipelining with L2 prefetching**: Prefetches the next row while processing the current row, hiding memory latency +* **Register reuse for rope values**: Since rope is shared across all heads, it is loaded once into registers and written to all 16 heads in the chunk, avoiding redundant memory loads + +## Scaling Down Prefill + +### Why Scaling Down Makes Sense + +When considering GPU count for throughput-oriented inference serving, we typically scale out either to fit the model or to shard memory (experts, context) to increase batch size. However, for prefill workloads that are already compute-bounded, reducing GPU count can actually improve throughput by reducing communication overhead. + +Our microbenchmarks show that MLA backend throughput performance starts plateauing when batch size increases from 16K to 64K tokens. Beyond 64K tokens, MoE throughput gains are also negligible. This means we can saturate compute utilization with a batch size that fits in a 2-GPU serving setup. + +

+ + +
![][mla_trtllm_ragged_prefill_prefill]
![][moe_flashinfer_trtllm_nvfp4_prefill]
*MLA and MoE throughput plateau at ~64K batch size*

+ +By reducing GPU count from 4 to 2, we halve the NCCL collectives (all\_gather and reduce\_scatter) for EP communication, significantly reducing communication overhead. + +

+ + +
![][nccl_all_gather]
![][nccl_reduce_scatter]
*Reducing EP degree halves communication overhead*

+ +### Weight Offloading v2 + +To reduce GPU memory footprint while maintaining performance, vLLM implements weight offloading v2 with asynchronous prefetching. This v2 implementation was inspired by the offloading approach in [SGLang prefill](https://github.com/sgl-project/sglang/pull/8034) and now adapted for additional compatibility with torch.compile and CUDA graph within vLLM. + +In vLLM weight offloading v1, offloaded weights stayed on CPU and were accessed via Unified Virtual Addressing (UVA), which incurs slow PCIe transfer delays. This was intended as a last resort for running models with limited GPU resources. + +Weight offloading v2 takes a different approach: it explicitly copies (onloads) weights to GPU in advance. The key innovation is onloading the weights of the next layer asynchronously on a separate CUDA stream. By carefully overlapping weight onloading with kernel execution, the onloading delay can be completely hidden. + +Users configure offloading via group-based selection: +![][layer_group] + +* `group_size`: Group every N layers together +* `num_in_group`: Offload this many layers per group (last N of each group) +* `prefetch_step`: Number of layers to prefetch ahead + +For DeepSeek-R1 prefill serving, we offload one of every two MoE GEMM weights, achieving significant memory savings while maintaining full throughput. + +

+ +
![][onloading_trace]
*Trace showing weight onload overlapping with layer execution*

+ +GB200's NVLink-C2C connection between CPU and GPU makes weight offloading v2 particularly effective, as the loading latency is minimized compared to PCIe-based systems. + +## Minimize Chunking Overheads + +Large batch processing in MoE models requires chunking to fit within GPU memory constraints. However, smaller chunks introduce overhead from repeated kernel launches and synchronization, creating GPU bubbles. vLLM provides chunk size configuration options to maximize throughput while staying within memory limits. + +### MoE DP Chunk + +When using Data Parallel with Expert Parallel (DP+EP), tokens are dispatched from each DP rank in coordinated chunks. The `VLLM_ENABLE_MOE_DP_CHUNK` flag (enabled by default) enables this chunking behavior. + +Larger chunk sizes reduce GPU bubbles by amortizing dispatch/combine overhead across more tokens. The chunk size is controlled by `VLLM_MOE_DP_CHUNK_SIZE` (default: 256 tokens). Increasing this value improves throughput by reducing synchronization frequency. + +For GB200, we disable MoE DP chunking (`VLLM_ENABLE_MOE_DP_CHUNK=0`) for prefill and set `VLLM_MOE_DP_CHUNK_SIZE` to match the batch size for decode. + +### MoE Activation Chunk + +For large prefill batches, vLLM chunks activation tensors to process subsets of tokens through the MoE layers. The `VLLM_ENABLE_FUSED_MOE_ACTIVATION_CHUNKING` flag controls this behavior (enabled by default). + +Larger chunk sizes improve throughput by reducing launch overhead and providing sufficient work to fully utilize GPU compute. The chunk size is controlled by `VLLM_FUSED_MOE_CHUNK_SIZE` (default: 16K tokens). The optimal setting maximizes chunk size within available GPU memory. + +For GB200, we disable activation chunking (`VLLM_ENABLE_FUSED_MOE_ACTIVATION_CHUNKING=0`) to maximize throughput, as the larger memory capacity accommodates full batches without chunking. + +### Output Processing Chunk + +In the V1 engine's async serving path, output processing (logit computation, sampling, response generation) is chunked. The `VLLM_V1_OUTPUT_PROC_CHUNK_SIZE` controls the number of outputs processed per iteration (default: 128). + +Larger chunk sizes improve overall throughput by reducing per-chunk overhead. However, for streaming workloads, very large chunks may increase inter-message latency variance. For throughput-optimized decode on GB200, we set the chunk size to 2048\. + +# Future Work + +The vLLM team is actively working on the following improvements for GB200 deployments: + +1. **Improving load balancedness and scaling up EP**: Extending expert load balancing to handle larger EP degrees and more dynamic workloads, with improved rebalancing algorithms. +2. **Optimizing MoE dispatch latency**: Further reducing the latency of all-to-all dispatch operations through kernel optimizations and communication scheduling. +3. **Hiding communication latency via compute-communication overlap**: Achieving higher GPU utilization in communication-bound scenarios through more aggressive overlapping strategies. +4. **Expanding WideEP and Large-Scale Serving on GB300**: By utilizing GB300’s superior HBM and compute capabilities, we aim to further our WideEP and large-scale serving work, targeting higher TPGS with a reduced host footprint. + +For the most up-to-date reference, see [roadmap.vllm.ai](http://roadmap.vllm.ai). + +# Summary + +* vLLM achieves 26.2K prefill TPGS and 10.1K decode TPGS for DeepSeek-style MoE models, representing 3-5x improvement over H200. 
+* Lower-precision operations (NVFP4 GEMM, FP8 GEMM, NVFP4 dispatch) leverage GB200's enhanced tensor core capabilities. +* Kernel fusion reduces memory bandwidth pressure and kernel launch overhead. +* Scaling down prefill via weight offloading v2 reduces EP communication overhead while maintaining compute saturation. +* Chunking optimizations controlled via environment variables minimize overhead for large batch processing. + +# Team + +* Meta: Andrew Xia +* NVIDIA: Duncan Moss, Cyrus Chang, Andrew Briand, Siyuan Fu, Hanjie Qiu, Jason Li, Pavani Majety, Xin Li, Chirayu Garg, Abhinav Singh, Minseok Lee + +# References + +* [vLLM Large Scale Serving: DeepSeek @ 2.2k tok/s/H200 with Wide-EP](https://blog.vllm.ai/2025/12/17/large-scale-serving.html) +* [FlashInfer: Kernel Library for LLM Serving](https://github.com/flashinfer-ai/flashinfer) +* [NVIDIA GB200 NVL72 Architecture](https://www.nvidia.com/en-us/data-center/gb200-nvl72/) + +[decode_throughput_various]: /assets/figures/2026-02-03-dsr1-gb200/decode_throughput_various.png +[layer_group]: /assets/figures/2026-02-03-dsr1-gb200/layer_group.png +[mla_trtllm_ragged_prefill_prefill]: /assets/figures/2026-02-03-dsr1-gb200/mla_trtllm_ragged_prefill_prefill.png +[moe_flashinfer_trtllm_nvfp4_prefill]: /assets/figures/2026-02-03-dsr1-gb200/moe_flashinfer_trtllm_nvfp4_prefill.png +[nccl_all_gather]: /assets/figures/2026-02-03-dsr1-gb200/nccl_all_gather.png +[nccl_reduce_scatter]: /assets/figures/2026-02-03-dsr1-gb200/nccl_reduce_scatter.png +[onloading_trace]: /assets/figures/2026-02-03-dsr1-gb200/onloading_trace.png +[rope_quant_fusion_timeline]: /assets/figures/2026-02-03-dsr1-gb200/rope_quant_fusion_timeline.png +[topline_comparison]: /assets/figures/2026-02-03-dsr1-gb200/topline_comparison.png From 2e30e4de7eee4fa32e93ec2be00a50b518cd0888 Mon Sep 17 00:00:00 2001 From: Andrew Xia Date: Tue, 10 Feb 2026 16:49:33 -0800 Subject: [PATCH 2/7] updates Signed-off-by: Andrew Xia --- _posts/2026-02-15-responses-api.md | 229 ------------------------ _posts/2026-03-11-responses-api.md | 272 +++++++++++++++++++++++++++++ 2 files changed, 272 insertions(+), 229 deletions(-) delete mode 100644 _posts/2026-02-15-responses-api.md create mode 100644 _posts/2026-03-11-responses-api.md diff --git a/_posts/2026-02-15-responses-api.md b/_posts/2026-02-15-responses-api.md deleted file mode 100644 index 8641ce20..00000000 --- a/_posts/2026-02-15-responses-api.md +++ /dev/null @@ -1,229 +0,0 @@ ---- -layout: post -title: "Enabling ResponsesAPI and MCP on vLLM" -author: "Meta" -image: /assets/figures/2026-02-03-dsr1-gb200/topline_comparison.png -redirect_from: - - /2026/02/03/dsr1-gb200.html ---- - -# TODO: edit this file - -# Introduction - -Building on our [previous work](https://blog.vllm.ai/2025/12/17/large-scale-serving.html) achieving 2.2k tok/s/H200 decode throughput with wide-EP, the vLLM team has continued performance optimization efforts targeting NVIDIA's GB200 platform. This blog details the key optimizations that enable vLLM to achieve **26.2K prefill TPGS (tokens per GPU second)** and **10.1K decode TPGS on GB200** using workload of **2K input tokens** and **2K output tokens** for DeepSeek-style MoE models including DeepSeek R1/V3/V3.1. And the above numbers are collected through a deployment with 4 prefill instances (each with 2 GB200) and 1 decode instance (with 8 GB200), all utilizing a combination of data-parallelism (DP) and expert-parallelism (EP). 
- -These gains are driven by a combination of new optimizations: - -**New Optimizations:** - -* Lower-precision operations ([NVFP4](https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/) GEMM, FP8 GEMM, NVFP4 MoE Dispatch) -* Kernel fusion (RoPE+Quant+Q write, RoPE+Quant, Concat K) -* Scaling down prefill via weight offloading -* Minimized chunking overheads - -**Previously Discussed Features:** - -* Async scheduling -* Prefill/decode disaggregated serving - -The combination of GB200's increased compute capability and these targeted optimizations results in a significant throughput improvement over H200 deployments. - -# Results - -The following benchmarks compare vLLM performance on GB200 versus H200 for DeepSeek-V3/R1 workloads using a fixed workload of 2K input tokens and 2K output tokens. Detailed deployment setup can be found in the following table. - -*![][topline_comparison]* - -| Deployment setup | H200 | GB200 | -| :---- | :---- | :---- | -| Prefill | 16 GPUs | 8 GPUs (4 instances x 2 GPUs) | -| Decode | 32 GPUs | 8 GPUs (1 instance x 8 GPUs) | - -The GB200's increased memory bandwidth (8 TB/s vs 4.8 TB/s), higher compute throughput through FP4, and NVLink-C2C interconnect between CPU and GPU all contribute to these gains. We maximized this potential by applying the optimizations detailed below. - -We also benchmarked the DeepSeek-V3/R1 decode throughput on GB200 for a range of standard workloads, maintaining the same parallelism setup while varying the decode batch size that fully utilizates GPU memory. - -Instructions for reproducing all benchmark results can be found [here](https://github.com/vllm-project/vllm/issues/33583). - -![][decode_throughput_various] - -# Key Optimizations - -## Lower-Precision Operations - -GB200 introduces significantly higher throughput for FP4 and FP8 operations compared to H200. vLLM leverages these capabilities through several precision optimizations. - -### NVFP4 GEMM (MoE GEMMs, O-proj) - -DeepSeek-V3/R1 models can be quantized to FP4 precision for the MoE expert weights and output projection layers. vLLM integrates FlashInfer's TRTLLM-Gen GEMM kernels, which are specifically optimized for GB200's FP4 tensor cores. - -The FP4 checkpoint format stores weights in a packed 4-bit representation with per-group scaling factors. At runtime, the TRTLLM-Gen kernels dequantize on-the-fly within the tensor cores, achieving near-native FP4 throughput while maintaining model quality. - -Key implementation details: - -* FP4 weights with FP8 or FP16 scales stored in a packed format -* FlashInfer TRTLLM-Gen kernels optimized for GB200 tensor core scheduling -* Applied to MoE expert GEMMs and attention output projection (O-proj) - -### FP8 GEMM for MLA - -For DeepSeek's Multi-head Latent Attention (MLA), the query up-projection (from latent space to full query dimensions) benefits from FP8 quantization. Unlike the MoE layers where FP4 provides the best throughput/accuracy tradeoff, the attention projections are more sensitive to quantization and the accuracy benefits from FP8's higher precision. - -vLLM uses optimized FP8 GEMM kernels for these projections, achieving significant speedup over FP16 while maintaining attention quality. - -### NVFP4 MoE Dispatch - -Beyond the expert GEMMs themselves, the MoE dispatch operation—which routes tokens to their assigned experts—can also benefit from lower precision. vLLM implements NVFP4 dispatch, quantizing token activations to FP4 before the all-to-all communication. 
- -This reduces the all-to-all communication volume by 4x compared to FP16 dispatch, significantly decreasing inter-GPU communication latency in EP deployments. The quantization overhead is amortized across the communication savings, resulting in net throughput gains. - -## Kernel Fusion - -There are several kernel fusion strategies that reduce memory bandwidth consumption and kernel launch overhead by combining multiple operations into single GPU kernels. - -### RoPE \+ Quant \+ Q Write (Decode) - -During decode, the query projection requires: - -1. RoPE (Rotary Position Embedding) application -2. Quantization for the subsequent GEMM -3. Writing to the query buffer - -vLLM fuses these three operations into a single kernel, eliminating two intermediate memory round-trips. - -

- -
-RoPE+Quant+Q Write Fusion in Decode -

- -### RoPE \+ Quant (Prefill) - -Similarly for prefill, RoPE application and quantization are fused. The prefill path handles larger token batches, making the memory bandwidth savings from fusion even more impactful. - -### Concat K Optimization - -For MLA key projections, vLLM implements an optimized concatenation operation using FlashInfer's `concat_mla_k` kernel. In DeepSeek's MLA architecture, the key tensor is composed of two parts: the non-positional embedding part (k\_nope, per-head) and the rotary positional embedding part (k\_rope, shared across all heads). These must be concatenated to form the full key tensor. - -The naive approach requires copying k\_nope and broadcasting k\_rope across all 128 heads, resulting in significant memory bandwidth consumption. FlashInfer's `concat_mla_k` kernel implements several optimizations: - -* **Warp-based processing**: Each warp handles one (token, head\_chunk) pair, processing 16 heads at a time -* **Vectorized memory access**: Uses 8-byte vector loads for nope data and 4-byte loads for rope data, maximizing memory throughput -* **Software pipelining with L2 prefetching**: Prefetches the next row while processing the current row, hiding memory latency -* **Register reuse for rope values**: Since rope is shared across all heads, it is loaded once into registers and written to all 16 heads in the chunk, avoiding redundant memory loads - -## Scaling Down Prefill - -### Why Scaling Down Makes Sense - -When considering GPU count for throughput-oriented inference serving, we typically scale out either to fit the model or to shard memory (experts, context) to increase batch size. However, for prefill workloads that are already compute-bounded, reducing GPU count can actually improve throughput by reducing communication overhead. - -Our microbenchmarks show that MLA backend throughput performance starts plateauing when batch size increases from 16K to 64K tokens. Beyond 64K tokens, MoE throughput gains are also negligible. This means we can saturate compute utilization with a batch size that fits in a 2-GPU serving setup. - -

- - -
-MLA and MoE throughput plateau at ~64K batch size -

- -By reducing GPU count from 4 to 2, we halve the NCCL collectives (all\_gather and reduce\_scatter) for EP communication, significantly reducing communication overhead. - -

- - -
-Reducing EP degree halves communication overhead -

- -### Weight Offloading v2 - -To reduce GPU memory footprint while maintaining performance, vLLM implements weight offloading v2 with asynchronous prefetching. This v2 implementation was inspired by the offloading approach in [SGLang prefill](https://github.com/sgl-project/sglang/pull/8034) and now adapted for additional compatibility with torch.compile and CUDA graph within vLLM. - -In vLLM weight offloading v1, offloaded weights stayed on CPU and were accessed via Unified Virtual Addressing (UVA), which incurs slow PCIe transfer delays. This was intended as a last resort for running models with limited GPU resources. - -Weight offloading v2 takes a different approach: it explicitly copies (onloads) weights to GPU in advance. The key innovation is onloading the weights of the next layer asynchronously on a separate CUDA stream. By carefully overlapping weight onloading with kernel execution, the onloading delay can be completely hidden. - -Users configure offloading via group-based selection: -![][layer_group] - -* `group_size`: Group every N layers together -* `num_in_group`: Offload this many layers per group (last N of each group) -* `prefetch_step`: Number of layers to prefetch ahead - -For DeepSeek-R1 prefill serving, we offload one of every two MoE GEMM weights, achieving significant memory savings while maintaining full throughput. - -

- -
-Trace showing weight onload overlapping with layer execution -

- -GB200's NVLink-C2C connection between CPU and GPU makes weight offloading v2 particularly effective, as the loading latency is minimized compared to PCIe-based systems. - -## Minimize Chunking Overheads - -Large batch processing in MoE models requires chunking to fit within GPU memory constraints. However, smaller chunks introduce overhead from repeated kernel launches and synchronization, creating GPU bubbles. vLLM provides chunk size configuration options to maximize throughput while staying within memory limits. - -### MoE DP Chunk - -When using Data Parallel with Expert Parallel (DP+EP), tokens are dispatched from each DP rank in coordinated chunks. The `VLLM_ENABLE_MOE_DP_CHUNK` flag (enabled by default) enables this chunking behavior. - -Larger chunk sizes reduce GPU bubbles by amortizing dispatch/combine overhead across more tokens. The chunk size is controlled by `VLLM_MOE_DP_CHUNK_SIZE` (default: 256 tokens). Increasing this value improves throughput by reducing synchronization frequency. - -For GB200, we disable MoE DP chunking (`VLLM_ENABLE_MOE_DP_CHUNK=0`) for prefill and set `VLLM_MOE_DP_CHUNK_SIZE` to match the batch size for decode. - -### MoE Activation Chunk - -For large prefill batches, vLLM chunks activation tensors to process subsets of tokens through the MoE layers. The `VLLM_ENABLE_FUSED_MOE_ACTIVATION_CHUNKING` flag controls this behavior (enabled by default). - -Larger chunk sizes improve throughput by reducing launch overhead and providing sufficient work to fully utilize GPU compute. The chunk size is controlled by `VLLM_FUSED_MOE_CHUNK_SIZE` (default: 16K tokens). The optimal setting maximizes chunk size within available GPU memory. - -For GB200, we disable activation chunking (`VLLM_ENABLE_FUSED_MOE_ACTIVATION_CHUNKING=0`) to maximize throughput, as the larger memory capacity accommodates full batches without chunking. - -### Output Processing Chunk - -In the V1 engine's async serving path, output processing (logit computation, sampling, response generation) is chunked. The `VLLM_V1_OUTPUT_PROC_CHUNK_SIZE` controls the number of outputs processed per iteration (default: 128). - -Larger chunk sizes improve overall throughput by reducing per-chunk overhead. However, for streaming workloads, very large chunks may increase inter-message latency variance. For throughput-optimized decode on GB200, we set the chunk size to 2048\. - -# Future Work - -The vLLM team is actively working on the following improvements for GB200 deployments: - -1. **Improving load balancedness and scaling up EP**: Extending expert load balancing to handle larger EP degrees and more dynamic workloads, with improved rebalancing algorithms. -2. **Optimizing MoE dispatch latency**: Further reducing the latency of all-to-all dispatch operations through kernel optimizations and communication scheduling. -3. **Hiding communication latency via compute-communication overlap**: Achieving higher GPU utilization in communication-bound scenarios through more aggressive overlapping strategies. -4. **Expanding WideEP and Large-Scale Serving on GB300**: By utilizing GB300’s superior HBM and compute capabilities, we aim to further our WideEP and large-scale serving work, targeting higher TPGS with a reduced host footprint. - -For the most up-to-date reference, see [roadmap.vllm.ai](http://roadmap.vllm.ai). - -# Summary - -* vLLM achieves 26.2K prefill TPGS and 10.1K decode TPGS for DeepSeek-style MoE models, representing 3-5x improvement over H200. 
-* Lower-precision operations (NVFP4 GEMM, FP8 GEMM, NVFP4 dispatch) leverage GB200's enhanced tensor core capabilities. -* Kernel fusion reduces memory bandwidth pressure and kernel launch overhead. -* Scaling down prefill via weight offloading v2 reduces EP communication overhead while maintaining compute saturation. -* Chunking optimizations controlled via environment variables minimize overhead for large batch processing. - -# Team - -* Meta: Andrew Xia -* NVIDIA: Duncan Moss, Cyrus Chang, Andrew Briand, Siyuan Fu, Hanjie Qiu, Jason Li, Pavani Majety, Xin Li, Chirayu Garg, Abhinav Singh, Minseok Lee - -# References - -* [vLLM Large Scale Serving: DeepSeek @ 2.2k tok/s/H200 with Wide-EP](https://blog.vllm.ai/2025/12/17/large-scale-serving.html) -* [FlashInfer: Kernel Library for LLM Serving](https://github.com/flashinfer-ai/flashinfer) -* [NVIDIA GB200 NVL72 Architecture](https://www.nvidia.com/en-us/data-center/gb200-nvl72/) - -[decode_throughput_various]: /assets/figures/2026-02-03-dsr1-gb200/decode_throughput_various.png -[layer_group]: /assets/figures/2026-02-03-dsr1-gb200/layer_group.png -[mla_trtllm_ragged_prefill_prefill]: /assets/figures/2026-02-03-dsr1-gb200/mla_trtllm_ragged_prefill_prefill.png -[moe_flashinfer_trtllm_nvfp4_prefill]: /assets/figures/2026-02-03-dsr1-gb200/moe_flashinfer_trtllm_nvfp4_prefill.png -[nccl_all_gather]: /assets/figures/2026-02-03-dsr1-gb200/nccl_all_gather.png -[nccl_reduce_scatter]: /assets/figures/2026-02-03-dsr1-gb200/nccl_reduce_scatter.png -[onloading_trace]: /assets/figures/2026-02-03-dsr1-gb200/onloading_trace.png -[rope_quant_fusion_timeline]: /assets/figures/2026-02-03-dsr1-gb200/rope_quant_fusion_timeline.png -[topline_comparison]: /assets/figures/2026-02-03-dsr1-gb200/topline_comparison.png diff --git a/_posts/2026-03-11-responses-api.md b/_posts/2026-03-11-responses-api.md new file mode 100644 index 00000000..eda89742 --- /dev/null +++ b/_posts/2026-03-11-responses-api.md @@ -0,0 +1,272 @@ +--- +layout: post +title: "Enabling the Responses API and MCP on vLLM" +author: "Meta" +image: /assets/logos/vllm-logo-text-light.png +--- + +The OpenAI **Responses API** is the successor to the Chat Completions API, designed to support agentic workflows with built-in tool use, multi-turn conversation management, and streaming. We have implemented the Responses API in vLLM, enabling any model served by vLLM to participate in agentic pipelines that call tools, execute code, search the web, and reason through complex tasks -- all through a single `POST /v1/responses` endpoint. + +This blog post covers: + +- **The Responses API implementation** in vLLM: endpoint design, streaming and non-streaming modes, and the full set of supported features +- **MCP (Model Context Protocol) integration**: how vLLM connects to external tool servers and executes tool calls during generation +- **Two context architectures**: HarmonyContext for GPT-OSS models and ParsableContext for all other models + +## The Responses API + +### Endpoint Overview + +vLLM exposes three endpoints under the Responses API: + +| Endpoint | Method | Description | +|---|---|---| +| `/v1/responses` | POST | Create a new response (streaming or non-streaming) | +| `/v1/responses/{response_id}` | GET | Retrieve a stored response | +| `/v1/responses/{response_id}/cancel` | POST | Cancel a background response | + +The primary endpoint is `POST /v1/responses`. 
It accepts a `ResponsesRequest` with an `input` (a string or a list of conversation items), an optional `instructions` field for system messages, and a `tools` list for tool definitions. The `stream` flag determines whether the response is returned as a single JSON object or as a stream of Server-Sent Events (SSE). + +### Non-Streaming Mode + +In non-streaming mode, vLLM generates the full response before returning it. The response includes the model's output items (text messages, function calls, reasoning content), token usage statistics, and a final status (`completed`, `incomplete`, or `failed`). + +```bash +curl -X POST http://localhost:8000/v1/responses \ + -H "Content-Type: application/json" \ + -d '{ + "model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct", + "input": "What is the capital of France?", + "stream": false + }' +``` + +### Streaming Mode + +When `stream: true`, vLLM emits events as SSE with monotonically increasing sequence numbers. The event lifecycle follows a structured pattern: + +1. `response.created` and `response.in_progress` -- the response object is initialized +2. For each output item: + - `response.output_item.added` -- a new output item begins (message, reasoning, function_call, etc.) + - Content-specific delta events (e.g., `response.output_text.delta` for text tokens) + - Content done events (e.g., `response.output_text.done`) + - `response.output_item.done` -- the output item is finalized +3. `response.completed` -- the full response with usage statistics + +This event structure matches the OpenAI Responses API specification, making vLLM a drop-in replacement for clients already using the Responses API. + +```bash +curl -X POST http://localhost:8000/v1/responses \ + -H "Content-Type: application/json" \ + -d '{ + "model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct", + "input": "Explain quicksort in one paragraph.", + "stream": true + }' +``` + +### Supported Features + +The vLLM Responses API supports a broad set of features: + +**Function Calling.** Tools of type `function` can be defined in the `tools` list. The model's output is parsed for tool calls using configurable tool parsers (Hermes, Llama, Mistral, etc.). The `tool_choice` parameter supports `"auto"`, `"none"`, `"required"`, or a named function. When the model emits a function call, it is returned as a `function_call` output item with streaming events `response.function_call_arguments.delta` and `response.function_call_arguments.done`. + +**Reasoning.** The `reasoning` parameter (with an `effort` field) enables chain-of-thought reasoning. Reasoning content is tracked separately from regular output and appears as `ResponseReasoningItem` output items. Streaming emits `response.reasoning_text.delta` and `response.reasoning_text.done` events, allowing clients to display the model's thinking process in real time. + +**Structured Output.** The `text.format` field supports JSON Schema-constrained generation. When a JSON schema is provided, vLLM enforces the schema during decoding using guided generation, ensuring the output is valid JSON conforming to the specified schema. + +**Logprobs.** When `include` contains `"message.output_text.logprobs"`, the response includes per-token log probabilities. The `top_logprobs` parameter controls how many top alternatives are returned per token position. + +**Background Mode.** Setting `background: true` (with `store: true`) queues the response for asynchronous generation. 
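For example, a minimal sketch of the background flow with the OpenAI Python SDK (the `background`/`store` parameter spellings follow the upstream SDK, and the polling loop is illustrative):

```python
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Queue the response for asynchronous generation.
resp = client.responses.create(
    model="meta-llama/Llama-4-Maverick-17B-128E-Instruct",
    input="Summarize the history of the transistor.",
    background=True,
    store=True,
)
print(resp.status)  # "queued"

# Poll until the background response finishes, then read the output.
while resp.status in ("queued", "in_progress"):
    time.sleep(1)
    resp = client.responses.retrieve(resp.id)
print(resp.output_text)
```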
The response is returned immediately with status `"queued"` and can be polled or retrieved later via `GET /v1/responses/{response_id}`. Background responses can be cancelled via the cancel endpoint. + +**Conversation Continuation.** The `previous_response_id` field chains responses together for multi-turn conversations. The previous response's output items are prepended to the new request's input, enabling stateful multi-turn dialogue without the client needing to manage conversation history. + +**vLLM-Specific Extensions.** Beyond the standard API, vLLM adds parameters for `priority` (request scheduling priority), `cache_salt` (prefix cache isolation), `seed` (deterministic sampling), `repetition_penalty`, custom `stop` sequences, and `enable_response_messages` (returns raw prompt and output token IDs for debugging). + +## MCP: Model Context Protocol Integration + +The **Model Context Protocol (MCP)** allows LLMs to call external tools during generation. vLLM implements MCP as a first-class feature of the Responses API: when a model generates a tool call, vLLM intercepts it, calls the appropriate MCP tool server, and feeds the result back to the model for the next turn of generation -- all within a single API request. + +### Built-in Tools + +vLLM supports three categories of built-in tools: + +**Web Search** (`web_search_preview`). Enables the model to search the web during generation. Streaming events follow: `response.web_search_call.in_progress` -> `response.web_search_call.searching` -> `response.web_search_call.completed`. The search results are injected back into the conversation for the model to synthesize. + +**Code Interpreter** (`code_interpreter`). Enables the model to write and execute Python code in a sandboxed Docker environment. Streaming events include `response.code_interpreter_call_code.delta` for the generated code and `response.code_interpreter_call.completed` for the execution result. + +**Container** (`container`). Enables the model to execute shell commands in a stateful Docker container, supporting arguments like `cmd`, `workdir`, `env`, and `timeout`. + +### Tool Server Architecture + +vLLM provides two `ToolServer` implementations: + +**`MCPToolServer`**: Connects to external MCP-compatible tool servers over SSE. Multiple servers can be specified via comma-separated URLs. Each server exposes its tools via the MCP protocol, and vLLM discovers available tools at startup via `session.list_tools()`. Tool sessions are created per-request with unique session IDs. + +**`DemoToolServer`**: A lightweight local alternative that uses built-in tool implementations (Exa-based web search, Docker-based Python execution) without requiring an external MCP server. This is useful for development and testing. + +```bash +# Starting vLLM with an MCP tool server +vllm serve meta-llama/Llama-4-Maverick-17B-128E-Instruct \ + --enable-auto-tool-choice \ + --tool-call-parser hermes \ + --tool-server-url http://localhost:3001/sse +``` + +### Agentic Loop + +The core of MCP integration is the agentic loop in `_generate_with_builtin_tools`. This loop: + +1. Generates tokens from the model +2. Checks if the model requested a tool call (`need_builtin_tool_call()`) +3. If yes, calls the tool via the MCP session (`call_tool()`) +4. Appends the tool result to the conversation context +5. Renders the updated conversation as a new prompt +6. 
Repeats from step 1 + +This loop continues until the model produces a final response without requesting a tool call, or until the `max_tool_calls` limit is reached. The entire multi-turn interaction happens server-side within a single API request. + +### MCP Streaming Events + +When streaming is enabled with MCP tools, vLLM emits fine-grained events for each tool call: + +``` +response.mcp_call.in_progress -- tool call begins +response.mcp_call_arguments.delta -- argument tokens stream in +response.mcp_call_arguments.done -- arguments are complete +response.mcp_call.completed -- tool execution result is available +``` + +This allows clients to display the model's tool interactions in real time, showing what tool is being called, with what arguments, and what the result was. + +## Context Architecture: HarmonyContext vs. ParsableContext + +A key design decision in the Responses API is how to manage the conversation state during multi-turn tool-calling loops. vLLM implements two context architectures to support different model families. + +### HarmonyContext (GPT-OSS Models) + +`HarmonyContext` is designed for GPT-OSS models that use OpenAI's Harmony message format. These models use a channel-based parsing system where the model's output is split into channels (`analysis` for reasoning, `commentary`, and `final` for the actual response). The context tracks messages in the Harmony `Message` format and uses the Harmony tokenizer's `render_for_completion()` to produce token IDs for the next turn. + +Key characteristics: +- Uses `openai_harmony` message types (`Author`, `Message`, `Role`, `StreamState`, `TextContent`) +- Tool recipients are identified by message `recipient` field (e.g., `browser.search`, `python`, `container.exec`) +- Token rendering uses the Harmony encoding's stop tokens for assistant actions +- Per-turn token metrics (input, output, cached, tool output tokens) are tracked for accurate usage reporting + +`StreamingHarmonyContext` extends this for token-by-token streaming, processing each token through the Harmony parser and tracking parser state transitions to emit the correct streaming events. + +### ParsableContext (All Other Models) + +`ParsableContext` is the context for non-GPT-OSS models (Llama, Mistral, Qwen, etc.). It uses vLLM's standard chat template system to render conversations and parses tool calls from the model output using configurable tool parsers. + +Key characteristics: +- Uses `ResponseInputOutputItem` types from the OpenAI SDK (e.g., `ResponseFunctionToolCall`, `ResponseFunctionToolCallOutputItem`) +- Tool calls are identified by the `name` field matching built-in tool names (`code_interpreter`, `web_search_preview`, `container`) +- Prompt rendering uses vLLM's chat template system via `_render_next_turn()` +- Supports the `VLLM_USE_EXPERIMENTAL_PARSER_CONTEXT` environment variable to enable this context + +### SimpleContext + +For non-tool-calling scenarios, `SimpleContext` provides a lightweight context that accumulates raw text and token IDs without any parsing overhead. It is the default for models that do not have tool use enabled. 
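Putting these pieces together, the experimental parser context can be enabled at serve time. The sketch below reuses the model, parser, and tool-server flags shown elsewhere in this post; the environment variable value is illustrative:

```bash
# Illustrative: serve a non-GPT-OSS model with ParsableContext enabled
# so that built-in/MCP tool calls are handled server-side.
export VLLM_USE_EXPERIMENTAL_PARSER_CONTEXT=1

vllm serve meta-llama/Llama-4-Maverick-17B-128E-Instruct \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --tool-server-url http://localhost:3001/sse
```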
+ +### Choosing the Right Context + +The context is selected automatically based on the model type and request configuration: + +| Condition | Context Used | +|---|---| +| GPT-OSS model, streaming | `StreamingHarmonyContext` | +| GPT-OSS model, non-streaming | `HarmonyContext` | +| Non-GPT-OSS, tools enabled, experimental parser | `ParsableContext` | +| Non-GPT-OSS, no tools or simple request | `SimpleContext` | + +## Token Usage Tracking + +The Responses API provides detailed token usage information in the response, including: + +- `input_tokens`: Total prompt tokens across all turns +- `output_tokens`: Total generated tokens +- `input_tokens_details.cached_tokens`: Tokens served from the prefix cache +- `output_tokens_details.reasoning_tokens`: Tokens spent on reasoning (chain-of-thought) + +For multi-turn tool-calling interactions, the `TurnMetrics` class tracks per-turn metrics. Tool output tokens are calculated as the difference between consecutive turns' prompt sizes minus the previous turn's output, capturing the token cost of tool results injected between turns. + +## Getting Started + +To use the Responses API with vLLM: + +```bash +# Basic serving +vllm serve meta-llama/Llama-4-Maverick-17B-128E-Instruct + +# With tool calling support +vllm serve meta-llama/Llama-4-Maverick-17B-128E-Instruct \ + --enable-auto-tool-choice \ + --tool-call-parser hermes + +# With MCP tool server +vllm serve meta-llama/Llama-4-Maverick-17B-128E-Instruct \ + --enable-auto-tool-choice \ + --tool-call-parser hermes \ + --tool-server-url http://localhost:3001/sse +``` + +Then use the OpenAI Python SDK to make requests: + +```python +from openai import OpenAI + +client = OpenAI(base_url="http://localhost:8000/v1") + +# Non-streaming +response = client.responses.create( + model="meta-llama/Llama-4-Maverick-17B-128E-Instruct", + input="What is the capital of France?", +) +print(response.output_text) + +# Streaming +stream = client.responses.create( + model="meta-llama/Llama-4-Maverick-17B-128E-Instruct", + input="Explain quicksort step by step.", + stream=True, +) +for event in stream: + if event.type == "response.output_text.delta": + print(event.delta, end="", flush=True) + +# With function calling +response = client.responses.create( + model="meta-llama/Llama-4-Maverick-17B-128E-Instruct", + input="What is the weather in San Francisco?", + tools=[{ + "type": "function", + "name": "get_weather", + "description": "Get the current weather for a location.", + "parameters": { + "type": "object", + "properties": { + "location": {"type": "string"} + }, + "required": ["location"] + } + }], +) +``` + +## Future Work + +- **Response storage with eviction policies**: Currently, stored responses are held in memory with no eviction. We plan to add configurable TTLs and storage backends. +- **Expanded tool support**: Adding more built-in tool types and improving MCP server discovery. +- **Parallel tool calls**: Improving support for models that emit multiple tool calls in a single turn. +- **Optimized multi-turn performance**: Leveraging prefix caching more aggressively across tool-calling turns to reduce redundant computation. + +To see more details about the future work and explore opportunities to contribute, please see this vLLM feature development map: https://github.com/vllm-project/vllm/issues/34857 + +## Acknowledgements + +This work was a collaboration across the vLLM community. Thanks to all contributors who helped design and implement the Responses API and MCP integration. 
+ +**Meta**: Andrew Xia, Daniel Salib, Ye Hu, Alec Solder, Ye (Charlotte Qi) + +**vLLM**: Chauncey Jiang From 01300cb5c3343b6ba6275b2ad5082c82a2de5700 Mon Sep 17 00:00:00 2001 From: Andrew Xia Date: Wed, 11 Mar 2026 10:09:04 -0700 Subject: [PATCH 3/7] more Signed-off-by: Andrew Xia --- _posts/2026-03-11-responses-api.md | 34 ++++++++++++++++++------------ 1 file changed, 21 insertions(+), 13 deletions(-) diff --git a/_posts/2026-03-11-responses-api.md b/_posts/2026-03-11-responses-api.md index eda89742..4a7b3cee 100644 --- a/_posts/2026-03-11-responses-api.md +++ b/_posts/2026-03-11-responses-api.md @@ -1,6 +1,6 @@ --- layout: post -title: "Enabling the Responses API and MCP on vLLM" +title: "Enabling Responses API and MCP on vLLM" author: "Meta" image: /assets/logos/vllm-logo-text-light.png --- @@ -15,6 +15,10 @@ This blog post covers: ## The Responses API +**Responses API** is a modern interface for interacting with large language models that unifies text generation, multimodal inputs, and tool use into a single API primitive. Introduced as the successor to earlier interfaces like Chat Completions and Assistants, it provides a flexible abstraction for building agentic applications—allowing models to generate structured outputs, call tools, maintain conversation state, and integrate external data sources in one request. The API treats a “response” as the fundamental unit of interaction, combining inputs, model reasoning, tool calls, and outputs into a structured object. Developers can learn more about the official specification in the OpenAI Responses API documentation (https://developers.openai.com/api/reference/resources/responses +). At the same time, efforts like OpenResponses Initiative (https://www.openresponses.org/ +) aim to define an open, provider-agnostic standard inspired by this interface, enabling interoperable tooling and reducing vendor lock-in across LLM platforms. + ### Endpoint Overview vLLM exposes three endpoints under the Responses API: @@ -37,7 +41,6 @@ curl -X POST http://localhost:8000/v1/responses \ -d '{ "model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct", "input": "What is the capital of France?", - "stream": false }' ``` @@ -65,27 +68,34 @@ curl -X POST http://localhost:8000/v1/responses \ }' ``` +See https://www.openresponses.org/specification#streaming for more details on the streaming protocol and the types of events that are streamed. + ### Supported Features The vLLM Responses API supports a broad set of features: -**Function Calling.** Tools of type `function` can be defined in the `tools` list. The model's output is parsed for tool calls using configurable tool parsers (Hermes, Llama, Mistral, etc.). The `tool_choice` parameter supports `"auto"`, `"none"`, `"required"`, or a named function. When the model emits a function call, it is returned as a `function_call` output item with streaming events `response.function_call_arguments.delta` and `response.function_call_arguments.done`. +**Tool Calling (Function and MCP)** Tools of type `function` can be defined in the `tools` list. The model's output is parsed for tool calls using configurable tool parsers (Hermes, Llama, Mistral, etc.). The `tool_choice` parameter supports `"auto"`, `"none"`, `"required"`, or a named function. When the model emits a function call, it is returned as a `function_call` output item with streaming events `response.function_call_arguments.delta` and `response.function_call_arguments.done`. 
+ +TODO -**Reasoning.** The `reasoning` parameter (with an `effort` field) enables chain-of-thought reasoning. Reasoning content is tracked separately from regular output and appears as `ResponseReasoningItem` output items. Streaming emits `response.reasoning_text.delta` and `response.reasoning_text.done` events, allowing clients to display the model's thinking process in real time. +**Reasoning.** The `reasoning` parameter (with an `effort` field) enables chain-of-thought reasoning. Reasoning content is tracked separately from regular output and appears as `ResponseReasoningItem` output items. Streaming emits `response.reasoning_text.delta` and `response.reasoning_text.done` events, allowing clients to display the model's thinking process in real time (see https://github.com/vllm-project/vllm/pull/29947) -**Structured Output.** The `text.format` field supports JSON Schema-constrained generation. When a JSON schema is provided, vLLM enforces the schema during decoding using guided generation, ensuring the output is valid JSON conforming to the specified schema. +**Structured Output.** The `structured_outputs` field supports JSON Schema-constrained generation. When a JSON schema is provided, vLLM enforces the schema during decoding using guided generation, ensuring the output is valid JSON conforming to the specified schema. When a choice is specified, vLLM will only output in final output from the options listed. (see https://github.com/vllm-project/vllm/pull/33709). **Logprobs.** When `include` contains `"message.output_text.logprobs"`, the response includes per-token log probabilities. The `top_logprobs` parameter controls how many top alternatives are returned per token position. **Background Mode.** Setting `background: true` (with `store: true`) queues the response for asynchronous generation. The response is returned immediately with status `"queued"` and can be polled or retrieved later via `GET /v1/responses/{response_id}`. Background responses can be cancelled via the cancel endpoint. -**Conversation Continuation.** The `previous_response_id` field chains responses together for multi-turn conversations. The previous response's output items are prepended to the new request's input, enabling stateful multi-turn dialogue without the client needing to manage conversation history. + + **vLLM-Specific Extensions.** Beyond the standard API, vLLM adds parameters for `priority` (request scheduling priority), `cache_salt` (prefix cache isolation), `seed` (deterministic sampling), `repetition_penalty`, custom `stop` sequences, and `enable_response_messages` (returns raw prompt and output token IDs for debugging). +**Debugging** We also have implemented the ability to return raw input and output tokens for responsesAPI. (See https://github.com/vllm-project/vllm/pull/29549) + ## MCP: Model Context Protocol Integration -The **Model Context Protocol (MCP)** allows LLMs to call external tools during generation. vLLM implements MCP as a first-class feature of the Responses API: when a model generates a tool call, vLLM intercepts it, calls the appropriate MCP tool server, and feeds the result back to the model for the next turn of generation -- all within a single API request. +The **Model Context Protocol (MCP)** allows LLMs to call external tools during generation, with vLLM handling the tool calling instead of function tools, in which the client is responsible for handling tool calls. 
vLLM implements MCP as a first-class feature of the Responses API: when a model generates a tool call, vLLM intercepts it, calls the appropriate MCP tool server, and feeds the result back to the model for the next turn of generation -- all within a single API request. ### Built-in Tools @@ -103,8 +113,6 @@ vLLM provides two `ToolServer` implementations: **`MCPToolServer`**: Connects to external MCP-compatible tool servers over SSE. Multiple servers can be specified via comma-separated URLs. Each server exposes its tools via the MCP protocol, and vLLM discovers available tools at startup via `session.list_tools()`. Tool sessions are created per-request with unique session IDs. -**`DemoToolServer`**: A lightweight local alternative that uses built-in tool implementations (Exa-based web search, Docker-based Python execution) without requiring an external MCP server. This is useful for development and testing. - ```bash # Starting vLLM with an MCP tool server vllm serve meta-llama/Llama-4-Maverick-17B-128E-Instruct \ @@ -256,10 +264,10 @@ response = client.responses.create( ## Future Work -- **Response storage with eviction policies**: Currently, stored responses are held in memory with no eviction. We plan to add configurable TTLs and storage backends. -- **Expanded tool support**: Adding more built-in tool types and improving MCP server discovery. -- **Parallel tool calls**: Improving support for models that emit multiple tool calls in a single turn. -- **Optimized multi-turn performance**: Leveraging prefix caching more aggressively across tool-calling turns to reduce redundant computation. +- **Offloading response storage & API Layer**: Currently, stored responses are held in memory with no eviction. We would like to support offloading responsesAPI state management to a third party database, and potentially offload the API layer outside of the core vLLM engine (see https://github.com/vllm-project/vllm/issues/26934) +- **Expanded tool support**: Add MCP support for all tools. (See https://github.com/vllm-project/vllm/issues/30115) +- **Open Responses Conformity**: OpenResponses (https://www.openresponses.org/ +), launched in January 2026, is an open initiative to standardize a vendor-neutral API for LLM interactions based on the Responses API abstraction, covering structured outputs, tool calls, multimodal inputs, and reasoning traces. For vLLM, supporting OpenResponses enables compatibility with a growing ecosystem of agent frameworks and SDKs while giving users a portable interface that works across both hosted APIs and open-source model deployments. To see more details about the future work and explore opportunities to contribute, please see this vLLM feature development map: https://github.com/vllm-project/vllm/issues/34857 From f1923268540c5f109b4156fb7ad09e8c265a4459 Mon Sep 17 00:00:00 2001 From: Andrew Xia Date: Tue, 17 Mar 2026 09:23:40 -0700 Subject: [PATCH 4/7] more Signed-off-by: Andrew Xia --- _posts/2026-03-11-responses-api.md | 38 +++++++++++++++--------------- 1 file changed, 19 insertions(+), 19 deletions(-) diff --git a/_posts/2026-03-11-responses-api.md b/_posts/2026-03-11-responses-api.md index 4a7b3cee..bb016504 100644 --- a/_posts/2026-03-11-responses-api.md +++ b/_posts/2026-03-11-responses-api.md @@ -39,7 +39,7 @@ In non-streaming mode, vLLM generates the full response before returning it. 
The curl -X POST http://localhost:8000/v1/responses \ -H "Content-Type: application/json" \ -d '{ - "model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct", + "model": "Qwen/Qwen3-8B", "input": "What is the capital of France?", }' ``` @@ -62,7 +62,7 @@ This event structure matches the OpenAI Responses API specification, making vLLM curl -X POST http://localhost:8000/v1/responses \ -H "Content-Type: application/json" \ -d '{ - "model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct", + "model": "Qwen/Qwen3-8B", "input": "Explain quicksort in one paragraph.", "stream": true }' @@ -115,7 +115,7 @@ vLLM provides two `ToolServer` implementations: ```bash # Starting vLLM with an MCP tool server -vllm serve meta-llama/Llama-4-Maverick-17B-128E-Instruct \ +vllm serve Qwen/Qwen3-8B \ --enable-auto-tool-choice \ --tool-call-parser hermes \ --tool-server-url http://localhost:3001/sse @@ -204,16 +204,15 @@ For multi-turn tool-calling interactions, the `TurnMetrics` class tracks per-tur To use the Responses API with vLLM: ```bash -# Basic serving -vllm serve meta-llama/Llama-4-Maverick-17B-128E-Instruct - -# With tool calling support -vllm serve meta-llama/Llama-4-Maverick-17B-128E-Instruct \ - --enable-auto-tool-choice \ - --tool-call-parser hermes +# with Tool calling and reasoning +vllm serve Qwen/Qwen3-8B \ +--reasoning-parser qwen3 \ +--tool-call-parser qwen3 \ +--enable-auto-tool-choice # With MCP tool server -vllm serve meta-llama/Llama-4-Maverick-17B-128E-Instruct \ +# TODO: Kimi K2 here +vllm serve Qwen/Qwen3-8B \ --enable-auto-tool-choice \ --tool-call-parser hermes \ --tool-server-url http://localhost:3001/sse @@ -228,24 +227,23 @@ client = OpenAI(base_url="http://localhost:8000/v1") # Non-streaming response = client.responses.create( - model="meta-llama/Llama-4-Maverick-17B-128E-Instruct", + model="Qwen/Qwen3-8B", input="What is the capital of France?", ) print(response.output_text) # Streaming stream = client.responses.create( - model="meta-llama/Llama-4-Maverick-17B-128E-Instruct", + model="Qwen/Qwen3-8B", input="Explain quicksort step by step.", stream=True, ) for event in stream: - if event.type == "response.output_text.delta": - print(event.delta, end="", flush=True) + print(event) # With function calling response = client.responses.create( - model="meta-llama/Llama-4-Maverick-17B-128E-Instruct", + model="Qwen/Qwen3-8B", input="What is the weather in San Francisco?", tools=[{ "type": "function", @@ -273,8 +271,10 @@ To see more details about the future work and explore opportunities to contribut ## Acknowledgements -This work was a collaboration across the vLLM community. Thanks to all contributors who helped design and implement the Responses API and MCP integration. +This work was a collaboration across the vLLM community. 
Thanks to all contributors who helped design and implement the Responses API and MCP integration, including the following (but not limited to): + +**Meta**: Andrew Xia, Daniel Salib, Ye Hu, Zhiwei Zhao, Alec Solder, Ye (Charlotte) Qi -**Meta**: Andrew Xia, Daniel Salib, Ye Hu, Alec Solder, Ye (Charlotte Qi) +**DaoCloud**: Chauncey Jiang -**vLLM**: Chauncey Jiang +**RedHat**: Flora Feng, Ben Browning, Michael Goin From 84d605bb664f43b27a5eb23b75693f2b02007c9b Mon Sep 17 00:00:00 2001 From: Andrew Xia Date: Tue, 17 Mar 2026 10:01:13 -0700 Subject: [PATCH 5/7] more Signed-off-by: Andrew Xia --- _posts/2026-03-11-responses-api.md | 69 +++++++++++++++++++----------- 1 file changed, 43 insertions(+), 26 deletions(-) diff --git a/_posts/2026-03-11-responses-api.md b/_posts/2026-03-11-responses-api.md index bb016504..f299e187 100644 --- a/_posts/2026-03-11-responses-api.md +++ b/_posts/2026-03-11-responses-api.md @@ -90,8 +90,9 @@ TODO **vLLM-Specific Extensions.** Beyond the standard API, vLLM adds parameters for `priority` (request scheduling priority), `cache_salt` (prefix cache isolation), `seed` (deterministic sampling), `repetition_penalty`, custom `stop` sequences, and `enable_response_messages` (returns raw prompt and output token IDs for debugging). +TODO: add link -**Debugging** We also have implemented the ability to return raw input and output tokens for responsesAPI. (See https://github.com/vllm-project/vllm/pull/29549) +**Debugging** We also have implemented the ability to return raw input and output tokens for responsesAPI, you can enable this by using the `enable_response_messages` flag. (See https://github.com/vllm-project/vllm/pull/29549) ## MCP: Model Context Protocol Integration @@ -147,13 +148,13 @@ response.mcp_call.completed -- tool execution result is available This allows clients to display the model's tool interactions in real time, showing what tool is being called, with what arguments, and what the result was. -## Context Architecture: HarmonyContext vs. ParsableContext +## Context Architecture -A key design decision in the Responses API is how to manage the conversation state during multi-turn tool-calling loops. vLLM implements two context architectures to support different model families. +A key design decision in the Responses API is how to manage the conversation state during multi-turn tool-calling loops. vLLM implements the following context architectures to support different model families. ### HarmonyContext (GPT-OSS Models) -`HarmonyContext` is designed for GPT-OSS models that use OpenAI's Harmony message format. These models use a channel-based parsing system where the model's output is split into channels (`analysis` for reasoning, `commentary`, and `final` for the actual response). The context tracks messages in the Harmony `Message` format and uses the Harmony tokenizer's `render_for_completion()` to produce token IDs for the next turn. +`HarmonyContext` is designed for GPT-OSS models that use OpenAI's Harmony message format. See [OpenAI's harmony guide](https://developers.openai.com/cookbook/articles/openai-harmony) for more context. These models use a channel-based parsing system where the model's output is split into channels (`analysis` for reasoning, `commentary`, and `final` for the actual response). The context tracks messages in the Harmony `Message` format and uses the Harmony tokenizer's `render_for_completion()` to produce token IDs for the next turn. 
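To make this more concrete, here is a rough sketch of rendering a Harmony-format conversation into completion token IDs with the `openai_harmony` package. The class and method names follow the openai-harmony README rather than vLLM's internal wrapper, so treat this as an illustration of the message format, not vLLM's exact code path.

```python
# Illustrative sketch using the openai_harmony package (names taken from the
# openai-harmony README); vLLM wraps equivalent logic inside HarmonyContext.
from openai_harmony import (
    Conversation,
    HarmonyEncodingName,
    Message,
    Role,
    load_harmony_encoding,
)

encoding = load_harmony_encoding(HarmonyEncodingName.HARMONY_GPT_OSS)

# Build a minimal conversation in the Harmony Message format.
conversation = Conversation.from_messages(
    [Message.from_role_and_content(Role.USER, "What is the capital of France?")]
)

# Render the conversation into the token IDs that prime the next assistant
# turn; the model's reply comes back split into the `analysis`, `commentary`,
# and `final` channels described above.
token_ids = encoding.render_conversation_for_completion(conversation, Role.ASSISTANT)
print(len(token_ids))
```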
Key characteristics: - Uses `openai_harmony` message types (`Author`, `Message`, `Role`, `StreamState`, `TextContent`) @@ -163,7 +164,7 @@ Key characteristics: `StreamingHarmonyContext` extends this for token-by-token streaming, processing each token through the Harmony parser and tracking parser state transitions to emit the correct streaming events. -### ParsableContext (All Other Models) +### ParsableContext (MCP for All Other Models) `ParsableContext` is the context for non-GPT-OSS models (Llama, Mistral, Qwen, etc.). It uses vLLM's standard chat template system to render conversations and parses tool calls from the model output using configurable tool parsers. @@ -171,34 +172,29 @@ Key characteristics: - Uses `ResponseInputOutputItem` types from the OpenAI SDK (e.g., `ResponseFunctionToolCall`, `ResponseFunctionToolCallOutputItem`) - Tool calls are identified by the `name` field matching built-in tool names (`code_interpreter`, `web_search_preview`, `container`) - Prompt rendering uses vLLM's chat template system via `_render_next_turn()` -- Supports the `VLLM_USE_EXPERIMENTAL_PARSER_CONTEXT` environment variable to enable this context +- Use the `VLLM_USE_EXPERIMENTAL_PARSER_CONTEXT` environment variable to enable this context ### SimpleContext -For non-tool-calling scenarios, `SimpleContext` provides a lightweight context that accumulates raw text and token IDs without any parsing overhead. It is the default for models that do not have tool use enabled. +For non-MCP-tool-calling scenarios, `SimpleContext` provides a lightweight context that accumulates raw text and token IDs without any parsing overhead. It is the default for models that do not have tool use enabled. -### Choosing the Right Context +Eventually, all three context architectures will be merged into a single, unified context architecture. -The context is selected automatically based on the model type and request configuration: - -| Condition | Context Used | -|---|---| -| GPT-OSS model, streaming | `StreamingHarmonyContext` | -| GPT-OSS model, non-streaming | `HarmonyContext` | -| Non-GPT-OSS, tools enabled, experimental parser | `ParsableContext` | -| Non-GPT-OSS, no tools or simple request | `SimpleContext` | - -## Token Usage Tracking +## Metrics & Token Usage Tracking The Responses API provides detailed token usage information in the response, including: - `input_tokens`: Total prompt tokens across all turns - `output_tokens`: Total generated tokens - `input_tokens_details.cached_tokens`: Tokens served from the prefix cache -- `output_tokens_details.reasoning_tokens`: Tokens spent on reasoning (chain-of-thought) +- `output_tokens_details.reasoning_tokens`: Tokens spent on reasoning (chain-of-thought) ([PR](https://github.com/vllm-project/vllm/pull/33513)) For multi-turn tool-calling interactions, the `TurnMetrics` class tracks per-turn metrics. Tool output tokens are calculated as the difference between consecutive turns' prompt sizes minus the previous turn's output, capturing the token cost of tool results injected between turns. +## Evals + +With vLLM's ResponsesAPI implementation, we were able to replicate Kimi K2's HLE score of 23.9. We used the open source HLE test harness, used OpenAI's o3-mini as a judge. We also ran GPT-OSS against vLLM responsesAPI with MCP tools (including browser, python, and container). With high reasoning on GPT-OSS 120B, we achieved a score of 0.97 on AIME25, which matches OpenAI's GPTOSS model card. 
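Before moving on to setup, here is a minimal sketch of reading the usage fields listed in the Metrics & Token Usage Tracking section above via the OpenAI Python SDK. The attribute names follow the SDK's Responses usage types; availability of the detail fields can vary by model and server configuration.

```python
from openai import OpenAI

# Point the SDK at a local vLLM server; model name and API key are placeholders.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy-api-key")

response = client.responses.create(
    model="Qwen/Qwen3-8B",
    input="Explain quicksort in one paragraph.",
)

usage = response.usage
print("input tokens:", usage.input_tokens)
print("output tokens:", usage.output_tokens)
# Detail fields surface prefix-cache hits and reasoning-token counts.
print("cached tokens:", usage.input_tokens_details.cached_tokens)
print("reasoning tokens:", usage.output_tokens_details.reasoning_tokens)
```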
+ ## Getting Started To use the Responses API with vLLM: @@ -209,13 +205,6 @@ vllm serve Qwen/Qwen3-8B \ --reasoning-parser qwen3 \ --tool-call-parser qwen3 \ --enable-auto-tool-choice - -# With MCP tool server -# TODO: Kimi K2 here -vllm serve Qwen/Qwen3-8B \ - --enable-auto-tool-choice \ - --tool-call-parser hermes \ - --tool-server-url http://localhost:3001/sse ``` Then use the OpenAI Python SDK to make requests: @@ -260,6 +249,34 @@ response = client.responses.create( ) ``` +To use our MCP server: +```bash +# With MCP tool server +# see https://github.com/vllm-project/vllm/pull/29798 +VLLM_USE_EXPERIMENTAL_PARSER_CONTEXT=1 \ +vllm serve moonshotai/Kimi-K2-Thinking \ +--trust-remote-code \ +--tensor-parallel-size 8 \ +--enable-auto-tool-choice \ --tool-call-parser kimi_k2 \ +--reasoning-parser kimi_k2 \ +--tool-server=localhost:8081/container,localhost:8081/browser,localhost:8081/python + + +# with MCP calling +curl -X POST "http://localhost:8000/v1/responses" -H "Content-Type: application/json" -H "Authorization: Bearer dummy-api-key" -d '{ + "model": "moonshotai/Kimi-K2-Thinking", + "input": "Multiply 64548*15151 using the python tool.", + "tools": [ + { + "type": "mcp", + "server_label": "code_interpreter", + "headers": {"test": "test"}, + "server_url": "IGNORED" + } + ] + }' +``` + ## Future Work - **Offloading response storage & API Layer**: Currently, stored responses are held in memory with no eviction. We would like to support offloading responsesAPI state management to a third party database, and potentially offload the API layer outside of the core vLLM engine (see https://github.com/vllm-project/vllm/issues/26934) From bad389386aeefe4abf506bd4dc9f7a1a6f3643e0 Mon Sep 17 00:00:00 2001 From: Andrew Xia Date: Tue, 17 Mar 2026 10:09:27 -0700 Subject: [PATCH 6/7] mroe Signed-off-by: Andrew Xia --- ...ses-api.md => 2026-03-16-responses-api.md} | 28 ++++++------------- 1 file changed, 8 insertions(+), 20 deletions(-) rename _posts/{2026-03-11-responses-api.md => 2026-03-16-responses-api.md} (91%) diff --git a/_posts/2026-03-11-responses-api.md b/_posts/2026-03-16-responses-api.md similarity index 91% rename from _posts/2026-03-11-responses-api.md rename to _posts/2026-03-16-responses-api.md index f299e187..dfc00a10 100644 --- a/_posts/2026-03-11-responses-api.md +++ b/_posts/2026-03-16-responses-api.md @@ -11,12 +11,12 @@ This blog post covers: - **The Responses API implementation** in vLLM: endpoint design, streaming and non-streaming modes, and the full set of supported features - **MCP (Model Context Protocol) integration**: how vLLM connects to external tool servers and executes tool calls during generation -- **Two context architectures**: HarmonyContext for GPT-OSS models and ParsableContext for all other models +- **Context architectures**: HarmonyContext for GPT-OSS models and ParsableContext/SimpleContext for all other models ## The Responses API -**Responses API** is a modern interface for interacting with large language models that unifies text generation, multimodal inputs, and tool use into a single API primitive. Introduced as the successor to earlier interfaces like Chat Completions and Assistants, it provides a flexible abstraction for building agentic applications—allowing models to generate structured outputs, call tools, maintain conversation state, and integrate external data sources in one request. The API treats a “response” as the fundamental unit of interaction, combining inputs, model reasoning, tool calls, and outputs into a structured object. 
Developers can learn more about the official specification in the OpenAI Responses API documentation (https://developers.openai.com/api/reference/resources/responses -). At the same time, efforts like OpenResponses Initiative (https://www.openresponses.org/ +**Responses API** is a modern interface for interacting with large language models that unifies text generation, multimodal inputs, and tool use into a single API primitive. Introduced as the successor to earlier interfaces like Chat Completions and Assistants, it provides a flexible abstraction for building agentic applications—allowing models to generate structured outputs, call tools, maintain conversation state, and integrate external data sources in one request. The API treats a “response” as the fundamental unit of interaction, combining inputs, model reasoning, tool calls, and outputs into a structured object. Developers can learn more about the official specification in the OpenAI Responses [API documentation](https://developers.openai.com/api/reference/resources/responses +). At the same time, efforts like [OpenResponses Initiative](https://www.openresponses.org/ ) aim to define an open, provider-agnostic standard inspired by this interface, enabling interoperable tooling and reducing vendor lock-in across LLM platforms. ### Endpoint Overview @@ -68,11 +68,13 @@ curl -X POST http://localhost:8000/v1/responses \ }' ``` -See https://www.openresponses.org/specification#streaming for more details on the streaming protocol and the types of events that are streamed. +See [OpenResponses](https://www.openresponses.org/specification#streaming) for more details on the streaming protocol and the types of events that are streamed. ### Supported Features -The vLLM Responses API supports a broad set of features: +The vLLM Responses API supports a broad set of features. To see the vLLM specific implementation of responsesAPI: +- ResponsesRequest defined [here](https://github.com/vllm-project/vllm/blob/4ed51308c8826619459be858a6dc4333206f41c1/vllm/entrypoints/openai/responses/protocol.py#L140) +- ResponsesResponse is defined [here](https://github.com/vllm-project/vllm/blob/4ed51308c8826619459be858a6dc4333206f41c1/vllm/entrypoints/openai/responses/protocol.py#L468) **Tool Calling (Function and MCP)** Tools of type `function` can be defined in the `tools` list. The model's output is parsed for tool calls using configurable tool parsers (Hermes, Llama, Mistral, etc.). The `tool_choice` parameter supports `"auto"`, `"none"`, `"required"`, or a named function. When the model emits a function call, it is returned as a `function_call` output item with streaming events `response.function_call_arguments.delta` and `response.function_call_arguments.done`. @@ -80,17 +82,11 @@ TODO **Reasoning.** The `reasoning` parameter (with an `effort` field) enables chain-of-thought reasoning. Reasoning content is tracked separately from regular output and appears as `ResponseReasoningItem` output items. Streaming emits `response.reasoning_text.delta` and `response.reasoning_text.done` events, allowing clients to display the model's thinking process in real time (see https://github.com/vllm-project/vllm/pull/29947) -**Structured Output.** The `structured_outputs` field supports JSON Schema-constrained generation. When a JSON schema is provided, vLLM enforces the schema during decoding using guided generation, ensuring the output is valid JSON conforming to the specified schema. 
When a choice is specified, vLLM will only output in final output from the options listed. (see https://github.com/vllm-project/vllm/pull/33709). +**Structured Output.** The `structured_outputs` field supports JSON Schema-constrained generation. When a JSON schema is provided, vLLM enforces the schema during decoding using guided generation, ensuring the output is valid JSON conforming to the specified schema. When a choice is specified, vLLM will only output in final output from the options listed. (see this [PR](https://github.com/vllm-project/vllm/pull/33709) for more context). **Logprobs.** When `include` contains `"message.output_text.logprobs"`, the response includes per-token log probabilities. The `top_logprobs` parameter controls how many top alternatives are returned per token position. -**Background Mode.** Setting `background: true` (with `store: true`) queues the response for asynchronous generation. The response is returned immediately with status `"queued"` and can be polled or retrieved later via `GET /v1/responses/{response_id}`. Background responses can be cancelled via the cancel endpoint. - - - - **vLLM-Specific Extensions.** Beyond the standard API, vLLM adds parameters for `priority` (request scheduling priority), `cache_salt` (prefix cache isolation), `seed` (deterministic sampling), `repetition_penalty`, custom `stop` sequences, and `enable_response_messages` (returns raw prompt and output token IDs for debugging). -TODO: add link **Debugging** We also have implemented the ability to return raw input and output tokens for responsesAPI, you can enable this by using the `enable_response_messages` flag. (See https://github.com/vllm-project/vllm/pull/29549) @@ -114,14 +110,6 @@ vLLM provides two `ToolServer` implementations: **`MCPToolServer`**: Connects to external MCP-compatible tool servers over SSE. Multiple servers can be specified via comma-separated URLs. Each server exposes its tools via the MCP protocol, and vLLM discovers available tools at startup via `session.list_tools()`. Tool sessions are created per-request with unique session IDs. -```bash -# Starting vLLM with an MCP tool server -vllm serve Qwen/Qwen3-8B \ - --enable-auto-tool-choice \ - --tool-call-parser hermes \ - --tool-server-url http://localhost:3001/sse -``` - ### Agentic Loop The core of MCP integration is the agentic loop in `_generate_with_builtin_tools`. This loop: From 6fdb94ca112409d28811d425b28c4d5af17dd9a0 Mon Sep 17 00:00:00 2001 From: Andrew Xia Date: Tue, 17 Mar 2026 10:29:18 -0700 Subject: [PATCH 7/7] ready? Signed-off-by: Andrew Xia --- _posts/2026-03-16-responses-api.md | 56 ++++++++++++++++-------------- 1 file changed, 29 insertions(+), 27 deletions(-) diff --git a/_posts/2026-03-16-responses-api.md b/_posts/2026-03-16-responses-api.md index dfc00a10..9a7fc433 100644 --- a/_posts/2026-03-16-responses-api.md +++ b/_posts/2026-03-16-responses-api.md @@ -1,27 +1,25 @@ --- layout: post -title: "Enabling Responses API and MCP on vLLM" +title: "Enabling ResponsesAPI and MCP on vLLM" author: "Meta" image: /assets/logos/vllm-logo-text-light.png --- -The OpenAI **Responses API** is the successor to the Chat Completions API, designed to support agentic workflows with built-in tool use, multi-turn conversation management, and streaming. We have implemented the Responses API in vLLM, enabling any model served by vLLM to participate in agentic pipelines that call tools, execute code, search the web, and reason through complex tasks -- all through a single `POST /v1/responses` endpoint. 
+The OpenAI **ResponsesAPI** is the successor to the Chat Completions API, designed to support agentic workflows with built-in tool use, multi-turn conversation management, and streaming. We have implemented the ResponsesAPI in vLLM, enabling any model served by vLLM to participate in agentic pipelines that call tools, execute code, search the web, and reason through complex tasks -- all through a single `POST /v1/responses` endpoint. This blog post covers: -- **The Responses API implementation** in vLLM: endpoint design, streaming and non-streaming modes, and the full set of supported features +- **ResponsesAPI implementation** in vLLM: endpoint design, streaming and non-streaming modes, and the full set of supported features - **MCP (Model Context Protocol) integration**: how vLLM connects to external tool servers and executes tool calls during generation - **Context architectures**: HarmonyContext for GPT-OSS models and ParsableContext/SimpleContext for all other models -## The Responses API +## ResponsesAPI -**Responses API** is a modern interface for interacting with large language models that unifies text generation, multimodal inputs, and tool use into a single API primitive. Introduced as the successor to earlier interfaces like Chat Completions and Assistants, it provides a flexible abstraction for building agentic applications—allowing models to generate structured outputs, call tools, maintain conversation state, and integrate external data sources in one request. The API treats a “response” as the fundamental unit of interaction, combining inputs, model reasoning, tool calls, and outputs into a structured object. Developers can learn more about the official specification in the OpenAI Responses [API documentation](https://developers.openai.com/api/reference/resources/responses -). At the same time, efforts like [OpenResponses Initiative](https://www.openresponses.org/ -) aim to define an open, provider-agnostic standard inspired by this interface, enabling interoperable tooling and reducing vendor lock-in across LLM platforms. +**ResponsesAPI** is a modern interface for interacting with large language models that unifies text generation, multimodal inputs, and tool use into a single API primitive. Introduced as the successor to earlier interfaces like Chat Completions and Assistants, it provides a flexible abstraction for building agentic applications—allowing models to generate structured outputs, call tools, maintain conversation state, and integrate external data sources in one request. The API treats a “response” as the fundamental unit of interaction, combining inputs, model reasoning, tool calls, and outputs into a structured object. Developers can learn more about the official specification in the OpenAI Responses [API documentation](https://developers.openai.com/api/reference/resources/responses). At the same time, efforts like the [OpenResponses Initiative](https://www.openresponses.org/) aim to define an open, provider-agnostic standard inspired by this interface, enabling interoperable tooling and reducing vendor lock-in across LLM platforms. ### Endpoint Overview -vLLM exposes three endpoints under the Responses API: +vLLM exposes three endpoints under ResponsesAPI: | Endpoint | Method | Description | |---|---|---| @@ -40,7 +38,7 @@ curl -X POST http://localhost:8000/v1/responses \ -H "Content-Type: application/json" \ -d '{ "model": "Qwen/Qwen3-8B", - "input": "What is the capital of France?", + "input": "What is the capital of France?" 
}' ``` @@ -56,7 +54,7 @@ When `stream: true`, vLLM emits events as SSE with monotonically increasing sequ - `response.output_item.done` -- the output item is finalized 3. `response.completed` -- the full response with usage statistics -This event structure matches the OpenAI Responses API specification, making vLLM a drop-in replacement for clients already using the Responses API. +This event structure matches the OpenAI ResponsesAPI specification, making vLLM a drop-in replacement for clients already using ResponsesAPI. ```bash curl -X POST http://localhost:8000/v1/responses \ @@ -72,27 +70,31 @@ See [OpenResponses](https://www.openresponses.org/specification#streaming) for m ### Supported Features -The vLLM Responses API supports a broad set of features. To see the vLLM specific implementation of responsesAPI: +The vLLM ResponsesAPI supports a broad set of features. To see the vLLM-specific implementation of ResponsesAPI: - ResponsesRequest defined [here](https://github.com/vllm-project/vllm/blob/4ed51308c8826619459be858a6dc4333206f41c1/vllm/entrypoints/openai/responses/protocol.py#L140) - ResponsesResponse is defined [here](https://github.com/vllm-project/vllm/blob/4ed51308c8826619459be858a6dc4333206f41c1/vllm/entrypoints/openai/responses/protocol.py#L468) -**Tool Calling (Function and MCP)** Tools of type `function` can be defined in the `tools` list. The model's output is parsed for tool calls using configurable tool parsers (Hermes, Llama, Mistral, etc.). The `tool_choice` parameter supports `"auto"`, `"none"`, `"required"`, or a named function. When the model emits a function call, it is returned as a `function_call` output item with streaming events `response.function_call_arguments.delta` and `response.function_call_arguments.done`. +**Tool Calling (Function and MCP)** +ResponsesAPI distinguishes between two tool types: -TODO +- **Function tools** (`"type": "function"`): Tools defined inline with a JSON schema. The model generates a `function_call` output item, but the **client** is responsible for executing the function and returning the result. This follows a client-side execution model. See [this script](https://github.com/vllm-project/vllm/blob/4ed51308c8826619459be858a6dc4333206f41c1/examples/online_serving/openai_responses_client_with_tools.py) for an example of multi-turn function tools. +- **MCP tools** (`"type": "mcp"`): Tools hosted on external MCP servers. **vLLM itself** acts as the MCP client—it intercepts the tool call, executes it against the MCP server, injects the result back into the conversation, and continues generating. This happens entirely server-side within a single API request (see the [MCP section](#mcp-model-context-protocol-integration) below). See [this script](https://github.com/vllm-project/vllm/blob/4ed51308c8826619459be858a6dc4333206f41c1/examples/online_serving/openai_responses_client_with_mcp_tools.py) for an example of calling MCP tools with GPT-OSS. + +Tools can be defined in the `tools` list in the request. The model's output is parsed for tool calls using configurable tool parsers, which is set with the `--tool-call-parser` flag. **Reasoning.** The `reasoning` parameter (with an `effort` field) enables chain-of-thought reasoning. Reasoning content is tracked separately from regular output and appears as `ResponseReasoningItem` output items. 
Streaming emits `response.reasoning_text.delta` and `response.reasoning_text.done` events, allowing clients to display the model's thinking process in real time (see https://github.com/vllm-project/vllm/pull/29947) -**Structured Output.** The `structured_outputs` field supports JSON Schema-constrained generation. When a JSON schema is provided, vLLM enforces the schema during decoding using guided generation, ensuring the output is valid JSON conforming to the specified schema. When a choice is specified, vLLM will only output in final output from the options listed. (see this [PR](https://github.com/vllm-project/vllm/pull/33709) for more context). +**Structured Output.** The `structured_outputs` field supports JSON Schema-constrained generation. When a JSON schema is provided, vLLM enforces the schema during decoding using guided generation, ensuring the output is valid JSON conforming to the specified schema. When a choice is specified, vLLM will only produce output from the options listed. (see this [PR](https://github.com/vllm-project/vllm/pull/33709) for more context). **Logprobs.** When `include` contains `"message.output_text.logprobs"`, the response includes per-token log probabilities. The `top_logprobs` parameter controls how many top alternatives are returned per token position. **vLLM-Specific Extensions.** Beyond the standard API, vLLM adds parameters for `priority` (request scheduling priority), `cache_salt` (prefix cache isolation), `seed` (deterministic sampling), `repetition_penalty`, custom `stop` sequences, and `enable_response_messages` (returns raw prompt and output token IDs for debugging). -**Debugging** We also have implemented the ability to return raw input and output tokens for responsesAPI, you can enable this by using the `enable_response_messages` flag. (See https://github.com/vllm-project/vllm/pull/29549) +**Debugging.** We have also implemented the ability to return raw input and output tokens for ResponsesAPI. You can enable this by using the `enable_response_messages` flag. (See https://github.com/vllm-project/vllm/pull/29549) ## MCP: Model Context Protocol Integration -The **Model Context Protocol (MCP)** allows LLMs to call external tools during generation, with vLLM handling the tool calling instead of function tools, in which the client is responsible for handling tool calls. vLLM implements MCP as a first-class feature of the Responses API: when a model generates a tool call, vLLM intercepts it, calls the appropriate MCP tool server, and feeds the result back to the model for the next turn of generation -- all within a single API request. +The **Model Context Protocol (MCP)** allows LLMs to call external tools during generation, with vLLM handling the tool execution server-side -- unlike function tools, where the client is responsible for handling tool calls. vLLM implements MCP as a first-class feature of ResponsesAPI: when a model generates a tool call, vLLM intercepts it, calls the appropriate MCP tool server, and feeds the result back to the model for the next turn of generation -- all within a single API request. ### Built-in Tools @@ -106,7 +108,7 @@ vLLM supports three categories of built-in tools: ### Tool Server Architecture -vLLM provides two `ToolServer` implementations: +vLLM provides the following `ToolServer` implementation: **`MCPToolServer`**: Connects to external MCP-compatible tool servers over SSE. Multiple servers can be specified via comma-separated URLs. 
Each server exposes its tools via the MCP protocol, and vLLM discovers available tools at startup via `session.list_tools()`. Tool sessions are created per-request with unique session IDs. @@ -138,7 +140,7 @@ This allows clients to display the model's tool interactions in real time, showi ## Context Architecture -A key design decision in the Responses API is how to manage the conversation state during multi-turn tool-calling loops. vLLM implements the following context architectures to support different model families. +A key design decision in the ResponsesAPI is how to manage the conversation state during multi-turn tool-calling loops. vLLM implements the following context architectures to support different model families. ### HarmonyContext (GPT-OSS Models) @@ -170,7 +172,7 @@ Eventually, all three context architectures will be merged into a single, unifie ## Metrics & Token Usage Tracking -The Responses API provides detailed token usage information in the response, including: +ResponsesAPI provides detailed token usage information in the response, including: - `input_tokens`: Total prompt tokens across all turns - `output_tokens`: Total generated tokens @@ -181,11 +183,11 @@ For multi-turn tool-calling interactions, the `TurnMetrics` class tracks per-tur ## Evals -With vLLM's ResponsesAPI implementation, we were able to replicate Kimi K2's HLE score of 23.9. We used the open source HLE test harness, used OpenAI's o3-mini as a judge. We also ran GPT-OSS against vLLM responsesAPI with MCP tools (including browser, python, and container). With high reasoning on GPT-OSS 120B, we achieved a score of 0.97 on AIME25, which matches OpenAI's GPTOSS model card. +With vLLM's ResponsesAPI implementation, we were able to replicate Kimi K2's HLE score of 23.9. We used the open-source HLE test harness with OpenAI's o3-mini as a judge. We also ran GPT-OSS against the vLLM ResponsesAPI with MCP tools (including browser, python, and container). With high reasoning on GPT-OSS 120B, we achieved a score of 0.97 on AIME 2025, which matches OpenAI's GPT-OSS model card. ## Getting Started -To use the Responses API with vLLM: +To use the ResponsesAPI with vLLM: ```bash # with Tool calling and reasoning @@ -245,7 +247,8 @@ VLLM_USE_EXPERIMENTAL_PARSER_CONTEXT=1 \ vllm serve moonshotai/Kimi-K2-Thinking \ --trust-remote-code \ --tensor-parallel-size 8 \ ---enable-auto-tool-choice \ --tool-call-parser kimi_k2 \ +--enable-auto-tool-choice \ +--tool-call-parser kimi_k2 \ --reasoning-parser kimi_k2 \ --tool-server=localhost:8081/container,localhost:8081/browser,localhost:8081/python @@ -267,16 +270,15 @@ curl -X POST "http://localhost:8000/v1/responses" -H "Content-Type: applicatio ## Future Work -- **Offloading response storage & API Layer**: Currently, stored responses are held in memory with no eviction. We would like to support offloading responsesAPI state management to a third party database, and potentially offload the API layer outside of the core vLLM engine (see https://github.com/vllm-project/vllm/issues/26934) +- **Offloading response storage & API Layer**: Currently, stored responses are held in memory with no eviction. We would like to support offloading ResponsesAPI state management to a third-party database, and potentially offload the API layer outside of the core vLLM engine (see https://github.com/vllm-project/vllm/issues/26934) - **Expanded tool support**: Add MCP support for all tools. 
(See https://github.com/vllm-project/vllm/issues/30115)
- **Open Responses Conformity**: [OpenResponses](https://www.openresponses.org/), launched in January 2026, is an open initiative to standardize a vendor-neutral API for LLM interactions based on the ResponsesAPI abstraction, covering structured outputs, tool calls, multimodal inputs, and reasoning traces. For vLLM, supporting OpenResponses enables compatibility with a growing ecosystem of agent frameworks and SDKs while giving users a portable interface that works across both hosted APIs and open-source model deployments.

To see more details about future work and explore opportunities to contribute, please see the vLLM feature development map for H1 2026: https://github.com/vllm-project/vllm/issues/34857

## Acknowledgements

This work was a collaboration across the vLLM community. Thanks to all contributors who helped design and implement ResponsesAPI and MCP integration, including the following (but not limited to):

**Meta**: Andrew Xia, Daniel Salib, Ye Hu, Zhiwei Zhao, Alec Solder, Ye (Charlotte) Qi