Conversation
Pull request overview
Updates the continuous batching “long context” demo to use openai/gpt-oss-20b and expands the demo documentation with CPU/GPU/NPU benchmark instructions and reported results.
Changes:
- Switch the long-context dataset generation tokenizer default model to openai/gpt-oss-20b.
- Refresh the long context demo README to document the new model export/start flow and add CPU/GPU/NPU benchmark results and comparison tables.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| demos/continuous_batching/long_context/custom_dataset.py | Updates default --model_name used by the tokenizer for dataset generation. |
| demos/continuous_batching/long_context/README.md | Reworks the demo instructions for openai/gpt-oss-20b, adds tabbed CPU/GPU/NPU sections, benchmark outputs, and updated comparison tables. |
Comments suppressed due to low confidence (2)
demos/continuous_batching/long_context/README.md:28
- In the MyST tab-set directive options, `:sync:` should be written as `:sync: <value>` (with a space after the colon). `:sync:CPU` may not be parsed as an option by Sphinx/MyST and could break tab syncing.
```
:::{tab-item} CPU and GPU
:sync:CPU
```
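Applying the fix the reviewer describes (a space after the option colon), the snippet would read:

```markdown
:::{tab-item} CPU and GPU
:sync: CPU
:::
```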
demos/continuous_batching/long_context/README.md:83
- The command now generates a dataset with `--limit_context_tokens 5000`, but the following sentence still says the context is limited to 50000 tokens. Update the text to match the new default/example so readers don't misinterpret the benchmark setup.
```
python custom_dataset.py --limit_context_tokens 5000
```
It will create a file called dataset.jsonl with 10 requests of shared context body limited to 50000 tokens.
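For context, a minimal sketch of what such a generator does. This is a hypothetical simplification: the real `custom_dataset.py` counts tokens with the `--model_name` tokenizer (openai/gpt-oss-20b), while here whitespace-split words stand in for tokenizer output, and the `build_dataset` name and record fields are illustrative.

```python
import json

def build_dataset(context_text, limit_context_tokens, num_requests=10,
                  path="dataset.jsonl"):
    # Whitespace "tokens" stand in for real tokenizer output
    # (the actual script tokenizes with the --model_name model).
    tokens = context_text.split()[:limit_context_tokens]
    shared_context = " ".join(tokens)
    # Every request shares the same context body, so prefix caching
    # can reuse the KV cache computed for the first request.
    with open(path, "w") as f:
        for i in range(num_requests):
            record = {
                "context": shared_context,
                "question": f"Question {i} about the shared context.",
            }
            f.write(json.dumps(record) + "\n")
    return path
```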
For example:
```diff
-lm-eval --model local-chat-completions --tasks longbench_gov_report --model_args model=Qwen/Qwen2.5-7B-Instruct-1M,base_url=http://localhost:8000/v3/chat/completions,num_concurrent=10,tokenized_requests=False,timeout=3000 --verbosity DEBUG --seed 1 --apply_chat_template
+lm-eval --model local-chat-completions --tasks longbench_gov_report --model_args model=OpenVINO/gpt-oss-20b-int4-ov,base_url=http://localhost:8000/v3/chat/completions,num_concurrent=10,tokenized_requests=False,timeout=3000 --verbosity DEBUG --seed 1 --apply_chat_template
```
this benchmark lasts very long, I would consider dropping it, changing the task, or limiting it
ok, let's drop accuracy testing from this page. It will be enough to show KV cache usage depending on cache precision.
We should mention the almost identical results and add a link to an article about accuracy testing.
```
:::
::::

## Dataset for experiments
```
this section can be dropped if we switch to vllm benchmarking using the latest version.
To simplify execution in the demo, it can be limited to Linux and the Docker image:
```
docker run --entrypoint vllm vllm/vllm-openai:v0.16.0 bench serve --backend openai --base-url http://localhost:8000/ --endpoint v3/completions --model ovms-model --tokenizer openai/gpt-oss-20b --prefix-repetition-prefix-len 50000 --prefix-repetition-suffix-len 10 --prefix-repetition-output-len 100 --prefix-repetition-num-prefixes 1 --num-prompts 10 --max_concurrency 1 --dataset-name prefix_repetition --num-warmups 1 --seed 0
```
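To read those flags, a quick sketch of the load this command generates, under the assumed prefix_repetition semantics (each prompt is one shared prefix plus a unique suffix; the variable names are illustrative, not vllm's):

```python
# Flag values from the docker command above.
prefix_len = 50000     # --prefix-repetition-prefix-len (shared tokens)
suffix_len = 10        # --prefix-repetition-suffix-len (unique per prompt)
output_len = 100       # --prefix-repetition-output-len
num_prefixes = 1       # --prefix-repetition-num-prefixes: all prompts share one prefix
num_prompts = 10       # --num-prompts

prompt_tokens_each = prefix_len + suffix_len   # input tokens per request
total_input = prompt_tokens_each * num_prompts
total_output = output_len * num_prompts
print(prompt_tokens_each, total_input, total_output)  # 50010 500100 1000
```

With `num_prefixes = 1`, all 10 prompts reuse the same 50000-token prefix, which is exactly the case where prefix caching should pay off.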
Platform: Intel(R) Core(TM) Ultra 5 338H

| Context Length (tokens) | TTFT No Caching (ms) | TTFT Prefix Caching (ms) |
|-------------------------|----------------------|--------------------------|
| 500                     | 1521.75              | 1489.22                  |
is the gain so small with short prompts?
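The row quoted above does imply only a small relative gain; as a quick check:

```python
# TTFT values (ms) for the 500-token row in the table above.
ttft_no_caching = 1521.75
ttft_prefix_caching = 1489.22

# Relative improvement from prefix caching at this context length.
gain_pct = (ttft_no_caching - ttft_prefix_caching) / ttft_no_caching * 100
print(f"TTFT gain at 500 tokens: {gain_pct:.1f}%")  # prints "TTFT gain at 500 tokens: 2.1%"
```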
`::::{tab-set}`
while there is a comparison table, we could drop those results for 5k tokens
🛠 Summary
CVS-176666
Changing the model used in the long context demo to gpt-oss-20b, adding GPU and NPU benchmarks.
🧪 Checklist