Conversation
Pull request overview
Updates the continuous batching “long context” demo to use openai/gpt-oss-20b and expands the demo documentation with CPU/GPU/NPU benchmark instructions and reported results.
Changes:
- Switch the long-context dataset generation tokenizer default model to openai/gpt-oss-20b.
- Refresh the long context demo README to document the new model export/start flow and add CPU/GPU/NPU benchmark results and comparison tables.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| demos/continuous_batching/long_context/custom_dataset.py | Updates default --model_name used by the tokenizer for dataset generation. |
| demos/continuous_batching/long_context/README.md | Reworks the demo instructions for openai/gpt-oss-20b, adds tabbed CPU/GPU/NPU sections, benchmark outputs, and updated comparison tables. |
Comments suppressed due to low confidence (2)
demos/continuous_batching/long_context/README.md:28
- In the MyST tab-set directive options, `:sync:` should be written as `:sync: <value>` (with a space after the colon). `:sync:CPU` may not be parsed as an option by Sphinx/MyST and could break tab syncing.
```
:::{tab-item} CPU and GPU
:sync:CPU
```
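Applying the fix the reviewer describes (a space after the option colon), the snippet would read:

```markdown
:::{tab-item} CPU and GPU
:sync: CPU
:::
```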
demos/continuous_batching/long_context/README.md:83
- The command now generates a dataset with `--limit_context_tokens 5000`, but the following sentence still says the context is limited to 50000 tokens. Update the text to match the new default/example so readers don't misinterpret the benchmark setup.
```
python custom_dataset.py --limit_context_tokens 5000
```
It will create a file called dataset.jsonl with 10 requests of shared context body limited to 50000 tokens.
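For context, a minimal sketch of what such a generator does. This is a hypothetical simplification: the real `custom_dataset.py` counts tokens with the `--model_name` tokenizer (openai/gpt-oss-20b), while here whitespace-split words stand in for tokenizer output, and the `build_dataset` name and record fields are illustrative.

```python
import json

def build_dataset(context_text, limit_context_tokens, num_requests=10,
                  path="dataset.jsonl"):
    # Whitespace "tokens" stand in for real tokenizer output
    # (the actual script tokenizes with the --model_name model).
    tokens = context_text.split()[:limit_context_tokens]
    shared_context = " ".join(tokens)
    # Every request shares the same context body, so prefix caching
    # can reuse the KV cache computed for the first request.
    with open(path, "w") as f:
        for i in range(num_requests):
            record = {
                "context": shared_context,
                "question": f"Question {i} about the shared context.",
            }
            f.write(json.dumps(record) + "\n")
    return path
```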
For example:
```diff
-lm-eval --model local-chat-completions --tasks longbench_gov_report --model_args model=Qwen/Qwen2.5-7B-Instruct-1M,base_url=http://localhost:8000/v3/chat/completions,num_concurrent=10,tokenized_requests=False,timeout=3000 --verbosity DEBUG --seed 1 --apply_chat_template
+lm-eval --model local-chat-completions --tasks longbench_gov_report --model_args model=OpenVINO/gpt-oss-20b-int4-ov,base_url=http://localhost:8000/v3/chat/completions,num_concurrent=10,tokenized_requests=False,timeout=3000 --verbosity DEBUG --seed 1 --apply_chat_template
```
this benchmark lasts very long, I would consider dropping it, changing the task, or limiting it
ok, let's drop accuracy testing from this page. It will be enough to show KV cache usage depending on cache precision.
We should mention the almost identical results and add a link to an article about accuracy testing.
```
:::
::::

## Dataset for experiments
```
this section can be dropped if we switch to vllm benchmarking using the latest version.
To simplify execution in the demo, it can be limited to Linux and the Docker image:
```
docker run --entrypoint vllm vllm/vllm-openai:v0.16.0 bench serve --backend openai --base-url http://localhost:8000/ --endpoint v3/completions --model ovms-model --tokenizer openai/gpt-oss-20b --prefix-repetition-prefix-len 50000 --prefix-repetition-suffix-len 10 --prefix-repetition-output-len 100 --prefix-repetition-num-prefixes 1 --num-prompts 10 --max_concurrency 1 --dataset-name prefix_repetition --num-warmups 1 --seed 0
```
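To read those flags, a quick sketch of the load this command generates, under the assumed prefix_repetition semantics (each prompt is one shared prefix plus a unique suffix; the variable names are illustrative, not vllm's):

```python
# Flag values from the docker command above.
prefix_len = 50000     # --prefix-repetition-prefix-len (shared tokens)
suffix_len = 10        # --prefix-repetition-suffix-len (unique per prompt)
output_len = 100       # --prefix-repetition-output-len
num_prefixes = 1       # --prefix-repetition-num-prefixes: all prompts share one prefix
num_prompts = 10       # --num-prompts

prompt_tokens_each = prefix_len + suffix_len   # input tokens per request
total_input = prompt_tokens_each * num_prompts
total_output = output_len * num_prompts
print(prompt_tokens_each, total_input, total_output)  # 50010 500100 1000
```

With `num_prefixes = 1`, all 10 prompts reuse the same 50000-token prefix, which is exactly the case where prefix caching should pay off.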
Platform: Intel(R) Core(TM) Ultra 5 338H

| Context Length (tokens) | TTFT No Caching (ms) | TTFT Prefix Caching (ms) |
|-------------------------|----------------------|--------------------------|
| 500                     | 1521.75              | 1489.22                  |
is the gain so small with short prompts?
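The row quoted above does imply only a small relative gain; as a quick check:

```python
# TTFT values (ms) for the 500-token row in the table above.
ttft_no_caching = 1521.75
ttft_prefix_caching = 1489.22

# Relative improvement from prefix caching at this context length.
gain_pct = (ttft_no_caching - ttft_prefix_caching) / ttft_no_caching * 100
print(f"TTFT gain at 500 tokens: {gain_pct:.1f}%")  # prints "TTFT gain at 500 tokens: 2.1%"
```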
`::::{tab-set}`
while there is a comparison table, we could drop those results for 5k tokens
🛠 Summary
CVS-176666
Changing the model used in the long context demo to gpt-oss-20b, adding GPU and NPU benchmarks.
🧪 Checklist