
Long context demo improvements #4027

Open
przepeck wants to merge 9 commits into main from przepeck/long_context_demo

Conversation

@przepeck
Collaborator

@przepeck przepeck commented Mar 2, 2026

🛠 Summary

CVS-176666
Changing model used in long context demo to gpt-oss-20b, adding GPU and NPU benchmarks.

🧪 Checklist

  • Unit tests added.
  • The documentation updated.
  • Change follows security best practices.

Copilot AI review requested due to automatic review settings March 2, 2026 06:40
Contributor

Copilot AI left a comment


Pull request overview

Updates the continuous batching “long context” demo to use openai/gpt-oss-20b and expands the demo documentation with CPU/GPU/NPU benchmark instructions and reported results.

Changes:

  • Switch long-context dataset generation tokenizer default model to openai/gpt-oss-20b.
  • Refresh the long context demo README to document the new model export/start flow and add CPU/GPU/NPU benchmark results + comparison tables.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
|------|-------------|
| demos/continuous_batching/long_context/custom_dataset.py | Updates the default `--model_name` used by the tokenizer for dataset generation. |
| demos/continuous_batching/long_context/README.md | Reworks the demo instructions for openai/gpt-oss-20b, adds tabbed CPU/GPU/NPU sections, benchmark outputs, and updated comparison tables. |
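The custom_dataset.py change amounts to swapping the tokenizer's default model id. A minimal sketch of what such a CLI default might look like (the argument names other than `--model_name` are assumptions for illustration, not the demo's actual interface):

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical sketch: the PR moves the default tokenizer model
    # to openai/gpt-oss-20b.
    parser = argparse.ArgumentParser(
        description="Generate a long-context dataset (illustrative sketch)"
    )
    parser.add_argument(
        "--model_name",
        default="openai/gpt-oss-20b",
        help="HF model id used to load the tokenizer",
    )
    parser.add_argument(
        "--limit_context_tokens",
        type=int,
        default=5000,
        help="cap on the shared context length in tokens (assumed default)",
    )
    return parser


args = build_parser().parse_args(["--limit_context_tokens", "5000"])
# args.model_name is "openai/gpt-oss-20b" unless overridden on the CLI
```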
Comments suppressed due to low confidence (2)

demos/continuous_batching/long_context/README.md:28

  • In the MyST tab-set directive options, :sync: should be written as :sync: <value> (with a space after the colon). :sync:CPU may not be parsed as an option by Sphinx/MyST and could break tab syncing.
:::{tab-item} CPU and GPU
:sync:CPU
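For reference, the corrected form the comment suggests (a space after the option name, per MyST directive-option syntax) would be:

```markdown
:::{tab-item} CPU and GPU
:sync: CPU
:::
```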

demos/continuous_batching/long_context/README.md:83

  • The command now generates a dataset with --limit_context_tokens 5000, but the following sentence still says the context is limited to 50000 tokens. Update the text to match the new default/example so readers don't misinterpret the benchmark setup.
python custom_dataset.py --limit_context_tokens 5000

It will create a file called dataset.jsonl with 10 requests of shared context body limited to 50000 tokens.


@przepeck przepeck changed the title [WIP] Long context demo improvements Long context demo improvements Mar 2, 2026
@przepeck przepeck requested a review from dkalinowski March 2, 2026 10:06
For example:
```
lm-eval --model local-chat-completions --tasks longbench_gov_report --model_args model=Qwen/Qwen2.5-7B-Instruct-1M,base_url=http://localhost:8000/v3/chat/completions,num_concurrent=10,tokenized_requests=False,timeout=3000 --verbosity DEBUG --seed 1 --apply_chat_template
lm-eval --model local-chat-completions --tasks longbench_gov_report --model_args model=OpenVINO/gpt-oss-20b-int4-ov,base_url=http://localhost:8000/v3/chat/completions,num_concurrent=10,tokenized_requests=False,timeout=3000 --verbosity DEBUG --seed 1 --apply_chat_template
```

Collaborator Author

this benchmark lasts very long, I would consider dropping it, changing the task, or limiting it

Collaborator

ok, let's drop accuracy testing from this page. It will be enough to show KV cache usage depending on cache precision.
We should mention the almost identical results and add a link to an article about accuracy testing

:::
::::

## Dataset for experiments
Collaborator

this section can be dropped if we switch to vLLM benchmarking using the latest version.
To simplify execution in the demo, it can be limited to Linux and the docker image:
docker run --entrypoint vllm vllm/vllm-openai:v0.16.0 bench serve --backend openai --base-url http://localhost:8000/ --endpoint v3/completions --model ovms-model --tokenizer openai/gpt-oss-20b --prefix-repetition-prefix-len 50000 --prefix-repetition-suffix-len 10 --prefix-repetition-output-len 100 --prefix-repetition-num-prefixes 1 --num-prompts 10 --max_concurrency 1 --dataset-name prefix_repetition --num-warmups 1 --seed 0


Platform: Intel(R) Core(TM) Ultra 5 338H
| Context Length (tokens) | TTFT No Caching (ms) | TTFT Prefix Caching (ms) |
|------------------------|------------------|---------------------|
| 500 | 1521.75 | 1489.22 |
Collaborator

is the gain really so small with short prompts?

::::{tab-set}
Collaborator

while there is a comparison table, we could drop those results for 5k tokens
