Merged
2 changes: 1 addition & 1 deletion docs/advanced_topics.md
@@ -13,7 +13,7 @@ ovms_extras_nginx-mtls-auth-readme
```

## CPU Extensions
- Implement any CPU layer, that is not support by OpenVINO yet, as a shared library.
+ Implement any CPU layer, that is not supported by OpenVINO yet, as a shared library.

[Learn more](../src/example/SampleCpuExtension/README.md)

2 changes: 1 addition & 1 deletion docs/clients_genai.md
@@ -16,7 +16,7 @@ Speech to text API <ovms_docs_rest_api_s2t>
Text to speech API <ovms_docs_rest_api_t2s>
```
## Introduction
- Beside Tensorflow Serving API (`/v1`) and KServe API (`/v2`) frontends, the model server supports a range of endpoints for generative use cases (`v3`). They are extendible using MediaPipe graphs.
+ Besides TensorFlow Serving API (`/v1`) and KServe API (`/v2`) frontends, the model server supports a range of endpoints for generative use cases (`v3`). They are extendible using MediaPipe graphs.
Currently supported endpoints are:

OpenAI compatible endpoints:
2 changes: 1 addition & 1 deletion docs/deploying_server_kubernetes.md
@@ -61,7 +61,7 @@ Note that using s3 or minio bucket requires configuring credentials like describ

## Deprecation notice about OpenVINO operator

- The dedicated [operator for OpenVINO]((https://operatorhub.io/operator/ovms-operator)) is now deprecated. KServe operator can now support all OVMS use cases including generative models. It provides wider set of features and configuration options. Because KServe is commonly used for other serving runtimes, it gives easier transition and transparent migration.
+ The dedicated [operator for OpenVINO](https://operatorhub.io/operator/ovms-operator) is now deprecated. KServe operator can now support all OVMS use cases including generative models. It provides wider set of features and configuration options. Because KServe is commonly used for other serving runtimes, it gives easier transition and transparent migration.

## Additional Resources

4 changes: 2 additions & 2 deletions docs/legacy.md
@@ -10,12 +10,12 @@ ovms_docs_dag
```

## Stateful models
- Implement any CPU layer, that is not support by OpenVINO yet, as a shared library.
+ Implement any CPU layer, that is not supported by OpenVINO yet, as a shared library.
[Learn more](./stateful_models.md)
**Note:** The use cases from this feature can be addressed in MediaPipe graphs including generative use cases.

## DAG pipelines
The Directed Acyclic Graph (DAG) Scheduler for creating pipeline of models for execution in a single client request.
- [Learn model](./dag_scheduler.md)
+ [Learn more](./dag_scheduler.md)
**Note:** MediaPipe graphs can be a more flexible of pipelines scheduler which can employ various data formats and accelerators.

6 changes: 3 additions & 3 deletions docs/llm/reference.md
@@ -2,7 +2,7 @@

## Overview

- With rapid development of generative AI, new techniques and algorithms for performance optimization and better resource utilization are introduced to make best use of the hardware and provide best generation performance. OpenVINO implements those state of the art methods in it's [GenAI Library](https://github.com/openvinotoolkit/openvino.genai) like:
+ With rapid development of generative AI, new techniques and algorithms for performance optimization and better resource utilization are introduced to make best use of the hardware and provide best generation performance. OpenVINO implements those state of the art methods in its [GenAI Library](https://github.com/openvinotoolkit/openvino.genai) like:
- Continuous Batching
- Paged Attention
- Dynamic Split Fuse
@@ -22,7 +22,7 @@ The servable types are:
- Visual Language Model Stateful.

First part - Language Model / Visual Language Model - determines whether servable accepts only text or both text and images on the input.
- Seconds part - Continuous Batching / Stateful - determines what kind of GenAI pipeline is used as the engine. By default CPU and GPU devices work on Continuous Batching pipelines. NPU device works only on Stateful servable type.
+ Second part - Continuous Batching / Stateful - determines what kind of GenAI pipeline is used as the engine. By default CPU and GPU devices work on Continuous Batching pipelines. NPU device works only with the Stateful servable type.

User does not have to explicitly select servable type. It is inferred based on model directory contents and selected target device.
Model directory contents determine if model can work only with text or visual input as well. As for target device, setting it to `NPU` will always pick Stateful servable, while any other device will result in deploying Continuous Batching servable.
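
The selection rule described above (NPU always picks the Stateful servable, any other device picks Continuous Batching) can be sketched as a small illustrative helper. The function and label names below are hypothetical, not actual OVMS code:

```python
def pick_servable(target_device: str, accepts_images: bool) -> str:
    """Illustrative sketch of servable-type inference (not actual OVMS code)."""
    # NPU works only with the Stateful pipeline; other devices use Continuous Batching.
    engine = "Stateful" if target_device.upper() == "NPU" else "Continuous Batching"
    # Model directory contents determine text-only vs. text-and-image input.
    family = "Visual Language Model" if accepts_images else "Language Model"
    return f"{family} {engine}"
```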
@@ -354,7 +354,7 @@ Check [tested models](https://github.com/openvinotoolkit/openvino.genai/blob/mas

### Completions

- When sending a request to `/completions` endpoint, model server adds `bos_token_id` during tokenization, so **there is not need to add `bos_token` to the prompt**.
+ When sending a request to `/completions` endpoint, model server adds `bos_token_id` during tokenization, so **there is no need to add `bos_token` to the prompt**.
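
A minimal request-payload sketch for the completions endpoint; the model name, port, and endpoint path below are illustrative assumptions, and the prompt is plain text with no `bos_token` prepended:

```python
import json

# Hypothetical model name; adjust to your deployment.
payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "prompt": "OpenVINO Model Server is",  # plain text, no leading bos_token
    "max_tokens": 32,
}
body = json.dumps(payload)
# Example call (port assumed):
# requests.post("http://localhost:8000/v3/completions", data=body,
#               headers={"Content-Type": "application/json"})
```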

### Chat Completions

8 changes: 4 additions & 4 deletions docs/mediapipe.md
@@ -65,15 +65,15 @@ Following table lists supported tag and packet types in pbtxt graph definition:
|pbtxt line|input/output|tag|packet type|stream name|
|:---|:---|:---|:---|:---|
|input_stream: "a"|input|none|ov::Tensor|a|
- |output_stream: "b"|input|none|ov::Tensor|b|
+ |output_stream: "b"|output|none|ov::Tensor|b|
|input_stream: "IMAGE:a"|input|IMAGE|mediapipe::ImageFrame|a|
|output_stream: "IMAGE:b"|output|IMAGE|mediapipe::ImageFrame|b|
- |input_stream: "OVTENSOR:a"|output|OVTENSOR|ov::Tensor|a|
+ |input_stream: "OVTENSOR:a"|input|OVTENSOR|ov::Tensor|a|
|output_stream: "OVTENSOR:b"|output|OVTENSOR|ov::Tensor|b|
|input_stream: "REQUEST:req"|input|REQUEST|KServe inference::ModelInferRequest|req|
|output_stream: "RESPONSE:res"|output|RESPONSE|KServe inference::ModelInferResponse|res|

- In case of missing tag OpenVINO Model Server assumes that the packet type is `ov::Tensor'. The stream name can be arbitrary but the convention is to use a lower case word.
+ In case of missing tag OpenVINO Model Server assumes that the packet type is `ov::Tensor`. The stream name can be arbitrary but the convention is to use a lowercase word.
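
A minimal graph-definition fragment illustrating the tag conventions above; the stream and calculator names are hypothetical placeholders:

```
input_stream: "OVTENSOR:in"
output_stream: "OVTENSOR:out"
node {
  calculator: "SomeInferenceCalculator"
  input_stream: "OVTENSOR:in"
  output_stream: "OVTENSOR:out"
}
```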

The required data layout for the MediaPipe `IMAGE` conversion is HWC and the supported precisions are:
|Datatype|Allowed number of channels|
@@ -110,7 +110,7 @@ client.async_stream_infer(
```

### List of default calculators
- Beside OpenVINO inference calculators, model server public docker image also includes all the calculators used in the enabled demos.
+ Besides OpenVINO inference calculators, model server public docker image also includes all the calculators used in the enabled demos.
The list of all included calculators, subgraphs, input/output stream handler is reported in the model server is started with extra parameter `--log_level TRACE`.

### CPU and GPU execution
6 changes: 3 additions & 3 deletions docs/models_repository_graph.md
@@ -1,10 +1,10 @@
# Graphs Repository {#ovms_docs_models_repository_graph}

Model server can deploy a pipelines of models and nodes for any complex and custom transformations.
- From the client perspective of behaves almost like a single model but it more flexible and configurable.
+ From the client perspective it behaves almost like a single model, but it is more flexible and configurable.

The model repository employing graphs is similar in the structure to [classic models](./models_repository_classic.md).
- It needs to include the collection of models used in the pipeline. It also require a MediaPipe graph definition file in .pbtxt format.
+ It needs to include the collection of models used in the pipeline. It also requires a MediaPipe graph definition file in .pbtxt format.

```
graph_models
...
└── config.json
```

- In can the graph includes python nodes, there should be included also a python file with the node implementation.
+ In case the graph includes python nodes, there should be included also a python file with the node implementation.
Copilot AI commented on Feb 27, 2026:

> This line uses inconsistent capitalization and awkward phrasing (β€œpython nodes”, β€œincluded also a python file”). Consider using β€œPython nodes” / β€œPython file” and rephrasing to a more direct construction (e.g., β€œthe repository should also include a Python file implementing the node”).

For more information on how to use MediaPipe graphs, refer to the [article](./mediapipe.md).
2 changes: 1 addition & 1 deletion docs/performance_tuning.md
@@ -146,7 +146,7 @@ $ cpupower frequency-set --min 3.1GHz

## Network Configuration for Optimal Performance

- By default, OVMS endpoints are bound to all ipv4 addresses. On same systems, which route localhost name to ipv6 address, it might cause extra time on the client side to switch to ipv4. It can effectively results with extra 1-2s latency.
+ By default, OVMS endpoints are bound to all ipv4 addresses. On same systems, which route localhost name to ipv6 address, it might cause extra time on the client side to switch to ipv4. It can effectively result in extra 1-2s latency.
Copilot AI commented on Feb 27, 2026:

> This sentence has a couple of remaining issues: β€œipv4/ipv6” should be capitalized as β€œIPv4/IPv6”, and β€œOn same systems” should likely be β€œOn some systems” (current wording reads incorrect).
It can be overcome by switching the API URL to `http://127.0.0.1` on the client side.
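
The name-resolution behavior behind this advice can be demonstrated with the standard library; this is a sketch using an arbitrary port number, not OVMS-specific code:

```python
import socket

# "localhost" may resolve to both IPv6 (::1) and IPv4 (127.0.0.1); a client
# that tries IPv6 first and falls back adds latency when the server listens
# on IPv4 only.
localhost_families = {info[0] for info in socket.getaddrinfo("localhost", 9000)}

# A literal IPv4 address resolves to AF_INET only, so no fallback can occur.
ipv4_families = {info[0] for info in socket.getaddrinfo("127.0.0.1", 9000)}
```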

To optimize network connection performance:
2 changes: 1 addition & 1 deletion docs/security_considerations.md
@@ -33,7 +33,7 @@ OVMS supports multimodal models with image inputs provided as URL. However, to p
OpenVINO Model Server has a set of mechanisms preventing denial of service attacks from the client applications. They include the following:
- setting the number of inference execution streams which can limit the number of parallel inference calls in progress for each model. It can be tuned with `NUM_STREAMS` or `PERFORMANCE_HINT` plugin config.
- setting the maximum number of gRPC threads which is, by default, configured to the number 8 * number_of_cores. It can be changed with the parameter `--grpc_max_threads`.
- - setting the maximum number of REST workers which is, be default, configured to the number 4 * number_of_cores. It can be changed with the parameter `--rest_workers`.
+ - setting the maximum number of REST workers which is, by default, configured to the number 4 * number_of_cores. It can be changed with the parameter `--rest_workers`.
- maximum size of REST and GRPC message which is 1GB - bigger messages will be rejected
- setting max_concurrent_streams which defines how many concurrent threads can be initiated from a single client - the remaining will be queued. The default is equal to the number of CPU cores. It can be changed with the `--grpc_channel_arguments grpc.max_concurrent_streams=8`.
- setting the gRPC memory quota for the requests buffer - the default is 2GB. It can be changed with `--grpc_memory_quota=2147483648`. Value `0` invalidates the quota.
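
The default values listed above can be computed for a given machine as follows; this is an illustrative sketch of the documented formulas, not code that queries OVMS itself:

```python
import os

cores = os.cpu_count() or 1

# Defaults described in the list above:
grpc_max_threads = 8 * cores       # overridable with --grpc_max_threads
rest_workers = 4 * cores           # overridable with --rest_workers
max_concurrent_streams = cores     # grpc.max_concurrent_streams default
grpc_memory_quota = 2 * 1024**3    # 2GB request-buffer quota (--grpc_memory_quota)
```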
2 changes: 1 addition & 1 deletion docs/speech_recognition/reference.md
@@ -59,7 +59,7 @@ The calculator supports the following `node_options` for tuning the pipeline con
We recommend using [export script](../../demos/common/export_models/README.md) to prepare models directory structure for serving.
Check [supported models](https://openvinotoolkit.github.io/openvino.genai/docs/supported-models/#speech-recognition-models).

- ### Text to speech calculator limitations
+ ### Speech to text calculator limitations
Collaborator commented:

> @michalkulakowski is it correct?

- Streaming is not supported

## References
2 changes: 1 addition & 1 deletion docs/starting_server.md
@@ -1,6 +1,6 @@
# Starting the Server {#ovms_docs_serving_model}

- There are two method for passing to the model server information about the models and their configuration:
+ There are two methods for passing to the model server information about the models and their configuration:
- via CLI parameters - for a single model or pipeline
- via config file in json format - for any number of models and pipelines
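
For the config-file method, a minimal `config.json` sketch; the model name and base path are placeholders to adapt to your repository layout:

```
{
  "model_config_list": [
    {
      "config": {
        "name": "resnet",
        "base_path": "/models/resnet"
      }
    }
  ]
}
```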
