diff --git a/src/data/nav/aitransport.ts b/src/data/nav/aitransport.ts
index dbfddf7cb5..1f8dfa12f3 100644
--- a/src/data/nav/aitransport.ts
+++ b/src/data/nav/aitransport.ts
@@ -29,6 +29,10 @@ export default {
name: 'Message per token',
link: '/docs/ai-transport/token-streaming/message-per-token',
},
+ {
+ name: 'Token streaming limits',
+ link: '/docs/ai-transport/token-streaming/token-rate-limits',
+ },
],
},
{
diff --git a/src/pages/docs/ai-transport/token-streaming/token-rate-limits.mdx b/src/pages/docs/ai-transport/token-streaming/token-rate-limits.mdx
new file mode 100644
index 0000000000..b5b3792e7a
--- /dev/null
+++ b/src/pages/docs/ai-transport/token-streaming/token-rate-limits.mdx
@@ -0,0 +1,74 @@
+---
+title: Token streaming limits
+meta_description: "Learn how token streaming interacts with Ably message limits and how to ensure your application delivers consistent performance."
+---
+
+LLM token streaming introduces high-rate or bursty traffic patterns to your application, with some models outputting upwards of 150 distinct events (i.e. tokens or response deltas) per second. Output rates can vary unpredictably over the lifetime of a response stream, and you have limited control over third-party model behaviour. AI Transport provides functionality to help you stay within your [rate limits](/docs/platform/pricing/limits) while delivering a great experience to your users.
+
+Ably's limits divide into two categories:
+
+1. Limits relating to usage across an account, such as the total number of messages sent in a month, or the aggregate instantaneous message rate across all connections and channels
+2. Limits relating to the capacity of a single resource, such as a connection or a channel
+
+Limits in the first category exist to protect against accidental spikes or deliberate abuse. Provided that your package is sized correctly for your use case, valid traffic should not hit these limits.
+
+The limits in the second category, however, cannot be increased arbitrarily: they exist to protect the integrity of the service. Because they apply to individual connections and channels, they are the most relevant to LLM token streaming use cases. The following sections discuss these limits in detail.
+
+## Message-per-response
+
+The [message-per-response](/docs/ai-transport/token-streaming/message-per-response) pattern includes automatic rate limit protection. AI Transport prevents a single response stream from reaching the message rate limit for a connection by rolling up multiple appends into a single published message:
+
+1. Your agent streams tokens to the channel at the model's output rate
+2. Ably publishes the first token immediately, then automatically rolls up subsequent tokens on receipt
+3. Clients receive the same content, delivered in fewer discrete messages
+
+By default, a single response stream is delivered at 25 messages per second or the model output rate, whichever is lower. This means you can publish two simultaneous response streams on the same channel or connection with any [Ably package](/docs/platform/pricing#packages), because each stream consumes at most half of the [connection inbound message rate](/docs/platform/pricing/limits#connection). You are charged for the number of published messages, not for the number of streamed tokens.
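+
+To estimate how many simultaneous response streams a single connection can carry, compare the per-stream message rate (set by the rollup window) against your connection inbound limit. The figures in the following sketch are illustrative assumptions, not guaranteed package limits; check the [limits page](/docs/platform/pricing/limits#connection) for your package's actual values:
+
+```javascript
+const connectionInboundLimit = 50; // messages per second (assumed for illustration)
+const modelOutputRate = 150;       // tokens per second (assumed for illustration)
+const appendRollupWindow = 40;     // ms (the default)
+
+// A 0ms window disables rollup, so messages flow at the model output rate.
+const perStreamRate = appendRollupWindow === 0
+  ? modelOutputRate
+  : Math.min(1000 / appendRollupWindow, modelOutputRate);
+
+// Number of response streams the connection can carry concurrently.
+const maxConcurrentStreams = Math.floor(connectionInboundLimit / perStreamRate);
+console.log(perStreamRate, maxConcurrentStreams); // 25 messages/s per stream, 2 streams
+```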
+
+### Configuring rollup behaviour
+
+Ably concatenates all appends for a single response that are received during the rollup window into one published message. You can specify the rollup window for a particular connection by setting the `appendRollupWindow` [transport parameter](/docs/api/realtime-sdk#client-options). This lets you control how much of the connection message rate a single response stream can consume, as well as your consumption costs.
+
+| `appendRollupWindow` | Maximum message rate for a single response |
+|---|---|
+| 0ms | Model output rate |
+| 20ms | 50 messages/s |
+| 40ms *(default)* | 25 messages/s |
+| 100ms | 10 messages/s |
+| 500ms *(max)* | 2 messages/s |
+
+The following example code demonstrates establishing a connection to Ably with `appendRollupWindow` set to 100ms:
+
+```javascript
+const ably = new Ably.Realtime({
+  key: 'your-api-key',
+  transportParams: { appendRollupWindow: 100 }
+});
+```
+
+## Message-per-token
+
+The [message-per-token](/docs/ai-transport/token-streaming/message-per-token) pattern requires you to manage rate limits directly. Each token is published as a separate message, so high-speed model output can hit per-connection or per-channel rate limits, and can consume your overall message allowance quickly.
+
+To stay within limits:
+
+- Calculate your headroom by comparing your model's peak output rate against your package's [connection inbound message rate](/docs/platform/pricing/limits#connection)
+- Account for concurrency by multiplying peak rates by the maximum number of simultaneous streams your application supports
+- If required, batch tokens in your agent before publishing to the SDK, reducing message count while maintaining delivery speed (see the sketch after this list)
+- Enable [server-side batching](/docs/messages/batch#server-side) to reduce the number of messages delivered to your subscribers
+
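+The following is a minimal sketch of agent-side batching, under stated assumptions: `tokenStream` is a hypothetical async iterator of model output, `ai:response` is a hypothetical channel name, and `ably` is an existing `Ably.Realtime` client. It is not an Ably-provided helper:
+
+```javascript
+// Flush buffered tokens to the channel at a fixed interval, rather than one message per token.
+const channel = ably.channels.get('ai:response'); // hypothetical channel name
+const FLUSH_INTERVAL_MS = 50; // at most ~20 messages/s per stream
+
+let buffer = '';
+const timer = setInterval(async () => {
+  if (buffer.length === 0) return;
+  const batch = buffer;
+  buffer = '';
+  await channel.publish('response.delta', batch); // one message per batch of tokens
+}, FLUSH_INTERVAL_MS);
+
+for await (const token of tokenStream) { // tokenStream: assumed model output iterator
+  buffer += token;
+}
+
+// Stop the timer and flush any remaining tokens once the model finishes.
+clearInterval(timer);
+if (buffer.length > 0) {
+  await channel.publish('response.delta', buffer);
+}
+```
+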
+If your application requires higher message rates than your current package allows, [contact Ably](/contact) to discuss options.
+
+## Next steps
+
+- Review [Ably platform limits](/docs/platform/pricing/limits) to understand rate limit thresholds for your package
+- Learn about the [message-per-response](/docs/ai-transport/token-streaming/message-per-response) pattern for automatic rate limit protection
+- Learn about the [message-per-token](/docs/ai-transport/token-streaming/message-per-token) pattern for fine-grained control
diff --git a/src/pages/docs/ai-transport/token-streaming/message-per-response.mdx b/src/pages/docs/ai-transport/token-streaming/message-per-response.mdx
index 5b4f09934b..d37c2ca6d2 100644
--- a/src/pages/docs/ai-transport/token-streaming/message-per-response.mdx
+++ b/src/pages/docs/ai-transport/token-streaming/message-per-response.mdx
@@ -7,6 +7,8 @@ Token streaming with message-per-response is a pattern where every token generat
This pattern is useful for chat-style applications where you want each complete AI response stored as a single message in history, making it easy to retrieve and display multi-response conversation history. Each agent response becomes a single message that grows as tokens are appended, allowing clients joining mid-stream to catch up efficiently without processing thousands of individual tokens.
+The message-per-response pattern includes [automatic rate limit protection](/docs/ai-transport/token-streaming/token-rate-limits#message-per-response) through rollups, making it the recommended approach for most token streaming use cases.
+
## How it works
1. **Initial message**: When an agent response begins, publish an initial message with `message.create` action to the Ably channel with an empty or the first token as content.
@@ -96,6 +98,30 @@ Append only supports concatenating data of the same type as the original message
This pattern allows publishing append operations for multiple concurrent model responses on the same channel. As long as you append to the correct message serial, tokens from different responses will not interfere with each other, and the final concatenated message for each response will contain only the tokens from that response.
+### Configuring rollup behaviour
+
+By default, AI Transport automatically rolls up tokens into messages at a rate of 25 messages per second (using a 40ms rollup window). This protects you from hitting connection rate limits while maintaining a smooth user experience. You can tune this behaviour when establishing your connection to balance message costs against delivery speed:
+
+```javascript
+const realtime = new Ably.Realtime({
+ key: 'your-api-key',
+ transportParams: { appendRollupWindow: 100 } // 10 messages/s
+});
+```
+
+The `appendRollupWindow` parameter controls how many tokens are combined into each published message for a given model output rate. This creates a trade-off between delivery smoothness and the number of concurrent model responses you can stream on a single connection:
+
+- With a shorter rollup window, tokens are published more frequently, creating a smoother, more fluid experience for users as they see the response appear in more fine-grained chunks. However, this consumes more of your [connection's message rate capacity](/docs/platform/pricing/limits#connection), limiting how many simultaneous model responses you can stream.
+- With a longer rollup window, multiple tokens are batched together into fewer messages, allowing you to run more concurrent response streams on the same connection, but users will notice tokens arriving in larger chunks.
+
+The default 40ms window strikes a balance, delivering tokens at 25 messages per second: smooth enough for a great user experience while still allowing you to run two simultaneous response streams on a single connection. If you need to support more concurrent streams, increase the rollup window (up to 500ms), accepting that tokens will arrive in more noticeable batches. Alternatively, instantiate a separate Ably client, which uses its own connection and gives you access to additional message rate capacity.
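+
+As a rough sketch of the second option, assuming the same placeholder API key as the example above and hypothetical channel names, each client maintains its own connection with its own inbound message rate:
+
+```javascript
+const primary = new Ably.Realtime({
+  key: 'your-api-key',
+  transportParams: { appendRollupWindow: 40 }
+});
+const secondary = new Ably.Realtime({
+  key: 'your-api-key',
+  transportParams: { appendRollupWindow: 40 }
+});
+
+// Route some response streams through each client to spread the load across two connections.
+const channelA = primary.channels.get('ai:responses:a'); // hypothetical channel names
+const channelB = secondary.channels.get('ai:responses:b');
+```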
+
## Subscribing to token streams
Subscribers receive different message actions depending on when they join and how they're retrieving messages. Each message has an `action` field that indicates how to process it, and a `serial` field that identifies which message the action relates to:
diff --git a/src/pages/docs/ai-transport/token-streaming/message-per-token.mdx b/src/pages/docs/ai-transport/token-streaming/message-per-token.mdx
index e7d9efe93d..9bddcd8645 100644
--- a/src/pages/docs/ai-transport/token-streaming/message-per-token.mdx
+++ b/src/pages/docs/ai-transport/token-streaming/message-per-token.mdx
@@ -48,6 +48,10 @@ for await (const event of stream) {
This approach maximizes throughput while maintaining ordering guarantees, allowing you to stream tokens as fast as your AI model generates them.
+
+Because each token is published as a separate message, this pattern requires you to manage rate limits directly. See [Token streaming limits](/docs/ai-transport/token-streaming/token-rate-limits#message-per-token) for guidance on staying within your connection and channel limits.
+
## Streaming patterns
Ably is a pub/sub messaging platform, so you can structure your messages however works best for your application. Below are common patterns for streaming tokens, each showing both agent-side publishing and client-side subscription. Choose the approach that fits your use case, or create your own variation.