From ad0b7eaacfc82212febe8be1c511566727c0b6db Mon Sep 17 00:00:00 2001
From: anastasiaguspan
Date: Thu, 26 Mar 2026 14:06:07 -0400
Subject: [PATCH 1/6] Add Monitor conversations page for debounced scoring (WBDOCS-1924)

New how-to guide covering debounced scoring for grouped calls in
multi-turn conversations and audio threads. Documents the Aggregation
field, Aggregation method, and Timeout settings.

Made-with: Cursor
---
 docs.json                                     |  1 +
 .../evaluation/monitor-conversations.mdx      | 49 +++++++++++++++++++
 weave/guides/evaluation/monitors.mdx          |  2 +
 3 files changed, 52 insertions(+)
 create mode 100644 weave/guides/evaluation/monitor-conversations.mdx

diff --git a/docs.json b/docs.json
index 08f4b95890..262e8dd187 100644
--- a/docs.json
+++ b/docs.json
@@ -802,6 +802,7 @@
  },
  "weave/guides/tracking/redact-pii",
  "weave/guides/evaluation/monitors",
+ "weave/guides/evaluation/monitor-conversations",
  "weave/guides/evaluation/guardrails",
  "weave/guides/tracking/otel"
  ]
diff --git a/weave/guides/evaluation/monitor-conversations.mdx b/weave/guides/evaluation/monitor-conversations.mdx
new file mode 100644
index 0000000000..d5bce0fb44
--- /dev/null
+++ b/weave/guides/evaluation/monitor-conversations.mdx
@@ -0,0 +1,49 @@
+---
+title: "Monitor conversations"
+description: "Score grouped calls after a conversation goes idle using debounced scoring"
+---
+
+When your application handles multi-turn conversations, such as audio calls or chat threads, you might need to score the entire conversation rather than individual calls. Debounced scoring lets you group related calls and score them after the conversation goes idle, so your scorer has access to the full context.
+
+For example, if your application uses OpenAI's Realtime APIs, each trace contains multiple `realtime.response` calls. Debounced scoring waits for the conversation to go idle, then scores the relevant calls as a group.
+
+For general monitor setup, see [Set up monitors](/weave/guides/evaluation/monitors).
+
+## Configure debounced scoring
+
+To enable debounced scoring on a monitor:
+
+1. [Create a new monitor](/weave/guides/evaluation/monitors#how-to-create-a-monitor-in-weave) or edit an existing one.
+2. Toggle **Debounced Scoring** on. This reveals the following fields:
+   - **Aggregation field**: The field used to group calls. Select **Trace Id** to group calls within a single trace, or **Thread Id** to group calls across a broader conversation thread.
+   - **Aggregation method**: How calls in the group are scored. Select **Last message** to score only the most recent call in the group, or **All messages** to include all calls in the group.
+   - **Timeout (minutes)**: How long to wait after the last call completes before scoring. After the timeout elapses, Weave checks whether a newer call has arrived in the group. If not, Weave scores the group.
+3. Configure the **LLM-as-a-judge configuration** section as you would for any monitor. See [Set up monitors](/weave/guides/evaluation/monitors#how-to-create-a-monitor-in-weave) for details on these fields.
+4. Select **Create monitor** or **Update monitor**.
+
+## Choose an aggregation method
+
+### Last message (recommended)
+
+Use the **Last message** method when each call in the conversation contains the full conversation history. This is the case when you use OpenAI's Realtime APIs, where every `realtime.response` call contains the complete audio conversation up to that point.
+
+Set the **Aggregation field** to **Trace Id** and the **Aggregation method** to **Last message**. After the timeout elapses, Weave scores only the most recent call in the trace, which already contains the full conversation.
+
+This method uses fewer resources because only one call per group is scored.
+
+### All messages
+
+Use the **All messages** method when individual calls do not contain the full conversation history. In this case, Weave extracts content from every call in the aggregation group and passes it all to the scorer.
+
+Set the **Aggregation field** to **Thread Id** for broader grouping flexibility, and the **Aggregation method** to **All messages**.
+
+This method uses more resources because the scorer processes every call in the group.
+
+## Timeout considerations
+
+The timeout value controls the trade-off between scoring latency and accuracy:
+
+- **Shorter timeouts** score conversations faster but risk scoring before the conversation is complete. Use shorter timeouts for debugging or when conversations have predictable end points.
+- **Longer timeouts** wait longer to confirm the conversation is idle, reducing the chance of premature scoring. Use longer timeouts in production, especially for conversations with variable pauses between calls. Longer timeouts increase server load.
+
+For example, a timeout of `0.25` minutes (15 seconds) is useful during development, while a timeout of several minutes is more appropriate for production workloads.
diff --git a/weave/guides/evaluation/monitors.mdx b/weave/guides/evaluation/monitors.mdx
index 4746129500..ccee321ed3 100644
--- a/weave/guides/evaluation/monitors.mdx
+++ b/weave/guides/evaluation/monitors.mdx
@@ -9,6 +9,8 @@ You can monitor text, images, and audio in your application's input and output.
 
 Monitors require no code changes to your application. Set them up using the W&B Weave UI.
 
+To score grouped calls in multi-turn conversations or audio threads after they go idle, see [Monitor conversations](/weave/guides/evaluation/monitor-conversations).
+
 If you need to actively intervene in your application's behavior based on scores, use [guardrails](/weave/guides/evaluation/guardrails) instead.
 
 ## How to create a monitor in Weave

From ca7e9d72cd8cc2d241a222d5a8478a7187f1d09f Mon Sep 17 00:00:00 2001
From: anastasiaguspan
Date: Thu, 26 Mar 2026 14:39:24 -0400
Subject: [PATCH 2/6] Apply review edits to monitor conversations page (WBDOCS-1924)

Capitalize Calls to match Weave terminology, remove cross-reference
from monitors.mdx, and soften wording in timeout section.

Made-with: Cursor
---
 .../evaluation/monitor-conversations.mdx      | 32 +++++++++----------
 weave/guides/evaluation/monitors.mdx          |  2 --
 2 files changed, 16 insertions(+), 18 deletions(-)

diff --git a/weave/guides/evaluation/monitor-conversations.mdx b/weave/guides/evaluation/monitor-conversations.mdx
index d5bce0fb44..24ff344ffd 100644
--- a/weave/guides/evaluation/monitor-conversations.mdx
+++ b/weave/guides/evaluation/monitor-conversations.mdx
@@ -1,11 +1,11 @@
 ---
 title: "Monitor conversations"
-description: "Score grouped calls after a conversation goes idle using debounced scoring"
+description: "Score grouped Calls after a conversation goes idle using debounced scoring"
 ---
 
-When your application handles multi-turn conversations, such as audio calls or chat threads, you might need to score the entire conversation rather than individual calls. Debounced scoring lets you group related calls and score them after the conversation goes idle, so your scorer has access to the full context.
+When your application handles multi-turn conversations, such as audio calls or chat threads, you might need to score the entire conversation rather than individual Calls. Debounced scoring in W&B Weave lets you group related Calls and score them after the conversation goes idle, so your scorer has access to the full context.
 
-For example, if your application uses OpenAI's Realtime APIs, each trace contains multiple `realtime.response` calls. Debounced scoring waits for the conversation to go idle, then scores the relevant calls as a group.
+For example, if your application uses OpenAI's Realtime APIs, each trace contains multiple `realtime.response` Calls. Debounced scoring waits for the conversation to go idle, then scores the relevant Calls as a group.
 
 For general monitor setup, see [Set up monitors](/weave/guides/evaluation/monitors).
 
@@ -15,35 +15,35 @@ To enable debounced scoring on a monitor:
 
 1. [Create a new monitor](/weave/guides/evaluation/monitors#how-to-create-a-monitor-in-weave) or edit an existing one.
 2. Toggle **Debounced Scoring** on. This reveals the following fields:
-   - **Aggregation field**: The field used to group calls. Select **Trace Id** to group calls within a single trace, or **Thread Id** to group calls across a broader conversation thread.
-   - **Aggregation method**: How calls in the group are scored. Select **Last message** to score only the most recent call in the group, or **All messages** to include all calls in the group.
-   - **Timeout (minutes)**: How long to wait after the last call completes before scoring. After the timeout elapses, Weave checks whether a newer call has arrived in the group. If not, Weave scores the group.
+   - **Aggregation field**: The field used to group Calls. Select **Trace Id** to group Calls within a single trace, or **Thread Id** to group Calls across a broader conversation thread.
+   - **Aggregation method**: How Calls in the group are scored. Select **Last message** to score only the most recent Call in the group, or **All messages** to include all Calls in the group.
+   - **Timeout (minutes)**: How long to wait after the last Call completes before scoring. After the timeout elapses, Weave checks whether a newer Call has arrived in the group. If not, Weave scores the group.
 3. Configure the **LLM-as-a-judge configuration** section as you would for any monitor. See [Set up monitors](/weave/guides/evaluation/monitors#how-to-create-a-monitor-in-weave) for details on these fields.
 4. Select **Create monitor** or **Update monitor**.
 
 ## Choose an aggregation method
 
-### Last message (recommended)
+### Last message (Recommended)
 
-Use the **Last message** method when each call in the conversation contains the full conversation history. This is the case when you use OpenAI's Realtime APIs, where every `realtime.response` call contains the complete audio conversation up to that point.
+Use the **Last message** method when each Call in the conversation contains the full conversation history. This is the case when you use OpenAI's Realtime APIs, where every `realtime.response` Call contains the complete audio conversation up to that point.
 
-Set the **Aggregation field** to **Trace Id** and the **Aggregation method** to **Last message**. After the timeout elapses, Weave scores only the most recent call in the trace, which already contains the full conversation.
+Set the **Aggregation field** to **Trace Id** and the **Aggregation method** to **Last message**. After the timeout elapses, Weave scores only the most recent Call in the trace, which already contains the full conversation.
 
-This method uses fewer resources because only one call per group is scored.
+This method uses fewer resources because only one Call per group is scored.
 
 ### All messages
 
-Use the **All messages** method when individual calls do not contain the full conversation history. In this case, Weave extracts content from every call in the aggregation group and passes it all to the scorer.
+Use the **All messages** method when individual Calls do not contain the full conversation history. In this case, Weave extracts content from every Call in the aggregation group and passes it all to the scorer.
 
-Set the **Aggregation field** to **Thread Id** for broader grouping flexibility, and the **Aggregation method** to **All messages**.
+You can set the **Aggregation field** to **Thread Id** for broader grouping flexibility, and the **Aggregation method** to **All messages**.
 
-This method uses more resources because the scorer processes every call in the group.
+This method uses more resources because the scorer processes every Call in the group.
 
 ## Timeout considerations
 
 The timeout value controls the trade-off between scoring latency and accuracy:
 
-- **Shorter timeouts** score conversations faster but risk scoring before the conversation is complete. Use shorter timeouts for debugging or when conversations have predictable end points.
-- **Longer timeouts** wait longer to confirm the conversation is idle, reducing the chance of premature scoring. Use longer timeouts in production, especially for conversations with variable pauses between calls. Longer timeouts increase server load.
+- Shorter timeouts score conversations faster but risk scoring before the conversation is complete. Use shorter timeouts for debugging or when conversations have predictable end points.
+- Longer timeouts wait longer to confirm the conversation is idle, reducing the chance of premature scoring. Use longer timeouts in production, especially for conversations with variable pauses between Calls. Longer timeouts increase server load.
 
-For example, a timeout of `0.25` minutes (15 seconds) is useful during development, while a timeout of several minutes is more appropriate for production workloads.
+For example, a timeout of `0.25` minutes (15 seconds) is useful during development, while a timeout of several minutes might be appropriate for production workloads.
diff --git a/weave/guides/evaluation/monitors.mdx b/weave/guides/evaluation/monitors.mdx
index ccee321ed3..4746129500 100644
--- a/weave/guides/evaluation/monitors.mdx
+++ b/weave/guides/evaluation/monitors.mdx
@@ -9,8 +9,6 @@ You can monitor text, images, and audio in your application's input and output.
 
 Monitors require no code changes to your application. Set them up using the W&B Weave UI.
 
-To score grouped calls in multi-turn conversations or audio threads after they go idle, see [Monitor conversations](/weave/guides/evaluation/monitor-conversations).
-
 If you need to actively intervene in your application's behavior based on scores, use [guardrails](/weave/guides/evaluation/guardrails) instead.
 
 ## How to create a monitor in Weave

From a06f4227f3e60076aa9f2ad68fec51ea14c6f68a Mon Sep 17 00:00:00 2001
From: Anastasia Guspan
Date: Tue, 31 Mar 2026 08:10:49 -0400
Subject: [PATCH 3/6] Update weave/guides/evaluation/monitor-conversations.mdx

Co-authored-by: Dan Brian
---
 weave/guides/evaluation/monitor-conversations.mdx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/weave/guides/evaluation/monitor-conversations.mdx b/weave/guides/evaluation/monitor-conversations.mdx
index 24ff344ffd..96cfd83875 100644
--- a/weave/guides/evaluation/monitor-conversations.mdx
+++ b/weave/guides/evaluation/monitor-conversations.mdx
@@ -3,7 +3,7 @@ title: "Monitor conversations"
 description: "Score grouped Calls after a conversation goes idle using debounced scoring"
 ---
 
-When your application handles multi-turn conversations, such as audio calls or chat threads, you might need to score the entire conversation rather than individual Calls. Debounced scoring in W&B Weave lets you group related Calls and score them after the conversation goes idle, so your scorer has access to the full context.
+When your application handles multi-turn conversations, such as audio calls or chat threads, you can use debounced scoring to score entire conversations rather than individual Calls. Debounced scoring in W&B Weave lets you group related Calls and score them after the conversation goes idle, so your scorer has access to the full context.
 
 For example, if your application uses OpenAI's Realtime APIs, each trace contains multiple `realtime.response` Calls. Debounced scoring waits for the conversation to go idle, then scores the relevant Calls as a group.
 

From b7dfc1809d46b57def4e1f8e208fbae2095c60c8 Mon Sep 17 00:00:00 2001
From: Anastasia Guspan
Date: Tue, 31 Mar 2026 08:11:34 -0400
Subject: [PATCH 4/6] Update weave/guides/evaluation/monitor-conversations.mdx

Co-authored-by: Dan Brian
---
 weave/guides/evaluation/monitor-conversations.mdx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/weave/guides/evaluation/monitor-conversations.mdx b/weave/guides/evaluation/monitor-conversations.mdx
index 96cfd83875..5825a7b4dc 100644
--- a/weave/guides/evaluation/monitor-conversations.mdx
+++ b/weave/guides/evaluation/monitor-conversations.mdx
@@ -22,7 +22,7 @@ To enable debounced scoring on a monitor:
 4. Select **Create monitor** or **Update monitor**.
 
 ## Choose an aggregation method
-
+You can aggregate Calls using two different methods: last message or all messages.
 ### Last message (Recommended)
 
 Use the **Last message** method when each Call in the conversation contains the full conversation history. This is the case when you use OpenAI's Realtime APIs, where every `realtime.response` Call contains the complete audio conversation up to that point.

From 577ab53d2872b1d31dfdfa07155e2cc60277af3a Mon Sep 17 00:00:00 2001
From: Anastasia Guspan
Date: Tue, 31 Mar 2026 08:11:58 -0400
Subject: [PATCH 5/6] Update weave/guides/evaluation/monitor-conversations.mdx

Co-authored-by: Dan Brian
---
 weave/guides/evaluation/monitor-conversations.mdx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/weave/guides/evaluation/monitor-conversations.mdx b/weave/guides/evaluation/monitor-conversations.mdx
index 5825a7b4dc..a861f09622 100644
--- a/weave/guides/evaluation/monitor-conversations.mdx
+++ b/weave/guides/evaluation/monitor-conversations.mdx
@@ -29,7 +29,7 @@ Use the **Last message** method when each Call in the conversation contains the
 
 Set the **Aggregation field** to **Trace Id** and the **Aggregation method** to **Last message**. After the timeout elapses, Weave scores only the most recent Call in the trace, which already contains the full conversation.
 
-This method uses fewer resources because only one Call per group is scored.
+This method uses fewer resources because the scorer only processes one Call per group.
 
 ### All messages
 

From 2b892fcd37be2b86a98739483a18d3b0a2714f6b Mon Sep 17 00:00:00 2001
From: anastasiaguspan
Date: Tue, 31 Mar 2026 09:52:47 -0400
Subject: [PATCH 6/6] adding PR-requested update

---
 weave/guides/evaluation/monitor-conversations.mdx | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/weave/guides/evaluation/monitor-conversations.mdx b/weave/guides/evaluation/monitor-conversations.mdx
index a861f09622..b5dabf0162 100644
--- a/weave/guides/evaluation/monitor-conversations.mdx
+++ b/weave/guides/evaluation/monitor-conversations.mdx
@@ -3,6 +3,8 @@ title: "Monitor conversations"
 description: "Score grouped Calls after a conversation goes idle using debounced scoring"
 ---
 
+Monitors use LLM judges (scorers) to passively score production traffic and surface trends and issues in your LLM applications.
+
 When your application handles multi-turn conversations, such as audio calls or chat threads, you can use debounced scoring to score entire conversations rather than individual Calls. Debounced scoring in W&B Weave lets you group related Calls and score them after the conversation goes idle, so your scorer has access to the full context.
 
 For example, if your application uses OpenAI's Realtime APIs, each trace contains multiple `realtime.response` Calls. Debounced scoring waits for the conversation to go idle, then scores the relevant Calls as a group.