Add Monitor conversations page for debounced scoring #2371
Open
anastasiaguspan wants to merge 7 commits into main from monitor-conv-docs-1924
+52
−0
Commits
ad0b7ea  Add Monitor conversations page for debounced scoring (WBDOCS-1924)  (anastasiaguspan)
ca7e9d7  Apply review edits to monitor conversations page (WBDOCS-1924)  (anastasiaguspan)
a06f422  Update weave/guides/evaluation/monitor-conversations.mdx  (anastasiaguspan)
b7dfc18  Update weave/guides/evaluation/monitor-conversations.mdx  (anastasiaguspan)
577ab53  Update weave/guides/evaluation/monitor-conversations.mdx  (anastasiaguspan)
2b892fc  adding PR-requested update  (anastasiaguspan)
e56a924  Merge branch 'main' into monitor-conv-docs-1924  (anastasiaguspan)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,51 @@ | ||
| --- | ||
| title: "Monitor conversations" | ||
| description: "Score grouped Calls after a conversation goes idle using debounced scoring" | ||
| --- | ||
|
|
||
| Monitors use LLM judges (scorers) to passively score production traffic and surface trends and issues in your LLM applications. | ||
|
|
||
| When your application handles multi-turn conversations, such as audio calls or chat threads, you can use debounced scoring to score entire conversations rather than individual Calls. Debounced scoring in W&B Weave lets you group related Calls and score them after the conversation goes idle, so your scorer has access to the full context. | ||
|
|
||
| For example, if your application uses OpenAI's Realtime APIs, each trace contains multiple `realtime.response` Calls. Debounced scoring waits for the conversation to go idle, then scores the relevant Calls as a group. | ||
|
|
||
| For general monitor setup, see [Set up monitors](/weave/guides/evaluation/monitors). | ||
|
|
||
| ## Configure debounced scoring | ||
|
|
||
| To enable debounced scoring on a monitor: | ||
|
|
||
| 1. [Create a new monitor](/weave/guides/evaluation/monitors#how-to-create-a-monitor-in-weave) or edit an existing one. | ||
| 2. Toggle **Debounced Scoring** on. This reveals the following fields: | ||
| - **Aggregation field**: The field used to group Calls. Select **Trace Id** to group Calls within a single trace, or **Thread Id** to group Calls across a broader conversation thread. | ||
| - **Aggregation method**: How Calls in the group are scored. Select **Last message** to score only the most recent Call in the group, or **All messages** to include all Calls in the group. | ||
| - **Timeout (minutes)**: How long to wait after the last Call completes before scoring. After the timeout elapses, Weave checks whether a newer Call has arrived in the group. If not, Weave scores the group. | ||
| 3. Configure the **LLM-as-a-judge configuration** section as you would for any monitor. See [Set up monitors](/weave/guides/evaluation/monitors#how-to-create-a-monitor-in-weave) for details on these fields. | ||
| 4. Select **Create monitor** or **Update monitor**. | ||
|
|
||
| ## Choose an aggregation method | ||
| You can aggregate Calls using two different methods: last message or all messages. | ||
| ### Last message (Recommended) | ||
|
|
||
| Use the **Last message** method when each Call in the conversation contains the full conversation history. This is the case when you use OpenAI's Realtime APIs, where every `realtime.response` Call contains the complete audio conversation up to that point. | ||
|
|
||
| Set the **Aggregation field** to **Trace Id** and the **Aggregation method** to **Last message**. After the timeout elapses, Weave scores only the most recent Call in the trace, which already contains the full conversation. | ||
|
|
||
| This method uses fewer resources because the scorer only processes one Call per group. | ||
|
|
||
| ### All messages | ||
|
|
||
| Use the **All messages** method when individual Calls do not contain the full conversation history. In this case, Weave extracts content from every Call in the aggregation group and passes it all to the scorer. | ||
|
|
||
| You can set the **Aggregation field** to **Thread Id** for broader grouping flexibility, and the **Aggregation method** to **All messages**. | ||
|
|
||
| This method uses more resources because the scorer processes every Call in the group. | ||
|
|
||
| ## Timeout considerations | ||
|
|
||
| The timeout value controls the trade-off between scoring latency and accuracy: | ||
|
|
||
| - Shorter timeouts score conversations faster but risk scoring before the conversation is complete. Use shorter timeouts for debugging or when conversations have predictable end points. | ||
| - Longer timeouts wait longer to confirm the conversation is idle, reducing the chance of premature scoring. Use longer timeouts in production, especially for conversations with variable pauses between Calls. Longer timeouts increase server load. | ||
|
|
||
| For example, a timeout of `0.25` minutes (15 seconds) is useful during development, while a timeout of several minutes might be appropriate for production workloads. | ||
anastasiaguspan marked this conversation as resolved.
Should we explicitly mention how debounced scoring differs from monitors in here somewhere? The doc doesn't mention monitoring at all up until this point, and it feels like we're randomly introducing a new concept here.
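For readers following the diff, the debounce behavior described under **Timeout (minutes)** and the two aggregation methods can be pictured with a minimal sketch. This is not Weave's implementation or API. The `Call` and `DebouncedScorer` names, the in-memory grouping, and the `>=` idle comparison are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Call:
    group_id: str    # hypothetical stand-in for a Trace Id or Thread Id
    ended_at: float  # completion time, in seconds
    content: str


@dataclass
class DebouncedScorer:
    """Illustrative only: groups Calls and releases a group once it is idle."""
    timeout_minutes: float
    method: str = "last"  # "last" (Last message) or "all" (All messages)
    groups: dict = field(default_factory=dict)

    def add(self, call: Call) -> None:
        # Group Calls by the aggregation field.
        self.groups.setdefault(call.group_id, []).append(call)

    def flush(self, now: float) -> dict:
        """Return {group_id: [content to score]} for groups that have gone idle."""
        timeout_s = self.timeout_minutes * 60
        ready = {}
        for gid, calls in list(self.groups.items()):
            last_end = max(c.ended_at for c in calls)
            # A newer Call resets the idle clock because last_end moves forward.
            if now - last_end >= timeout_s:
                ordered = sorted(calls, key=lambda c: c.ended_at)
                selected = ordered[-1:] if self.method == "last" else ordered
                ready[gid] = [c.content for c in selected]
                del self.groups[gid]
        return ready


scorer = DebouncedScorer(timeout_minutes=0.25, method="last")  # 15 seconds
scorer.add(Call("trace-1", ended_at=0.0, content="turn 1"))
scorer.add(Call("trace-1", ended_at=10.0, content="turn 1 + turn 2"))
print(scorer.flush(now=20.0))  # {} (only 10 s idle, so nothing is scored yet)
print(scorer.flush(now=25.0))  # {'trace-1': ['turn 1 + turn 2']}
```

With `method="all"`, the same idle check applies, but every Call in the group is handed to the scorer in completion order. This mirrors the resource trade-off the page describes between the two methods.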