feat: agent evals — retrieval, scoring, and eval canary#91

Draft
kylehounslow wants to merge 5 commits into opensearch-project:main from kylehounslow:feat/agent-evals

Conversation

@kylehounslow
Collaborator

What

End-to-end agent evaluation loop for observability-stack:

  1. Retrieve agent traces from OpenSearch via genai-observability-sdk-py
  2. Run evaluations (deterministic or LLM-as-judge via Bedrock)
  3. Write score spans back to OpenSearch as OTel spans
  4. Score spans appear in the same trace waterfall as agent spans

Draft: Depends on genai-observability-sdk-py PR #19 (OpenSearchTraceRetriever) being merged and published. Will mark ready for review once that lands.

Changes

examples/agent-evals/genai-sdk/

CLI tool for running evals against stored traces.

# Mock evaluator (no AWS needed)
uv run main.py --mock --trace-id <trace_id>

# LLM-as-judge via Bedrock Claude
uv run main.py --trace-id <trace_id>

# By conversation/session ID
uv run main.py <session_id>
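The three invocations above imply a small CLI surface: an optional positional session ID, a `--trace-id` flag that skips session lookup, and a `--mock` switch. A minimal sketch of how that might be wired with argparse (the parser details are an assumption; only the flags themselves come from this PR):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Hypothetical sketch of main.py's CLI surface; the real script may differ.
    # --mock and --trace-id are the flags documented above.
    parser = argparse.ArgumentParser(description="Run evals against stored traces")
    parser.add_argument("session_id", nargs="?",
                        help="conversation/session ID (optional when --trace-id is given)")
    parser.add_argument("--trace-id",
                        help="target a specific trace, skipping session lookup")
    parser.add_argument("--mock", action="store_true",
                        help="use the mock evaluator (no AWS needed)")
    return parser

args = build_parser().parse_args(["--mock", "--trace-id", "abc123"])
```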

docker-compose/agent-eval-canary/

Containerized eval canary that runs on a timer:

  • Polls OpenSearch for recent agent traces (configurable lookback window)
  • Skips traces that already have an evaluation span
  • Runs deterministic completeness scoring (input/output/tools present)
  • Writes evaluation completeness score spans back via score()
  • Provides steady eval data for UI/UX dashboard development
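The deterministic completeness check described above reduces to a pure function over a span's fields. A sketch, assuming illustrative key names (the real canary reads span attributes via the genai-sdk, not this dict shape):

```python
def completeness_score(span: dict) -> float:
    """Score a trace's root span by presence of input, output, and tool calls.

    The keys below are illustrative stand-ins for whatever attributes the
    real agent spans carry; only the input/output/tools criteria come from
    the canary description.
    """
    checks = [
        bool(span.get("input")),       # agent received a prompt
        bool(span.get("output")),      # agent produced a response
        bool(span.get("tool_calls")),  # agent invoked at least one tool
    ]
    return sum(checks) / len(checks)

full = completeness_score({"input": "plan a trip",
                           "output": "day 1: ...",
                           "tool_calls": ["weather"]})
```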

Uses only the genai-sdk API — no raw OpenSearch client:

retriever = OpenSearchTraceRetriever(host=..., auth=...)
roots = retriever.list_root_spans(services=TARGET_SERVICES, since_minutes=15)
evaluated = retriever.find_evaluated_trace_ids([r.trace_id for r in roots])
# score unevaluated traces...
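The elided "score unevaluated traces" step is essentially an order-preserving set difference over trace IDs. A hedged sketch of that dedup filter (the eventual `score()` call and its signature are assumptions based on the description above, so it is left as a comment):

```python
def select_unevaluated(root_trace_ids, evaluated_trace_ids):
    # Preserve retrieval order while skipping traces that already carry
    # an evaluation span -- the canary's dedup step.
    evaluated = set(evaluated_trace_ids)
    return [tid for tid in root_trace_ids if tid not in evaluated]

to_score = select_unevaluated(["t1", "t2", "t3"], ["t2"])
# for tid in to_score:
#     ...run completeness scoring and write the result back via score()...
```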

Configuration (.env)

EVAL_CANARY_INTERVAL=120        # poll interval in seconds
EVAL_CANARY_LOOKBACK_MINUTES=15 # how far back to look for traces
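Reading these two variables with the documented values as fallbacks might look like the following sketch (the parsing helper is hypothetical, not the canary's actual code; only the variable names and defaults come from the .env above):

```python
import os

def load_canary_config(env=os.environ):
    # Defaults mirror the .env values documented above.
    return {
        "interval_seconds": int(env.get("EVAL_CANARY_INTERVAL", "120")),
        "lookback_minutes": int(env.get("EVAL_CANARY_LOOKBACK_MINUTES", "15")),
    }

config = load_canary_config({})  # empty env -> documented defaults
```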

Files

  • docker-compose/agent-eval-canary/ — Dockerfile, pyproject.toml, uv.lock, eval_canary.py
  • docker-compose.examples.yml — agent-eval-canary service definition
  • .env — canary configuration variables
  • examples/agent-evals/genai-sdk/ — main.py, README.md, pyproject.toml, uv.lock

Dependencies

  • genai-observability-sdk-py PR #19 (OpenSearchTraceRetriever): must be merged and published before this PR leaves draft

Testing

  • Eval canary verified E2E against local observability-stack (finch compose)
  • Scores agent traces from travel-planner, weather-agent, events-agent examples
  • Score spans visible in OpenSearch Dashboards trace waterfall

End-to-end evaluation loop:
1. Retrieve agent traces from OpenSearch (via genai-observability-sdk-py)
2. Run HelpfulnessEvaluator (strands-agents/evals, Bedrock Claude)
3. Write score spans back to OpenSearch via score()

Score spans follow OTel GenAI semantic conventions and appear in the
same trace waterfall as the original agent spans.

Signed-off-by: Kyle Hounslow <kylhouns@amazon.com>
- Add --trace-id flag to target a specific trace (skip session lookup)
- Make session_id positional arg optional when --trace-id is used
- Update README with mock and LLM-as-judge usage examples

Signed-off-by: Kyle Hounslow <kylhouns@amazon.com>
Polls OpenSearch every N seconds for recent agent traces that don't
yet have an evaluation span. Runs deterministic completeness scoring
(input/output/tools) and writes score spans back via genai-sdk score().

Provides steady eval data for UI/UX dashboard development.

Signed-off-by: Kyle Hounslow <kylhouns@amazon.com>