feat: agent evals — retrieval, scoring, and eval canary#91

Draft
kylehounslow wants to merge 5 commits into opensearch-project:main from kylehounslow:feat/agent-evals

Conversation

@kylehounslow
Collaborator

What

End-to-end agent evaluation loop for observability-stack:

  1. Retrieve agent traces from OpenSearch via genai-observability-sdk-py
  2. Run evaluations (deterministic or LLM-as-judge via Bedrock)
  3. Write score spans back to OpenSearch as OTel spans
  4. Score spans appear in the same trace waterfall as agent spans

Draft: Depends on genai-observability-sdk-py PR #19 (OpenSearchTraceRetriever) being merged and published. Will mark ready for review once that lands.

Changes

examples/agent-evals/genai-sdk/

CLI tool for running evals against stored traces.

# Mock evaluator (no AWS needed)
uv run main.py --mock --trace-id <trace_id>

# LLM-as-judge via Bedrock Claude
uv run main.py --trace-id <trace_id>

# By conversation/session ID
uv run main.py <session_id>
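The three invocations above imply a small CLI surface: an optional positional session ID, a `--trace-id` flag that skips session lookup, and a `--mock` switch. A minimal sketch of how that might be wired with argparse (the parser details are an assumption; only the flags themselves come from this PR):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Hypothetical sketch of main.py's CLI surface; the real script may differ.
    # --mock and --trace-id are the flags documented above.
    parser = argparse.ArgumentParser(description="Run evals against stored traces")
    parser.add_argument("session_id", nargs="?",
                        help="conversation/session ID (optional when --trace-id is given)")
    parser.add_argument("--trace-id",
                        help="target a specific trace, skipping session lookup")
    parser.add_argument("--mock", action="store_true",
                        help="use the mock evaluator (no AWS needed)")
    return parser

args = build_parser().parse_args(["--mock", "--trace-id", "abc123"])
```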

docker-compose/agent-eval-canary/

Containerized eval canary that runs on a timer:

  • Polls OpenSearch for recent agent traces (configurable lookback window)
  • Skips traces that already have an evaluation span
  • Runs deterministic completeness scoring (input/output/tools present)
  • Writes evaluation completeness score spans back via score()
  • Provides steady eval data for UI/UX dashboard development
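The deterministic completeness check described above reduces to a pure function over a span's fields. A sketch, assuming illustrative key names (the real canary reads span attributes via the genai-sdk, not this dict shape):

```python
def completeness_score(span: dict) -> float:
    """Score a trace's root span by presence of input, output, and tool calls.

    The keys below are illustrative stand-ins for whatever attributes the
    real agent spans carry; only the input/output/tools criteria come from
    the canary description.
    """
    checks = [
        bool(span.get("input")),       # agent received a prompt
        bool(span.get("output")),      # agent produced a response
        bool(span.get("tool_calls")),  # agent invoked at least one tool
    ]
    return sum(checks) / len(checks)

full = completeness_score({"input": "plan a trip",
                           "output": "day 1: ...",
                           "tool_calls": ["weather"]})
```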

Uses only the genai-sdk API — no raw OpenSearch client:

retriever = OpenSearchTraceRetriever(host=..., auth=...)
roots = retriever.list_root_spans(services=TARGET_SERVICES, since_minutes=15)
evaluated = retriever.find_evaluated_trace_ids([r.trace_id for r in roots])
# score unevaluated traces...
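The elided "score unevaluated traces" step is essentially an order-preserving set difference over trace IDs. A hedged sketch of that dedup filter (the eventual `score()` call and its signature are assumptions based on the description above, so it is left as a comment):

```python
def select_unevaluated(root_trace_ids, evaluated_trace_ids):
    # Preserve retrieval order while skipping traces that already carry
    # an evaluation span -- the canary's dedup step.
    evaluated = set(evaluated_trace_ids)
    return [tid for tid in root_trace_ids if tid not in evaluated]

to_score = select_unevaluated(["t1", "t2", "t3"], ["t2"])
# for tid in to_score:
#     ...run completeness scoring and write the result back via score()...
```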

Configuration (.env)

EVAL_CANARY_INTERVAL=120        # poll interval in seconds
EVAL_CANARY_LOOKBACK_MINUTES=15 # how far back to look for traces
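Reading these two variables with the documented values as fallbacks might look like the following sketch (the parsing helper is hypothetical, not the canary's actual code; only the variable names and defaults come from the .env above):

```python
import os

def load_canary_config(env=os.environ):
    # Defaults mirror the .env values documented above.
    return {
        "interval_seconds": int(env.get("EVAL_CANARY_INTERVAL", "120")),
        "lookback_minutes": int(env.get("EVAL_CANARY_LOOKBACK_MINUTES", "15")),
    }

config = load_canary_config({})  # empty env -> documented defaults
```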

Files

  • docker-compose/agent-eval-canary/ — Dockerfile, pyproject.toml, uv.lock, eval_canary.py
  • docker-compose.examples.yml — agent-eval-canary service definition
  • .env — canary configuration variables
  • examples/agent-evals/genai-sdk/ — main.py, README.md, pyproject.toml, uv.lock

Dependencies

  • genai-observability-sdk-py PR #19 (OpenSearchTraceRetriever): must be merged and published before this PR leaves draft

Testing

  • Eval canary verified E2E against local observability-stack (finch compose)
  • Scores agent traces from travel-planner, weather-agent, events-agent examples
  • Score spans visible in OpenSearch Dashboards trace waterfall

End-to-end evaluation loop:
1. Retrieve agent traces from OpenSearch (via genai-observability-sdk-py)
2. Run HelpfulnessEvaluator (strands-agents/evals, Bedrock Claude)
3. Write score spans back to OpenSearch via score()

Score spans follow OTel GenAI semantic conventions and appear in the
same trace waterfall as the original agent spans.

Signed-off-by: Kyle Hounslow <kylhouns@amazon.com>
- Add --trace-id flag to target a specific trace (skip session lookup)
- Make session_id positional arg optional when --trace-id is used
- Update README with mock and LLM-as-judge usage examples

Signed-off-by: Kyle Hounslow <kylhouns@amazon.com>
Polls OpenSearch every N seconds for recent agent traces that don't
yet have an evaluation span. Runs deterministic completeness scoring
(input/output/tools) and writes score spans back via genai-sdk score().

Provides steady eval data for UI/UX dashboard development.

Signed-off-by: Kyle Hounslow <kylhouns@amazon.com>