feat(eval): add observability — SDK sends results to server, UI shows them by vishesh-orkes · Pull Request #253 · agentspan-ai/agentspan

vishesh-orkes · 2026-05-19T08:45:57Z

Eval Observability — Phase 1 (Issue #215)

When running CorrectnessEval from the Python SDK, there was no way to see results beyond the terminal. This PR closes that gap: eval results are automatically persisted and visible in a new Experiments section of the UI, and eval runs are kept separate from production agent traffic.

What's new

Python SDK

CorrectnessEval.run() automatically sends the full suite result to the server after each run. Failures don't surface to the caller — eval logic is never blocked by network issues.
Eval results now carry richer metadata: run name, strategy, ran_by, per-case prompt and output, and optional score / reasoning for LLM-judge assertions.
runtime.push_dataset(name, cases) stores a named, reusable test dataset on the server.
Cases can be filtered by tags when calling run().
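A minimal sketch of two behaviours described above: result posting that never raises to the caller, and tag-based case filtering. This is not the SDK's code. The names `post_result_safely` and `filter_cases_by_tags`, the dict-shaped cases, and the "match any requested tag" rule are assumptions made for illustration.

```python
import logging
from typing import Callable, Iterable, Optional

logger = logging.getLogger("eval_observability_sketch")


def post_result_safely(send: Callable[[dict], None], payload: dict) -> bool:
    """Try to deliver an eval result; never raise to the caller.

    `send` is a hypothetical transport callable (for example, an HTTP POST).
    """
    try:
        send(payload)
        return True
    except Exception as exc:  # deliberately broad: reporting must not break the eval
        logger.warning("could not post eval result: %s", exc)
        return False


def filter_cases_by_tags(cases: Iterable[dict], tags: Optional[Iterable[str]]) -> list:
    """Keep cases carrying at least one requested tag; no tags means keep all.

    The "any tag matches" semantics is an assumption, not taken from the PR.
    """
    if not tags:
        return list(cases)
    wanted = set(tags)
    return [case for case in cases if wanted & set(case.get("tags", []))]
```

Returning a boolean rather than raising keeps the eval result available to the caller even when the server is unreachable, which matches the stated goal that eval logic is never blocked by network issues.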

Server

New eval_runs, eval_cases, eval_checks, and eval_datasets tables (SQLite and Postgres).
REST endpoints: POST /api/eval/runs, GET /api/eval/runs, GET /api/eval/runs/{id}, POST /api/eval/datasets, GET /api/eval/datasets, GET /api/eval/datasets/{name}.
Eval runs are excluded from Agent Executions search results by default. An includeEvalRuns flag re-includes them when needed.
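The default-exclusion behaviour can be illustrated with a short Python sketch. The server itself is a Spring Boot application, so this is only a model of the rule, not its implementation. The `Execution` type and its `is_eval_run` field are hypothetical; only the idea of an include flag that defaults to off comes from the description above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Execution:
    execution_id: str
    is_eval_run: bool  # hypothetical marker; the real server derives this itself


def search_executions(
    executions: List[Execution], include_eval_runs: bool = False
) -> List[Execution]:
    """Return production executions only, unless eval runs are requested."""
    if include_eval_runs:
        return list(executions)
    return [e for e in executions if not e.is_eval_run]
```

Defaulting the flag to `False` means existing callers of the search keep seeing production traffic only, without having to change their requests.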

UI — Experiments (new sidebar section)

Eval Runs: table with pass rate bars, pass/fail case counts, search, and agent/result filters. Click any row to see the full run detail — cases, per-check results, agent output, and semantic score with LLM reasoning where applicable.
Datasets: split-panel view listing all pushed datasets on the left; click to inspect individual cases on the right.
Agent Executions: a new Type column distinguishes production runs from eval runs. A "Show eval runs" toggle (next to "Hide sub-agent executions") reveals them when needed.
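For reference, the numbers behind a pass rate bar and the pass/fail case counts could be derived as below. This is a sketch under assumptions: the function name and the returned keys are hypothetical, and treating an empty run as a 0.0 pass rate is a choice made here, not something the PR states.

```python
from typing import Dict, List, Union


def summarize_cases(case_passed: List[bool]) -> Dict[str, Union[int, float]]:
    """Compute pass/fail counts and the pass rate for one eval run."""
    total = len(case_passed)
    passed = sum(1 for ok in case_passed if ok)
    # Guard against division by zero for a run with no cases.
    pass_rate = passed / total if total else 0.0
    return {"passed": passed, "failed": total - passed, "pass_rate": pass_rate}
```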

Tests

Python unit tests covering serialization, session tagging, result posting, tags filtering, and metadata fields — no LLM calls.
Spring Boot integration tests for all eval storage operations.
Playwright E2E tests for the Eval Runs and Datasets pages.

Out of scope (Phase 2)

Triggering an eval run directly from the UI
Creating or editing dataset cases in the UI
Model comparison / benchmarks view


- Python SDK: CorrectnessEval posts EvalSuiteResult to server after each run; EvalSuiteResult/EvalCaseResult/EvalCheckResult gain to_dict() serialization, prompt/output capture per case, strategy/ran_by metadata, name field
- Server: new eval storage layer (schema-eval.sql + schema-eval-postgres.sql) with eval_runs/eval_cases/eval_checks/eval_datasets tables; EvalController exposes POST/GET /api/eval/runs, /api/eval/runs/{id}, /api/eval/datasets and GET /api/eval/datasets/{name}; EvalService persists with @Transactional
- Server: AgentService filters eval runs from production search by default; isEvalRun() handles both JSON and Conductor Map.toString() serialization formats; includeEvalRuns param restores them when requested
- UI: new Experiments section in sidebar with Eval Runs and Datasets pages; EvalRunsList shows pass-rate progress bar, Cases pass/fail badges, stats row, and search/filter bar; EvalRunDetail shows prompt, agent output, semantic score card, strategy/ran_by metadata; DatasetsList/DatasetDetail show read-only dataset cases
- UI: Agent Executions table gains Type column (Production/Eval chips) and "Show eval runs" toggle next to "Hide sub-agent executions"; eval rows shown at reduced opacity; footer links to Experiments → Eval Runs
- Tests: 19 Python unit tests (no LLM), EvalServiceTest for server persistence
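To show why a check like isEvalRun() has to handle two formats, here is a Python sketch of the same idea. It is not a port of the Java method. The marker key `evalRun` is hypothetical, and the fallback parser only covers flat maps, since Java's `Map.toString()` renders entries as `{key=value, other=value}` without quoting.

```python
import json

EVAL_KEY = "evalRun"  # hypothetical marker key, chosen for this sketch


def is_eval_run(raw: str) -> bool:
    """Detect the eval marker in either JSON or Map.toString()-style text."""
    if not raw:
        return False
    try:
        parsed = json.loads(raw)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        parsed = None
    if isinstance(parsed, dict):
        return parsed.get(EVAL_KEY) is True
    # Fallback: strip the outer braces and scan "key=value" entries.
    body = raw.strip()
    if body.startswith("{") and body.endswith("}"):
        body = body[1:-1]
    for entry in body.split(", "):
        key, sep, value = entry.partition("=")
        if sep and key.strip() == EVAL_KEY:
            return value.strip().lower() == "true"
    return False
```

The fallback splits on `", "`, so values that themselves contain that sequence (nested maps or lists) would need a real parser; the sketch accepts that limitation for brevity.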


vishesh-orkes force-pushed the feat/eval-observability-215 branch from 0c6acdb to 8d62ae1 Compare May 19, 2026 13:03

vishesh-orkes changed the title ~~feat(eval): add observability — SDK sends results to server, UI shows…~~ feat(eval): add observability — SDK sends results to server, UI shows them May 19, 2026

vishesh-orkes force-pushed the feat/eval-observability-215 branch from c6aa811 to 2de23d2 Compare May 20, 2026 16:56

vishesh-orkes added 2 commits May 21, 2026 10:27

Merge branch 'main' into feat/eval-observability-215

6d7b920

fix(eval): use short-lived AsyncClient for push_dataset/post_eval_run…

b31a9d8

… to avoid event-loop-closed error

vishesh-orkes marked this pull request as ready for review May 21, 2026 06:53

vishesh-orkes added 2 commits May 21, 2026 12:23

Merge branch 'main' into feat/eval-observability-215

92431c2

Merge branch 'main' into feat/eval-observability-215

a5134a4

vishesh-orkes requested review from deeptireddy-lab and v1r3n May 26, 2026 14:55

Merge branch 'main' into feat/eval-observability-215

d77fc37

vishesh-orkes wants to merge 6 commits into main from feat/eval-observability-215
