feat(eval): add observability — SDK sends results to server, UI shows them #253

Open: vishesh-orkes wants to merge 6 commits into main from feat/eval-observability-215

vishesh-orkes commented May 19, 2026

Eval Observability — Phase 1 (Issue #215)

When running CorrectnessEval from the Python SDK, there was no way to see results beyond the terminal. This PR closes that gap: eval results are automatically persisted and visible in a new Experiments section of the UI, and eval runs are kept separate from production agent traffic.


What's new

Python SDK

  • CorrectnessEval.run() automatically sends the full suite result to the server after each run. If the upload fails, the error is not raised to the caller, so a network problem does not fail the eval itself.
  • Eval results now carry richer metadata: run name, strategy, ran_by, per-case prompt and output, and optional score / reasoning for LLM-judge assertions.
  • runtime.push_dataset(name, cases) stores a named, reusable test dataset on the server.
  • Cases can be filtered by tags when calling run().
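
A minimal sketch of the best-effort upload and tag filtering described above. The helper names (`post_result_best_effort`, `filter_cases_by_tags`), the payload and case shapes, the timeout, and the "match any tag" rule are assumptions for illustration, not the SDK's actual code. Only the `POST /api/eval/runs` path comes from this PR.

```python
import http.client
import json
import logging
import urllib.error
import urllib.request

logger = logging.getLogger(__name__)


def post_result_best_effort(server_url, payload, timeout=5.0):
    """POST a suite result; return True on success, False on any failure.

    Hypothetical helper: errors are logged and swallowed so the caller's
    eval run is never failed by a network problem.
    """
    try:
        request = urllib.request.Request(
            f"{server_url.rstrip('/')}/api/eval/runs",
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError) as exc:
        logger.warning("could not post eval result: %s", exc)
        return False


def filter_cases_by_tags(cases, tags):
    """Keep cases that carry at least one requested tag (assumed rule).

    With no tags requested, every case is kept.
    """
    if not tags:
        return list(cases)
    wanted = set(tags)
    return [case for case in cases if wanted & set(case.get("tags", []))]
```

The bounded timeout matters here: swallowing exceptions alone would still let a hung connection stall the eval, so a sketch like this caps how long the upload may take.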

Server

  • New eval_runs, eval_cases, eval_checks, and eval_datasets tables (SQLite and Postgres).
  • REST endpoints: POST /api/eval/runs, GET /api/eval/runs, GET /api/eval/runs/{id}, POST /api/eval/datasets, GET /api/eval/datasets, GET /api/eval/datasets/{name}.
  • Eval runs are excluded from Agent Executions search results by default. An includeEvalRuns flag re-includes them when needed.
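
The default exclusion can be sketched as follows. The real check lives in the Java AgentService (isEvalRun()); this Python version only illustrates the logic noted in this PR's commit message, where the marker may arrive either as JSON or in Java's Map.toString() form (`{key=value, ...}`). The `evalRun` key name, the `metadata` field, and the record shape are hypothetical.

```python
import json
import re

EVAL_KEY = "evalRun"  # hypothetical marker key, not confirmed by the PR

# Matches "evalRun=true" as a whole entry in a Map.toString()-style string.
_MAP_TO_STRING = re.compile(r"[{,\s]" + re.escape(EVAL_KEY) + r"=true\b")


def is_eval_run(raw):
    """Return True if the serialized metadata marks an eval run."""
    if not isinstance(raw, str) or not raw:
        return False
    try:
        parsed = json.loads(raw)
    except ValueError:
        parsed = None  # not JSON; fall through to the Map.toString() check
    if isinstance(parsed, dict):
        return parsed.get(EVAL_KEY) is True
    return bool(_MAP_TO_STRING.search(raw))


def search_executions(executions, include_eval_runs=False):
    """Drop eval runs unless the caller explicitly asks for them."""
    return [
        execution
        for execution in executions
        if include_eval_runs or not is_eval_run(execution.get("metadata"))
    ]
```

Usage, under the same assumptions: `search_executions(rows)` returns only production rows, and `search_executions(rows, include_eval_runs=True)` returns everything, mirroring the includeEvalRuns flag.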

UI — Experiments (new sidebar section)

  • Eval Runs: table with pass rate bars, pass/fail case counts, search, and agent/result filters. Click any row to see the full run detail — cases, per-check results, agent output, and semantic score with LLM reasoning where applicable.
  • Datasets: split-panel view listing all pushed datasets on the left; click to inspect individual cases on the right.
  • Agent Executions: a new Type column distinguishes production runs from eval runs. A "Show eval runs" toggle (next to "Hide sub-agent executions") reveals them when needed.

Tests

  • Python unit tests covering serialization, session tagging, result posting, tag filtering, and metadata fields, with no LLM calls.
  • Spring Boot integration tests for all eval storage operations.
  • Playwright E2E tests for the Eval Runs and Datasets pages.

Out of scope (Phase 2)

  • Triggering an eval run directly from the UI
  • Creating or editing dataset cases in the UI
  • Model comparison / benchmarks view

[Screenshots: three images captured 2026-05-27 at 8:27 PM]

vishesh-orkes force-pushed the feat/eval-observability-215 branch from 0c6acdb to 8d62ae1 on May 19, 2026 13:03
vishesh-orkes changed the title from "feat(eval): add observability — SDK sends results to server, UI shows…" to "feat(eval): add observability — SDK sends results to server, UI shows them" on May 19, 2026
- Python SDK: CorrectnessEval posts EvalSuiteResult to server after each run;
  EvalSuiteResult/EvalCaseResult/EvalCheckResult gain to_dict() serialization,
  prompt/output capture per case, strategy/ran_by metadata, name field
- Server: new eval storage layer (schema-eval.sql + schema-eval-postgres.sql)
  with eval_runs/eval_cases/eval_checks/eval_datasets tables; EvalController
  exposes POST/GET /api/eval/runs, /api/eval/runs/{id}, /api/eval/datasets
  and GET /api/eval/datasets/{name}; EvalService persists with @Transactional
- Server: AgentService filters eval runs from production search by default;
  isEvalRun() handles both JSON and Conductor Map.toString() serialization
  formats; includeEvalRuns param restores them when requested
- UI: new Experiments section in sidebar with Eval Runs and Datasets pages;
  EvalRunsList shows pass-rate progress bar, Cases pass/fail badges, stats row,
  and search/filter bar; EvalRunDetail shows prompt, agent output, semantic score
  card, strategy/ran_by metadata; DatasetsList/DatasetDetail show read-only
  dataset cases
- UI: Agent Executions table gains Type column (Production/Eval chips) and
  "Show eval runs" toggle next to "Hide sub-agent executions"; eval rows shown
  at reduced opacity; footer links to Experiments → Eval Runs
- Tests: 19 Python unit tests (no LLM), EvalServiceTest for server persistence
vishesh-orkes force-pushed the feat/eval-observability-215 branch from c6aa811 to 2de23d2 on May 20, 2026 16:56
vishesh-orkes marked this pull request as ready for review on May 21, 2026 06:53