feat(eval): add observability — SDK sends results to server, UI shows them #253
Open
vishesh-orkes wants to merge 6 commits into
0c6acdb to 8d62ae1
- Python SDK: CorrectnessEval posts EvalSuiteResult to server after each run;
EvalSuiteResult/EvalCaseResult/EvalCheckResult gain to_dict() serialization,
prompt/output capture per case, strategy/ran_by metadata, name field
- Server: new eval storage layer (schema-eval.sql + schema-eval-postgres.sql)
with eval_runs/eval_cases/eval_checks/eval_datasets tables; EvalController
exposes POST and GET on /api/eval/runs and /api/eval/datasets, plus
GET /api/eval/runs/{id} and GET /api/eval/datasets/{name}; EvalService
persists with @Transactional
- Server: AgentService filters eval runs from production search by default;
isEvalRun() handles both JSON and Conductor Map.toString() serialization
formats; includeEvalRuns param restores them when requested
- UI: new Experiments section in sidebar with Eval Runs and Datasets pages;
EvalRunsList shows pass-rate progress bar, Cases pass/fail badges, stats row,
and search/filter bar; EvalRunDetail shows prompt, agent output, semantic score
card, strategy/ran_by metadata; DatasetsList/DatasetDetail show read-only
dataset cases
- UI: Agent Executions table gains Type column (Production/Eval chips) and
"Show eval runs" toggle next to "Hide sub-agent executions"; eval rows shown
at reduced opacity; footer links to Experiments → Eval Runs
- Tests: 19 Python unit tests (no LLM), EvalServiceTest for server persistence
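To illustrate the SDK side of the commit message above, here is a minimal sketch of nested to_dict() serialization followed by a best-effort POST to /api/eval/runs. It is not the code from this PR. The class names, the name/strategy/ran_by/prompt/output/score/reasoning fields, and the endpoint path come from the description; the passed and checks fields, the payload layout, the post_suite_result and _http_post helpers, and the base URL are assumptions made for the example.

```python
import json
import urllib.request
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class EvalCheckResult:
    name: str
    passed: bool                        # assumed field, not named in the PR text
    score: Optional[float] = None       # only set for LLM-judge assertions
    reasoning: Optional[str] = None

    def to_dict(self) -> dict:
        d = {"name": self.name, "passed": self.passed}
        # Optional judge fields are omitted when absent.
        if self.score is not None:
            d["score"] = self.score
        if self.reasoning is not None:
            d["reasoning"] = self.reasoning
        return d


@dataclass
class EvalCaseResult:
    name: str
    prompt: str
    output: str
    checks: List[EvalCheckResult] = field(default_factory=list)

    @property
    def passed(self) -> bool:
        # A case passes only if every check passed.
        return all(c.passed for c in self.checks)

    def to_dict(self) -> dict:
        return {
            "name": self.name,
            "prompt": self.prompt,
            "output": self.output,
            "passed": self.passed,
            "checks": [c.to_dict() for c in self.checks],
        }


@dataclass
class EvalSuiteResult:
    name: str
    strategy: str
    ran_by: str
    cases: List[EvalCaseResult] = field(default_factory=list)

    def to_dict(self) -> dict:
        return {
            "name": self.name,
            "strategy": self.strategy,
            "ran_by": self.ran_by,
            "cases": [c.to_dict() for c in self.cases],
        }


def _http_post(url: str, body: bytes) -> None:
    # Plain stdlib POST; raises on connection errors and HTTP error statuses.
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )
    with urllib.request.urlopen(req, timeout=5):
        pass


def post_suite_result(
    result: EvalSuiteResult,
    base_url: str,
    send: Callable[[str, bytes], None] = _http_post,
) -> bool:
    """Best-effort upload: returns False instead of raising, so the eval
    run itself is never interrupted by a network problem."""
    try:
        body = json.dumps(result.to_dict()).encode("utf-8")
        send(base_url.rstrip("/") + "/api/eval/runs", body)
        return True
    except Exception:
        return False
```

The send parameter exists only so the upload path can be exercised without a running server, for example by passing a function that records its arguments. A real implementation would probably also log the swallowed exception.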
c6aa811 to 2de23d2
… to avoid event-loop-closed error
Eval Observability — Phase 1 (Issue #215)
When running CorrectnessEval from the Python SDK, there was no way to see results beyond the terminal. This PR closes that gap: eval results are automatically persisted and visible in a new Experiments section of the UI, and eval runs are kept separate from production agent traffic.
What's new
Python SDK
- CorrectnessEval.run() automatically sends the full suite result to the server after each run. Failures don't surface to the caller, so eval logic is never blocked by network issues.
- Results carry name, strategy, ran_by, per-case prompt and output, and optional score/reasoning for LLM-judge assertions.
- runtime.push_dataset(name, cases) stores a named, reusable test dataset on the server.
- tags can be passed when calling run().
Server
- eval_runs, eval_cases, eval_checks, and eval_datasets tables (SQLite and Postgres).
- POST /api/eval/runs, GET /api/eval/runs, GET /api/eval/runs/{id}, POST /api/eval/datasets, GET /api/eval/datasets, GET /api/eval/datasets/{name}.
- Eval runs are filtered out of production search by default; the includeEvalRuns flag re-includes them when needed.
UI — Experiments (new sidebar section)
Tests
Out of scope (Phase 2)