Examples of evaluating AgentSpan-hosted agents. Each subfolder is one
self-contained eval pattern — clone, cd in, follow the folder's README.

| Folder | What it shows |
|---|---|
| `correctnessEval/` | Built-in `CorrectnessEval`: batched, dataset-style runs. Tool-usage and output-contents checks, plus an LLM-as-judge semantic check for adversarial cases |
| `assertionsEval/` | Fine-grained, imperative assertions on a single run: tool used, tool order, output regex, max turns, no errors |
| `expectEval/` | Same single-run checks as `assertionsEval/`, but using the fluent `expect(result).used_tool(...).output_contains(...)...` chain |
| `multiagentEval/` | Evaluates a 3-agent pipeline (`researcher >> writer >> editor`) over a topic dataset. Catches end-to-end coherence issues like "the editor strips key facts the researcher mentioned" |
| `semanticEval/` | LLM-as-judge checks on a single run: scores the output against natural-language criteria like "friendly tone", "covers both topics", "concise" |
| `mockEval/` | `mock_run` with a scripted event sequence: no LLM, no server, runs in <1s. Ideal for unit-test-style assertion coverage and regression checks |

More to come (`leaderboardEval/`, `humanFeedbackEval/`).
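
To give a feel for the fluent style that `expectEval/` describes, here is a minimal, self-contained sketch of a chained checker. It does not import or reproduce the AgentSpan SDK: `FakeResult`, the `Expect` class, and `output_matches` are hypothetical stand-ins written for illustration, and only the method names `used_tool` and `output_contains` are borrowed from the chain shown in the table. See the folder's README for the real API.

```python
import re
from dataclasses import dataclass, field


@dataclass
class FakeResult:
    """Hypothetical stand-in for an agent run result (not the SDK's type)."""
    output: str
    tool_calls: list = field(default_factory=list)  # tool names, in call order


class Expect:
    """Toy fluent checker: each method asserts, then returns self for chaining."""

    def __init__(self, result):
        self.result = result

    def used_tool(self, name):
        assert name in self.result.tool_calls, f"tool {name!r} was never called"
        return self

    def output_contains(self, text):
        assert text in self.result.output, f"output lacks {text!r}"
        return self

    def output_matches(self, pattern):
        # Hypothetical helper mirroring the "output regex" check in the table.
        assert re.search(pattern, self.result.output), f"no match for {pattern!r}"
        return self


def expect(result):
    return Expect(result)


# A scripted result, so the checks run without an LLM or a server.
result = FakeResult(output="It is 21 C in Paris.", tool_calls=["get_weather"])
expect(result).used_tool("get_weather").output_contains("Paris").output_matches(r"\d+ C")
```

Because every method returns `self`, the first failing check raises `AssertionError` and stops the chain, which keeps a single-run eval readable as one expression.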

Prerequisites:

- AgentSpan server running locally on `:6767` (or wherever `CONDUCTOR_SERVER_URL` points)
- `agentspan` Python SDK installed
- An LLM API key set per example (`OPENAI_API_KEY`, etc.)
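
The requirements above can be checked before running an example with a small preflight script. This is a sketch under stated assumptions: the fallback URL `http://localhost:6767` (scheme and host are guesses based on the port above), the import name `agentspan`, and the use of `OPENAI_API_KEY` as the required key are all assumptions, and the `preflight` helper itself is hypothetical. It only inspects the local environment and does not contact the server.

```python
import importlib.util
import os


def preflight(env=None, module="agentspan", key_var="OPENAI_API_KEY"):
    """Return (server_url, problems) for the local setup.

    Assumptions: `module` is the SDK's import name, `key_var` is the API key
    the chosen example needs, and the default URL matches a local server.
    """
    env = os.environ if env is None else env
    problems = []

    # Fall back to the local port mentioned in the list above.
    server_url = env.get("CONDUCTOR_SERVER_URL", "http://localhost:6767")

    # find_spec returns None when a top-level module is not installed.
    if importlib.util.find_spec(module) is None:
        problems.append(f"Python module {module!r} is not importable")

    if not env.get(key_var):
        problems.append(f"{key_var} is not set")

    return server_url, problems


if __name__ == "__main__":
    url, problems = preflight()
    print(f"Server URL: {url}")
    for problem in problems:
        print(f"Missing: {problem}")
```

Swap `key_var` for whichever provider key a given example expects.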