agentspan-eval

Examples of evaluating AgentSpan-hosted agents. Each subfolder is one self-contained eval pattern: clone the repo, cd into a folder, and follow that folder's README.

Examples

Folder	What it shows
`correctnessEval/`	Built-in `CorrectnessEval` — batched, dataset-style runs. Tool-usage + output-contents + LLM-as-judge semantic check for adversarial cases
`assertionsEval/`	Fine-grained, imperative assertions on a single run. Tool-used / order / output regex / max-turns / no-errors
`expectEval/`	Same single-run checks as assertionsEval, but using the fluent `expect(result).used_tool(...).output_contains(...)...` chain
`multiagentEval/`	Evaluates a 3-agent pipeline (`researcher >> writer >> editor`) over a topic dataset. Catches end-to-end coherence issues like "the editor strips key facts the researcher mentioned"
`semanticEval/`	LLM-as-judge checks on a single run — scores the output against natural-language criteria like "friendly tone", "covers both topics", "concise"
`mockEval/`	`mock_run` with a scripted event sequence — no LLM, no server, runs in <1s. Ideal for unit-test-style assertion coverage and regression checks

More to come (leaderboardEval/, humanFeedbackEval/).
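
To make the single-run checks from the table concrete (tool used, order, output regex, max turns, no errors), here is a minimal, self-contained sketch of a fluent assertion chain. It does not import or reproduce the agentspan SDK: `RunResult`, `Expect`, and the methods `tools_in_order`, `output_matches`, `max_turns`, and `no_errors` are hypothetical names invented for illustration. Only `expect(...)`, `used_tool`, and `output_contains` echo names that appear above, and their behavior here is an assumption. See each folder's README for the real API.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field


@dataclass
class RunResult:
    """Hypothetical stand-in for one recorded agent run (not the SDK's type)."""
    tool_calls: list[str] = field(default_factory=list)
    output: str = ""
    turns: int = 0
    errors: list[str] = field(default_factory=list)


class Expect:
    """Illustrative fluent wrapper: each check raises or returns self."""

    def __init__(self, result: RunResult) -> None:
        self.result = result

    def _check(self, ok: bool, message: str) -> "Expect":
        if not ok:
            raise AssertionError(message)
        return self

    def used_tool(self, name: str) -> "Expect":
        return self._check(name in self.result.tool_calls,
                           f"tool {name!r} was never called")

    def tools_in_order(self, *names: str) -> "Expect":
        # Subsequence check: `in` on an iterator consumes it up to the match,
        # so each name must appear after the previous one.
        remaining = iter(self.result.tool_calls)
        ok = all(name in remaining for name in names)
        return self._check(ok, f"tools not called in order {names}")

    def output_contains(self, text: str) -> "Expect":
        return self._check(text in self.result.output,
                           f"output does not contain {text!r}")

    def output_matches(self, pattern: str) -> "Expect":
        return self._check(re.search(pattern, self.result.output) is not None,
                           f"output does not match /{pattern}/")

    def max_turns(self, limit: int) -> "Expect":
        return self._check(self.result.turns <= limit,
                           f"run took {self.result.turns} turns, limit {limit}")

    def no_errors(self) -> "Expect":
        return self._check(not self.result.errors,
                           f"run reported errors: {self.result.errors}")


def expect(result: RunResult) -> Expect:
    return Expect(result)


# A scripted result, in the spirit of mockEval/: no LLM and no server involved.
scripted = RunResult(
    tool_calls=["search", "summarize"],
    output="Paris is the capital of France.",
    turns=3,
)

(expect(scripted)
    .used_tool("search")
    .tools_in_order("search", "summarize")
    .output_contains("Paris")
    .output_matches(r"capital of \w+")
    .max_turns(5)
    .no_errors())
```

Because every check returns the wrapper, the first failing link raises an `AssertionError` with a specific message, which is what makes the chain convenient inside a unit test.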

Prerequisites

AgentSpan server running locally on :6767 (or wherever CONDUCTOR_SERVER_URL points)
agentspan Python SDK installed
An LLM API key set per example (OPENAI_API_KEY, etc.)
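
The three requirements above can be checked before running an example with a small preflight script. This is a sketch under stated assumptions, not part of the repo: the default URL `http://localhost:6767` assumes plain HTTP on localhost (the text above only gives the port), the import name `agentspan` is assumed to match the SDK's package name, and `OPENAI_API_KEY` is just the one key named above, so pass whichever key your example needs.

```python
import importlib.util
import os
import socket
from urllib.parse import urlparse

# Assumption: plain HTTP on localhost; the README only specifies port 6767.
DEFAULT_SERVER_URL = "http://localhost:6767"


def resolve_server_url(env=os.environ):
    """Use CONDUCTOR_SERVER_URL if set, otherwise the assumed local default."""
    return env.get("CONDUCTOR_SERVER_URL") or DEFAULT_SERVER_URL


def missing_keys(env=os.environ, key_names=("OPENAI_API_KEY",)):
    """Return the names of required API-key variables that are unset or empty."""
    return [name for name in key_names if not env.get(name)]


def sdk_installed(module_name="agentspan"):
    """Check that a module is importable. The name 'agentspan' is an assumption."""
    return importlib.util.find_spec(module_name) is not None


def server_reachable(url, timeout=1.0):
    """Return True if a TCP connection to the URL's host and port succeeds."""
    parsed = urlparse(url)
    if parsed.hostname is None:
        return False
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        with socket.create_connection((parsed.hostname, port), timeout=timeout):
            return True
    except OSError:
        return False


def preflight(env=os.environ):
    """Collect human-readable problems; an empty list means ready to run."""
    problems = [f"{name} is not set" for name in missing_keys(env)]
    if not sdk_installed():
        problems.append("agentspan module not importable")
    url = resolve_server_url(env)
    if not server_reachable(url):
        problems.append(f"no server reachable at {url}")
    return problems
```

Calling `preflight()` returns a list of problems. The TCP check only confirms that something is listening on the port, not that it is an AgentSpan server.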
