Evaluation infrastructure for agent products.
Use it to wrap the real workflow your users run, record what happened, verify the result, turn feedback into replay data, compare variants, and ship only when the evidence improves.
```
product task
  -> observe state
  -> validate with deterministic gates first
  -> act through the real product adapter
  -> trace + feedback trajectory
  -> replay / optimize / release gate
```

agent-eval does not own product state, credentials, UI, storage, model routing, browser drivers, sandbox policy, or deployment. Products own those. This package owns eval contracts, loop mechanics, traces, statistics, optimization inputs, and release evidence.
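In practice that boundary is a thin product adapter the control loop calls into; the quickstart below uses one. A minimal sketch of the shape it might take, with illustrative names that are not part of the package API:

```ts
// Illustrative only: agent-eval does not export these types. The product owns
// everything behind this surface (state, credentials, storage, sandbox,
// deployment); the eval loop only observes, acts, and stores results.
interface ProductState {
  build: { exitCode: number }
  preview: { httpStatus: number }
}

type AgentAction = { type: 'repair'; failed: string[] }

interface ProductAdapter {
  readState(taskId: string): Promise<ProductState>
  runAgentStep(taskId: string, action: AgentAction): Promise<void>
  storeEvalResult(taskId: string, result: unknown): Promise<void>
}
```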
```
pnpm add @tangle-network/agent-eval
```

```ts
import {
  objectiveEval,
  runAgentControlLoop,
} from '@tangle-network/agent-eval/control'

const result = await runAgentControlLoop({
  intent: task.prompt,
  budget: { maxSteps: 8, maxWallMs: 180_000, maxCostUsd: 2 },
  observe() {
    return product.readState(task.id)
  },
  validate({ state }) {
    return [
      objectiveEval({
        id: 'build-passes',
        passed: state.build.exitCode === 0,
        severity: 'critical',
        metadata: state.build,
      }),
      objectiveEval({
        id: 'preview-serves',
        passed: state.preview.httpStatus === 200,
        severity: 'critical',
      }),
    ]
  },
  decide({ evals }) {
    const failed = evals.filter((e) => !e.passed)
    if (failed.length === 0) {
      return { type: 'stop', pass: true, reason: 'all gates passed' }
    }
    return {
      type: 'continue',
      action: { type: 'repair', failed: failed.map((e) => e.id) },
      reason: 'repair failed gates',
    }
  },
  act(action) {
    return product.runAgentStep(task.id, action)
  },
})

await product.storeEvalResult(task.id, result)
```

That loop should be the same shape in production, replay, benchmark, and optimization. Swap dependencies behind observe() and act(), not the eval contract itself.
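For replay, only the wiring behind those two hooks changes. A minimal sketch, assuming a replay harness that supplies `task` and a `recordedStates` array captured from an earlier run (both names are illustrative, not package exports):

```ts
import { objectiveEval, runAgentControlLoop } from '@tangle-network/agent-eval/control'

// Assumed to come from your replay harness: the original task plus the
// sequence of product states captured on the recorded run.
let step = 0

const replayResult = await runAgentControlLoop({
  intent: task.prompt,
  budget: { maxSteps: recordedStates.length, maxWallMs: 60_000 },
  observe() {
    // Replay the captured states instead of reading live product state.
    return recordedStates[Math.min(step, recordedStates.length - 1)]
  },
  validate({ state }) {
    // Identical gates to production: the eval contract does not change.
    return [
      objectiveEval({ id: 'build-passes', passed: state.build.exitCode === 0, severity: 'critical' }),
      objectiveEval({ id: 'preview-serves', passed: state.preview.httpStatus === 200, severity: 'critical' }),
    ]
  },
  decide({ evals }) {
    const failed = evals.filter((e) => !e.passed)
    return failed.length === 0
      ? { type: 'stop', pass: true, reason: 'all gates passed' }
      : { type: 'continue', action: { type: 'repair', failed: failed.map((e) => e.id) }, reason: 'repair failed gates' }
  },
  act() {
    // No live side effects during replay; just advance through the recording.
    step += 1
  },
})
```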
The root export remains available, but new code should prefer focused subpaths:
```ts
import { runAgentControlLoop } from '@tangle-network/agent-eval/control'
import { runMultiShotOptimization } from '@tangle-network/agent-eval/optimization'
import { TraceEmitter } from '@tangle-network/agent-eval/traces'
import { renderReleaseReport } from '@tangle-network/agent-eval/reporting'
```

| Subpath | Use for |
|---|---|
| @tangle-network/agent-eval/control | observe -> validate -> decide -> act, action policy, propose/review loops |
| @tangle-network/agent-eval/traces | trace stores, emitters, TraceAnalyst |
| @tangle-network/agent-eval/optimization | feedback trajectories, multi-shot optimization, prompt evolution |
| @tangle-network/agent-eval/reporting | release confidence, paired stats, report/table/chart specs |
| @tangle-network/agent-eval/wire | HTTP/RPC judge server and schemas |
| @tangle-network/agent-eval/benchmarks | benchmark adapter contracts and reference wrappers |
| Need | Use |
|---|---|
| Keep an agent working until objective state passes | runAgentControlLoop |
| Turn user/reviewer feedback into replay data | FeedbackTrajectory |
| Compare prompt/tool/retrieval policies over full trajectories | runMultiShotOptimization |
| Gate releases with paired evidence and holdouts | evaluateReleaseConfidence, HeldOutGate |
| Explain regressions across trace corpora | TraceAnalyst / analyzeTraces |
| Report a launch decision | renderReleaseReport, researchReport, summaryTable, paretoChart, gainHistogram |
| Capture every provider HTTP request / response for forensics | RawProviderSink, LlmClientOptions.rawSink |
| Fail loud if an eval would silently use the wrong route | assertLlmRoute |
| Assert at run-end that the artifact is complete | assertRunCaptured, throwIfRunIncomplete |
| Auto-execute the trace analyst on every run | traceAnalystOnRunComplete + TraceEmitterOptions.onRunComplete |
| Model missing context separately from bad reasoning | KnowledgeRequirement, KnowledgeBundle |
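Taking the release-gate row as an example, the intended workflow is a paired comparison on the same task set before promotion. The following is a sketch under assumptions: the option and result field names passed to evaluateReleaseConfidence are illustrative, not the package's documented signature, and promoteVariant is a hypothetical product method.

```ts
// Sketch only: every option and field name below is an assumption, not the
// documented API. It illustrates the shape of the gate: compare candidate vs.
// baseline results on the same tasks, promote only when the paired evidence passes.
import { evaluateReleaseConfidence } from '@tangle-network/agent-eval/reporting'

const confidence = await evaluateReleaseConfidence({
  baseline: baselineRuns,   // per-task results from the shipped variant (assumed shape)
  candidate: candidateRuns, // per-task results from the proposed variant (assumed shape)
})

if (confidence.pass) {
  // Product-owned step: agent-eval produces the evidence, the product ships the change.
  await product.promoteVariant('candidate') // hypothetical product method
}
```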
Launch-grade benchmark runs need four things that are easy to forget in glue code: (1) raw HTTP capture alongside the structured spans so a reviewer can verify which route answered, (2) a preflight assertion that the configured client points at the intended provider, (3) a run-end assertion that the expected events were actually written, and (4) auto-execution of the trace analyst as part of the run lifecycle. The wiring fits in a few lines:
```ts
import {
  TraceEmitter, FileSystemRawProviderSink, callLlm, assertLlmRoute,
  assertRunCaptured, throwIfRunIncomplete,
} from '@tangle-network/agent-eval'
import { traceAnalystOnRunComplete } from '@tangle-network/agent-eval/traces'

const sink = new FileSystemRawProviderSink({ dir: `${workDir}/raw-events` })
assertLlmRoute(llmOpts, { requireExplicitBaseUrl: true, allowedBaseUrls, requireAuth: true })

const emitter = new TraceEmitter(store, {
  onRunComplete: [traceAnalystOnRunComplete({ analyze: analystOpts, save })],
})
await emitter.startRun(/* ... */)
// LLM calls flow through callLlm with `{ rawSink: sink, traceContext: { runId, spanId } }`.
await emitter.endRun({ pass, score })

throwIfRunIncomplete(await assertRunCaptured(store, emitter.runId, {
  llmSpansMin: 1, rawSink: sink, requireRawCoverageOfLlmSpans: true, requireOutcome: true,
}))
```

Directives, rationale, and shipped-bug context are in SKILL.md § Capture integrity.
Runnable examples live in examples/:

- examples/multi-shot-optimization: optimize full trajectories with held-out promotion.
- examples/same-sandbox-harness: run setup/build/test and evidence checks in one workspace.
- examples/benchmarks: benchmark adapter shape and reference wrappers.
Read in this order:
- Product Eval Adoption
- Control Runtime
- Feedback Trajectories
- Multi-Shot Optimization
- Trace Analysis
- Knowledge Readiness
- Integration Launch Gates
- Wire Protocol
To run the judge server from the CLI:

```
npm i -g @tangle-network/agent-eval
agent-eval serve --port 5005
```

The Python client lives in clients/python:

```
cd clients/python
pip install -e .
```

To work on the package itself:

```
pnpm install
pnpm typecheck
pnpm test
pnpm build
pnpm openapi
```

Related packages:

- @tangle-network/agent-runtime: production session/runtime layer.
- @tangle-network/agent-knowledge: source-grounded knowledge bases and readiness.
- @tangle-network/agent-integrations: connection, grant, capability, and integration invocation contracts.
MIT