Integrate AstaBench for Agent Evaluation #12

@wonderwomancode

Overview

Integrate AstaBench from AllenAI to evaluate and benchmark agent performance.

Why AstaBench

  • Open source agent evaluation framework (built on InspectAI)
  • Not limited to scientific research; it can evaluate any agent
  • Provides baseline comparisons
  • Tracks cost efficiency alongside quality

GitHub Repos

Custom Benchmarks to Create

| Benchmark | What It Measures |
| --- | --- |
| Handoff Latency | Time from Senku delegation → Yusuke completion |
| Context Preservation | Does handoff lose important context? |
| Parallel Efficiency | Do 3 agents finish 3x faster than 1? |
| Error Recovery | Can agents recover when one fails? |
| Cost Efficiency | Tokens used vs task complexity |
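
To make the first three metrics concrete, here is a minimal Python sketch of how they might be computed from recorded handoff data. It does not use any AstaBench API. The `Handoff` record, its field names, and the idea of tracking context as a set of named facts are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Handoff:
    """Hypothetical record of one delegation between two agents."""
    delegated_at: float       # seconds; when the delegating agent handed off
    completed_at: float       # seconds; when the receiving agent finished
    sent_facts: set[str]      # context items the delegator held
    received_facts: set[str]  # context items the receiver could see


def handoff_latency(h: Handoff) -> float:
    """Seconds between delegation and completion."""
    return h.completed_at - h.delegated_at


def context_preserved(h: Handoff) -> float:
    """Fraction of the delegator's context items that survived the handoff."""
    if not h.sent_facts:
        return 1.0  # nothing to lose
    return len(h.sent_facts & h.received_facts) / len(h.sent_facts)


def parallel_efficiency(single_seconds: float, parallel_seconds: float, n_agents: int) -> float:
    """Speedup divided by agent count; 1.0 means n agents ran n times faster."""
    return (single_seconds / parallel_seconds) / n_agents
```

For example, a handoff delegated at t=10.0 and completed at t=12.5 has a latency of 2.5 seconds, and three agents that finish a 90-second task in 30 seconds score a parallel efficiency of 1.0. Anything below 1.0 indicates coordination overhead.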

Acceptance Criteria

  • Install AstaBench: pip install astabench
  • Create custom benchmark suite for handoffs
  • Create benchmark for context preservation
  • Create benchmark for parallel efficiency
  • Run baseline evaluation on current agent setup
  • Document results and improvement areas
  • Integrate with CI for regression testing
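
One possible shape for the CI regression check is a small comparison of the current run against the stored baseline. The sketch below is an assumption about how that could look, not part of AstaBench: the function name, the metric names, and the 10% tolerance are hypothetical, and it treats every metric as lower-is-better (latency, token count).

```python
def find_regressions(baseline: dict, current: dict, tolerance: float = 0.10) -> list:
    """Return messages for lower-is-better metrics that worsened by more than `tolerance`."""
    problems = []
    for name, base in baseline.items():
        if name not in current:
            # A metric that disappeared is treated as a failure, not silently skipped.
            problems.append(f"{name}: missing from current run")
        elif current[name] > base * (1 + tolerance):
            problems.append(f"{name}: {base} -> {current[name]}")
    return problems
```

A CI step could call this with the baseline and latest results and exit with a non-zero status when the returned list is non-empty.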

Example Usage

# Run evaluation
astabench eval --agent senku --benchmark handoff-latency

# View results
astabench view results/

Assigned Agent

@quinn - Testing and evaluation is your domain

Related

  • Langfuse integration (for real-time monitoring)
  • NATS KV (for memory that benchmarks test)
