[Epic] CUGA Evaluation Framework - Accuracy, Consistency & Policy Compliance #239

@adiasaf

What You Want

Establish a comprehensive evaluation framework for CUGA to ensure the agent meets production-grade standards for accuracy, consistency, and policy compliance across diverse workloads and configurations.

Success Criteria

Primary KPIs:

  • ≥90% pass rate on policy-controlled task executions
  • 99% safety compliance (specific KPI to be defined)
  • Validated on open-source/open-model configurations
  • Tested on public benchmarks + real customer datasets
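
The two numeric KPIs above could be checked against a batch of evaluation results along these lines. This is only a minimal sketch: `TaskResult`, `rate`, and `check_kpis` are hypothetical names rather than part of CUGA, and because the safety KPI is still to be defined, it is approximated here as the share of runs with no observed policy violation.

```python
from dataclasses import dataclass
from typing import Iterable

# Thresholds taken from the Primary KPIs list in this issue.
PASS_RATE_TARGET = 0.90  # >=90% pass rate on policy-controlled tasks
SAFETY_TARGET = 0.99     # 99% safety compliance (exact KPI still to be defined)


@dataclass
class TaskResult:
    """Hypothetical record for one evaluated task run."""
    task_id: str
    passed: bool            # the task was completed correctly
    policy_compliant: bool  # no policy/safety violation was observed


def rate(flags: Iterable[bool]) -> float:
    """Return the fraction of True values; reject empty input."""
    flags = list(flags)
    if not flags:
        raise ValueError("no results to evaluate")
    return sum(flags) / len(flags)


def check_kpis(results: list[TaskResult]) -> dict:
    """Compute both rates and compare them against the targets."""
    pass_rate = rate(r.passed for r in results)
    safety_rate = rate(r.policy_compliant for r in results)
    return {
        "pass_rate": pass_rate,
        "safety_rate": safety_rate,
        "pass_rate_ok": pass_rate >= PASS_RATE_TARGET,
        "safety_ok": safety_rate >= SAFETY_TARGET,
    }
```

Keeping the two rates separate means a run set can meet the pass-rate target and still fail on safety, which matches how the criteria are listed.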

Initial Evaluation Targets

  1. Public Benchmarks

    • AppWorld: Demonstrate that CUGA outperforms the ReAct baseline
    • Vakra: Demonstrate that CUGA outperforms the ReAct baseline

  2. Real-World Validation

    • Evaluate on at least two real customer datasets
    • Measure performance across different domains and use cases

  3. Future Benchmark Considerations

    • Tau: May be added as an additional evaluation benchmark

Benchmarks

Metadata

Labels

  • component: evaluation (Benchmarks and evaluation)
  • component: policies (Agent policies and guardrails)
  • enhancement (New feature or request)
  • needs-triage (Newly filed, not yet reviewed)
  • type: epic (Large body of work grouping multiple issues)

Projects

Status

In progress
