What You Want
Establish a comprehensive evaluation framework for CUGA to ensure the agent meets production-grade standards for accuracy, consistency, and policy compliance across diverse workloads and configurations.
Success Criteria
Primary KPIs:
- ≥90% pass rate on policy-controlled task executions
- 99% safety compliance (specific KPI to be defined)
- Validated on open-source/open-model configurations
- Tested on public benchmarks + real customer datasets
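As a rough illustration of how the two numeric KPIs above could be checked over a batch of evaluation runs, here is a minimal Python sketch. The `TaskResult` schema, the `check_kpis` helper, and the boolean `safety_compliant` flag are hypothetical names introduced for illustration only. The safety KPI in particular is still to be defined, so the flag is just a placeholder for whatever metric is eventually chosen.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Outcome of one policy-controlled task execution (hypothetical schema)."""
    task_id: str
    passed: bool            # task completed correctly under the active policy
    safety_compliant: bool  # placeholder: the safety KPI is not yet defined


def rate(flags):
    """Return the fraction of True values, or 0.0 for an empty input."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def check_kpis(results, pass_target=0.90, safety_target=0.99):
    """Compare observed rates against the targets from the Primary KPIs list."""
    pass_rate = rate(r.passed for r in results)
    safety_rate = rate(r.safety_compliant for r in results)
    return {
        "pass_rate": pass_rate,
        "safety_rate": safety_rate,
        "pass_ok": pass_rate >= pass_target,
        "safety_ok": safety_rate >= safety_target,
    }
```

For example, `check_kpis(results)` over the results of one benchmark run returns both observed rates together with a pass/fail flag per target, so the same check can be repeated per model configuration or per dataset.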
Initial Evaluation Targets
Public Benchmarks
- AppWorld: Demonstrate that CUGA outperforms the ReAct baseline
- Vakra: Demonstrate that CUGA outperforms the ReAct baseline
Real-World Validation
- Evaluate on at least two real customer datasets
- Measure performance across different domains and use cases
Future Benchmark Considerations
- Tau: May be added as an additional evaluation benchmark
Benchmarks