A unified dashboard and API for exploring and comparing web-based AI agent evaluation benchmarks. Aggregates 4,725+ tasks from 10 major benchmarks into a single interface.
📚 Learn more: Browser Automation & Web Evals Overview
- GAIA (466 tasks) - General AI Assistant benchmark with 3 difficulty levels
- Mind2Web (1,009 tasks) - Real-world website interaction tasks
- Mind2Web2 (130 tasks) - Updated version with domain categorization
- BrowseComp (1,266 tasks) - Web browsing comprehension tasks
- WebArena (812 tasks) - Realistic web navigation scenarios
- WebVoyager (643 tasks) - Long-horizon web navigation tasks
- REAL (113 tasks) - Real-world web agent challenges with difficulty ratings
- Bearcubs (111 tasks) - Web agent evaluation tasks
- Agent-Company (175 tasks) - Domain-specific company tasks
- OSWorld (400+ tasks) - Desktop application automation (Chrome, GIMP, LibreOffice, VS Code, etc.)
- Interactive Streamlit Dashboard - Filter, sort, and explore tasks
- REST API - Programmatic access with full filtering and pagination
- Unified Schema - Normalized data structure across all benchmarks
- Advanced Filtering - By benchmark, difficulty, domain, website/app, and more
- Task Search - Full-text search across task descriptions
```bash
# Install dependencies
pip install -r requirements.txt

# Launch the Streamlit dashboard
streamlit run main.py

# Or start the FastAPI server
python api.py
```
Interactive API docs are available at http://localhost:8000/docs.

```bash
# Get all GAIA Level 1 tasks
curl "http://localhost:8000/tasks?benchmark=gaia&Level=1.0"

# Search for tasks about email
curl "http://localhost:8000/search?q=email&limit=10"

# Get WebArena tasks with pagination
curl "http://localhost:8000/tasks?benchmark=webarena&limit=50&offset=0"

# Get a task by ID
curl "http://localhost:8000/tasks/4480"
```

```
├── main.py              # Streamlit dashboard
├── api.py               # FastAPI REST API
├── shared.py            # Data loading & normalization
├── requirements.txt     # Python dependencies
├── GAIA/                # GAIA benchmark data
├── WebVoyager/          # WebVoyager benchmark data
├── webarena/            # WebArena benchmark data
├── real-evals-agi/      # REAL benchmark data
├── OsWorld/             # OSWorld benchmark data
├── mind2web2/           # Mind2Web2 benchmark data
├── agent-company/       # Agent-Company benchmark data
├── bearcubs/            # Bearcubs benchmark data
└── openai-simple-evals/ # BrowseComp benchmark data
```
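The same queries can also be issued from Python. Below is a minimal sketch using only the standard library; the endpoint paths and parameters mirror the curl examples above, while the `tasks_url` and `fetch_tasks` helpers are illustrative names, not part of the project:

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:8000"  # assumes the FastAPI server started via `python api.py`

def tasks_url(**params):
    """Build a /tasks query URL from keyword filters (e.g. benchmark, limit)."""
    query = urllib.parse.urlencode(params)
    return f"{BASE_URL}/tasks?{query}"

def fetch_tasks(**params):
    """GET the /tasks endpoint and decode the JSON response (requires a running server)."""
    with urllib.request.urlopen(tasks_url(**params)) as resp:
        return json.load(resp)

# Equivalent to the first curl example above:
print(tasks_url(benchmark="gaia", Level="1.0"))
# http://localhost:8000/tasks?benchmark=gaia&Level=1.0
```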
All tasks are normalized to a common format:

- `task_id` - Unique identifier
- `Question` - Task description/instruction
- `benchmark` - Source benchmark name
- `web_name` - Target website/application
- `domain`/`subdomain` - Task categorization (when available)
- `difficulty`/`Level` - Task difficulty (benchmark-specific)
- `web_url` - Starting URL (when available)
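As an illustration, a normalized record might look like the following (all field values here are hypothetical), and filtering the unified task list reduces to matching on these keys:

```python
# A hypothetical task record in the unified schema described above.
task = {
    "task_id": "4480",
    "Question": "Find the contact email on the company's About page.",
    "benchmark": "webarena",
    "web_name": "example-shop",
    "domain": "e-commerce",
    "difficulty": None,  # benchmark-specific; absent for some benchmarks
    "web_url": "http://localhost:7770/",
}

def filter_tasks(tasks, **criteria):
    """Keep tasks whose fields match every given criterion."""
    return [
        t for t in tasks
        if all(t.get(key) == value for key, value in criteria.items())
    ]

# Example: select WebArena tasks from a list of normalized records.
webarena_tasks = filter_tasks([task], benchmark="webarena")
```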
To add a new benchmark:

- Add the data to an appropriate directory
- Update the `DATASETS` config in `shared.py`, implementing a loader function if needed
- Add normalization logic in `normalize_task_data()`
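A new loader and its normalization step might follow this shape (a sketch only: the benchmark name, file layout, and raw field names are hypothetical, and the actual `DATASETS` structure and `normalize_task_data()` signature in `shared.py` may differ):

```python
import json
from pathlib import Path

def load_mybenchmark(data_dir):
    """Hypothetical loader: read raw task records from a JSONL file."""
    tasks = []
    with open(Path(data_dir) / "tasks.jsonl") as f:
        for line in f:
            tasks.append(json.loads(line))
    return tasks

def normalize_mybenchmark(raw):
    """Map one raw record onto the unified schema fields."""
    return {
        "task_id": str(raw["id"]),
        "Question": raw["instruction"],
        "benchmark": "mybenchmark",
        "web_name": raw.get("site"),
        "web_url": raw.get("start_url"),
    }
```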
Data sources retain their original licenses. See individual benchmark repositories for details.
