A unified dashboard and API for exploring and comparing web-based AI agent evaluation benchmarks. Aggregates 4,725+ tasks from 10 major benchmarks into a single interface.
📚 Learn more: Browser Automation & Web Evals Overview
- GAIA (466 tasks) - General AI Assistant benchmark with 3 difficulty levels
- Mind2Web (1,009 tasks) - Real-world website interaction tasks
- Mind2Web2 (130 tasks) - Updated version with domain categorization
- BrowseComp (1,266 tasks) - Web browsing comprehension tasks
- WebArena (812 tasks) - Realistic web navigation scenarios
- WebVoyager (643 tasks) - Long-horizon web navigation tasks
- REAL (113 tasks) - Real-world web agent challenges with difficulty ratings
- Bearcubs (111 tasks) - Web agent evaluation tasks
- Agent-Company (175 tasks) - Domain-specific company tasks
- OSWorld (400+ tasks) - Desktop application automation (Chrome, GIMP, LibreOffice, VS Code, etc.)
- Interactive Streamlit Dashboard - Filter, sort, and explore tasks
- REST API - Programmatic access with full filtering and pagination
- Unified Schema - Normalized data structure across all benchmarks
- Advanced Filtering - By benchmark, difficulty, domain, website/app, and more
- Task Search - Full-text search across task descriptions
```bash
# Install dependencies
pip install -r requirements.txt

# Launch the Streamlit dashboard
streamlit run main.py

# Or start the FastAPI server
python api.py
```
Interactive API docs are available at http://localhost:8000/docs.

```bash
# Get all GAIA Level 1 tasks
curl "http://localhost:8000/tasks?benchmark=gaia&Level=1.0"

# Search for tasks about email
curl "http://localhost:8000/search?q=email&limit=10"

# Get WebArena tasks with pagination
curl "http://localhost:8000/tasks?benchmark=webarena&limit=50&offset=0"

# Get a task by ID
curl "http://localhost:8000/tasks/4480"
```

```
├── main.py              # Streamlit dashboard
├── api.py               # FastAPI REST API
├── shared.py            # Data loading & normalization
├── requirements.txt     # Python dependencies
├── GAIA/                # GAIA benchmark data
├── WebVoyager/          # WebVoyager benchmark data
├── webarena/            # WebArena benchmark data
├── real-evals-agi/      # REAL benchmark data
├── OsWorld/             # OSWorld benchmark data
├── mind2web2/           # Mind2Web2 benchmark data
├── agent-company/       # Agent-Company benchmark data
├── bearcubs/            # Bearcubs benchmark data
└── openai-simple-evals/ # BrowseComp benchmark data
```
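The same queries can also be issued from Python. Below is a minimal sketch using only the standard library; the endpoint paths and parameters mirror the curl examples above, while the `tasks_url` and `fetch_tasks` helpers are illustrative names, not part of the project:

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:8000"  # assumes the FastAPI server started via `python api.py`

def tasks_url(**params):
    """Build a /tasks query URL from keyword filters (e.g. benchmark, limit)."""
    query = urllib.parse.urlencode(params)
    return f"{BASE_URL}/tasks?{query}"

def fetch_tasks(**params):
    """GET the /tasks endpoint and decode the JSON response (requires a running server)."""
    with urllib.request.urlopen(tasks_url(**params)) as resp:
        return json.load(resp)

# Equivalent to the first curl example above:
print(tasks_url(benchmark="gaia", Level="1.0"))
# http://localhost:8000/tasks?benchmark=gaia&Level=1.0
```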
All tasks are normalized to a common format:

- `task_id` - Unique identifier
- `Question` - Task description/instruction
- `benchmark` - Source benchmark name
- `web_name` - Target website/application
- `domain`/`subdomain` - Task categorization (when available)
- `difficulty`/`Level` - Task difficulty (benchmark-specific)
- `web_url` - Starting URL (when available)
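As an illustration, a normalized record might look like the following (all field values here are hypothetical), and filtering the unified task list reduces to matching on these keys:

```python
# A hypothetical task record in the unified schema described above.
task = {
    "task_id": "4480",
    "Question": "Find the contact email on the company's About page.",
    "benchmark": "webarena",
    "web_name": "example-shop",
    "domain": "e-commerce",
    "difficulty": None,  # benchmark-specific; absent for some benchmarks
    "web_url": "http://localhost:7770/",
}

def filter_tasks(tasks, **criteria):
    """Keep tasks whose fields match every given criterion."""
    return [
        t for t in tasks
        if all(t.get(key) == value for key, value in criteria.items())
    ]

# Example: select WebArena tasks from a list of normalized records.
webarena_tasks = filter_tasks([task], benchmark="webarena")
```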
To add a new benchmark:

- Add the data to an appropriate directory
- Update the `DATASETS` config in `shared.py`, implementing a loader function if needed
- Add normalization logic in `normalize_task_data()`
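A new loader and its normalization step might follow this shape (a sketch only: the benchmark name, file layout, and raw field names are hypothetical, and the actual `DATASETS` structure and `normalize_task_data()` signature in `shared.py` may differ):

```python
import json
from pathlib import Path

def load_mybenchmark(data_dir):
    """Hypothetical loader: read raw task records from a JSONL file."""
    tasks = []
    with open(Path(data_dir) / "tasks.jsonl") as f:
        for line in f:
            tasks.append(json.loads(line))
    return tasks

def normalize_mybenchmark(raw):
    """Map one raw record onto the unified schema fields."""
    return {
        "task_id": str(raw["id"]),
        "Question": raw["instruction"],
        "benchmark": "mybenchmark",
        "web_name": raw.get("site"),
        "web_url": raw.get("start_url"),
    }
```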
Data sources retain their original licenses. See individual benchmark repositories for details.
