Hardest: vague multi-turn proactive search in the wild.
Verifiable: schema-free knowledge graph evaluation.
Long-horizon: persona-driven progressive disclosure.
Browse the full leaderboard and multi-turn task trajectories at vibebench.github.io/VibeSearchBench.github.io.
Evaluation:
- Primary metric: Triplet F1. Predicted knowledge graphs are matched against ground truth via LLM-as-judge node alignment and triplet semantic equivalence.
- Multi-turn interaction. Each task uses a persona-driven user simulator with progressive disclosure; agents may search, visit pages, and run code across many turns.
- Best reported score: 30.3 triplet F1 (Claude Opus 4.6, OpenClaw).
200 tasks across 2 subsets and 20 domains. Each task pairs a vague initial query with a ground-truth knowledge graph.
| Split | Count | Description |
|---|---|---|
| `pro` | 100 | Professional research: literature reviews, market analysis, technical due diligence |
| `daily` | 100 | Daily-life search: shopping, travel, lifestyle with evolving preferences |
Real users rarely specify full intent upfront. VibeSearch captures bidirectional convergence: agents interleave partial results with follow-up questions while users progressively disclose needs.
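To make progressive disclosure concrete, here is a toy sketch of a user who starts with a vague query and reveals one hidden constraint per follow-up question. It is not the benchmark's simulator: `ToyUserSimulator`, `run_dialogue`, and the scripted constraints are hypothetical names invented for illustration, and a real simulator would be persona-driven and LLM-backed rather than scripted.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyUserSimulator:
    """Hypothetical stand-in for a persona-driven user simulator."""

    initial_query: str
    hidden_constraints: list  # constraints the user has not stated yet
    revealed: list = field(default_factory=list)

    def reply(self, agent_question: str) -> Optional[str]:
        """Disclose the next hidden constraint, or None when nothing is left."""
        if not self.hidden_constraints:
            return None
        constraint = self.hidden_constraints.pop(0)
        self.revealed.append(constraint)
        return constraint


def run_dialogue(user: ToyUserSimulator, max_turns: int = 5) -> list:
    """Alternate agent follow-ups and user disclosures; return what the agent knows."""
    known = [user.initial_query]
    for turn in range(max_turns):
        answer = user.reply(f"Follow-up question {turn + 1}")
        if answer is None:  # the user has nothing more to disclose
            break
        known.append(answer)
    return known
```

In this sketch the agent's view of the task converges on the full intent only after several turns, which is the behavior the benchmark is designed to exercise.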
Available on Hugging Face: VibeSearchBench/VibeSearchBench
| Field | Description |
|---|---|
| `qid` | Unique task identifier |
| `question` | Full research query with constraints |
| `user_persona` | Persona for the progressive-disclosure simulator |
| `nodes` / `triples` | Ground-truth knowledge graph |
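A quick sanity check on a task record might look like the sketch below. It assumes `nodes` and `triples` are two separate keys, and the example record (including the shape of its nodes and triples) is invented for illustration; the real schema may differ.

```python
# Field names taken from the table above; treating nodes and triples
# as two separate keys is an assumption.
REQUIRED_FIELDS = ("qid", "question", "user_persona", "nodes", "triples")


def missing_fields(record: dict) -> list:
    """Return the required field names that are absent from a task record."""
    return [name for name in REQUIRED_FIELDS if name not in record]


# Hypothetical record: the values and their shapes are made up.
example_record = {
    "qid": "task_demo",
    "question": "Which open-source vector databases support hybrid search?",
    "user_persona": "Backend engineer evaluating retrieval infrastructure",
    "nodes": ["ExampleDB", "hybrid search"],
    "triples": [["ExampleDB", "supports", "hybrid search"]],
}

print(missing_fields(example_record))
```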
Uses an OpenAI-compatible LLM to drive multi-step web research.
# Full pipeline (inference + evaluation)
MODEL_NAME=glm-5.1 VLLM_URL=http://host/v1 bash scripts/run_all.sh
# Inference only
MODEL_NAME=kimi-k2.5 VLLM_URL=http://host/v1 bash scripts/run_inference.sh
# With model config profile
MODEL_CONFIG=model_config.yaml MODEL_PROFILE=seed2_0_pro bash scripts/run_all.sh
Wraps the OpenClaw CLI into the benchmark. Requires a running OpenClaw gateway.
# Default (simulated mode)
bash scripts/run_openclaw.sh
# Direct mode (no user simulation)
MODE=direct bash scripts/run_openclaw.sh
# Custom data and model
DATA_PATH=tasks/my_tasks MODE=simulated OPENCLAW_MODEL=my-model bash scripts/run_openclaw.sh
Key OpenClaw environment variables: GATEWAY_PORT (default 18789), SOURCE_DIR, IDLE_THRESHOLD, MAX_NUDGE, OPENCLAW_MODEL.
TRAJS_DIR=results/trajs/glm-5.1_custom_serper bash scripts/run_eval.sh
# GeneralAgent: full pipeline
python run.py \
--agent-type general \
--model glm-5.1 \
--vllm-server-url http://host/v1 \
--tool-set custom \
--num-samples 4 \
--grader-type gemini \
--grader-api-url https://... \
--grader-api-key YOUR_KEY
# GeneralAgent: inference only
python run.py \
--agent-type general \
--model glm-5.1 \
--vllm-server-url http://host/v1 \
--skip-eval
# OpenClaw agent
python run.py \
--agent-type openclaw \
--gateway-port 18789 \
--mode simulated \
--user-model doubao-seed-2-0-pro \
--user-model-url http://host/v1 \
--user-model-api-key YOUR_KEY \
--num-samples 4
# Eval only
python run.py \
--eval-only \
--trajs-dir results/trajs/glm-5.1_custom_serper \
--grader-type gemini \
--grader-api-url https://...
VibeSearchBench/
├── agent/                          # Agent implementations
│   ├── general_agent.py            # GeneralAgent (OpenAI-compatible, single/multi-agent)
│   ├── openclaw_agent.py           # OpenClaw agent wrapper
│   ├── llm.py                      # LLM client utilities
│   ├── prompts.py                  # Prompt templates
│   └── toolkit.py                  # ToolKit (search / visit / python via Serper)
├── eval/                           # Evaluation module
│   ├── grader.py                   # GraderClient (OpenAI / Gemini backends)
│   └── evaluator.py                # KG evaluation: node F1, triplet F1
├── scripts/                        # Bash/Python scripts
│   ├── run_all.sh                  # Full pipeline (inference + evaluation)
│   ├── run_inference.sh            # Agent inference only
│   ├── run_eval.sh                 # Evaluation only
│   ├── run_openclaw.sh             # OpenClaw evaluation
│   └── build_website_data.py       # Export data for the project page
├── viberesearch_query_synthesis/   # Query synthesis module
├── website/                        # Static site template (deployed via github.io repo)
├── tasks/                          # Task JSON files (benchmark data)
├── results/                        # Output (auto-created)
├── model_config.yaml               # LLM model profiles
└── run.py                          # Main entry point
| Variable | Description | Default |
|---|---|---|
| `MODEL_NAME` | Model name for chat API | `glm-5.1` |
| `VLLM_URL` | Base URL for chat API | (none) |
| `TOOL_SET` | `custom` or `builtin` | `custom` |
| `API_KEY` | API key for main model | (empty) |
| `MULTI_AGENT` | Set to `1` for multi-agent mode | `0` |
| `SERPER_API_KEY` | Serper API key for web search | (preset) |
| `SUMMARIZE_URL` | vLLM URL for page summarization | (preset) |
| `SUMMARIZE_MODEL` | Model for summarization | `qwen3-30b-a3b-instruct` |
| `CODE_SANDBOX_URL` | HTTP sandbox for Python tool | (preset) |
| `GEMINI_API_KEY` | API key for Gemini grader | (preset) |
| `GEMINI_API_URL` | API URL for Gemini grader | (preset) |
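The defaults in the table can be mirrored in a small Python helper, for example when driving the benchmark from a notebook instead of the shell scripts. This is a sketch only: `load_settings` is a hypothetical helper, not part of the repository, and it covers just the variables whose defaults the table states explicitly.

```python
import os

# Defaults copied from the table above; "(preset)" values are
# deliberately left out because their contents are not documented here.
DEFAULTS = {
    "MODEL_NAME": "glm-5.1",
    "TOOL_SET": "custom",
    "MULTI_AGENT": "0",
    "SUMMARIZE_MODEL": "qwen3-30b-a3b-instruct",
}


def load_settings(env=None) -> dict:
    """Resolve settings from an environment mapping (os.environ by default)."""
    env = os.environ if env is None else env
    settings = {name: env.get(name, default) for name, default in DEFAULTS.items()}
    settings["VLLM_URL"] = env.get("VLLM_URL")    # no default: None if unset
    settings["API_KEY"] = env.get("API_KEY", "")  # empty by default
    if settings["TOOL_SET"] not in ("custom", "builtin"):
        raise ValueError(f"TOOL_SET must be 'custom' or 'builtin', got {settings['TOOL_SET']!r}")
    # Expose the multi-agent switch as a boolean for convenience.
    settings["MULTI_AGENT"] = settings["MULTI_AGENT"] == "1"
    return settings
```

Passing an explicit mapping, as in `load_settings({"VLLM_URL": "http://host/v1"})`, keeps the helper easy to test without touching the real environment.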
- custom (default): search (Serper) + visit (Serper scrape + LLM summarize) + python (HTTP sandbox)
- builtin: search + open + find (requires the `gpt_oss` package)
- Single-agent: One agent handles the entire query
- Multi-agent (`MULTI_AGENT=1`): Main agent can spawn sub-agents for parallel research
One JSONL file per task ({task_id}.jsonl); each line is one sample:
{"qid": "task_042_...", "sample_idx": 0, "question": "...", "messages": [...], "response": "...", "termination": "answer", ...}
- {task_id}_sample{N}.json: Per-trajectory evaluation with node/triplet metrics
- item_ratings.json: All per-item results
- summary.json: Aggregated metrics (avg@N, best@N)
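The per-task JSONL trajectory files can be inspected with a few lines of Python. The sketch below parses JSONL text and counts samples by their `termination` value. The demo records are invented, and `"max_turns"` is a hypothetical termination value; only `"answer"` appears in the example above.

```python
import json
from collections import Counter


def read_samples(jsonl_text: str) -> list:
    """Parse JSONL text into a list of sample dicts, skipping blank lines."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


def termination_counts(samples: list) -> Counter:
    """Count samples per termination reason; a missing key counts as 'unknown'."""
    return Counter(sample.get("termination", "unknown") for sample in samples)


# Invented demo data in the documented shape (one sample per line).
demo_jsonl = "\n".join([
    json.dumps({"qid": "task_demo", "sample_idx": 0, "response": "...", "termination": "answer"}),
    json.dumps({"qid": "task_demo", "sample_idx": 1, "response": "...", "termination": "max_turns"}),
])

samples = read_samples(demo_jsonl)
print(termination_counts(samples))
```

For a file on disk, the same function can be fed `Path(path).read_text(encoding="utf-8")`.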
openai aiohttp httpx tqdm transformers json_repair
Two-phase LLM-as-judge evaluation:
- Node matching: LLM matches predicted entities to ground-truth entities (alias/translation-aware)
- Triplet matching: For matched entity pairs, LLM judges relation semantic equivalence
Metrics: Precision, Recall, F1 at both node and triplet levels, with avg@N and best@N aggregation across samples.
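Once the judge has decided which predicted items match ground-truth items, the metrics reduce to simple counting. The sketch below makes three assumptions: matching is one-to-one, so a single matched count serves both precision and recall; avg@N is the mean over N samples; and best@N is the maximum. The function names are illustrative and not taken from `eval/evaluator.py`.

```python
def precision_recall_f1(num_matched: int, num_predicted: int, num_gold: int):
    """Compute precision, recall, and F1 from match counts (0.0 on empty denominators)."""
    precision = num_matched / num_predicted if num_predicted else 0.0
    recall = num_matched / num_gold if num_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def avg_at_n(scores: list) -> float:
    """Mean score across the N samples of one task (assumed meaning of avg@N)."""
    return sum(scores) / len(scores)


def best_at_n(scores: list) -> float:
    """Highest score across the N samples of one task (assumed meaning of best@N)."""
    return max(scores)


# Example: 3 of 6 predicted triplets match, against 4 ground-truth triplets.
p, r, f1 = precision_recall_f1(num_matched=3, num_predicted=6, num_gold=4)
```

The same function applies at both the node and the triplet level; only the counts differ.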
This project is released under the MIT License.
VibeSearchBench · Rednote-Hilab & Unipat AI