World Knowledge

📝 Paper: Training LLM Agents for Spontaneous, Reward-Free Self-Evolution via World Knowledge Exploration | arXiv

This is a web-agent pipeline for constructing reusable environment-specific knowledge and using it to improve downstream web task execution.

In the terminology of the paper, the core artifact is World Knowledge (K): a compact, structured representation of a specific environment instance. In this repository, that artifact is stored as Markdown and is historically referred to in some scripts as a notebook or guidebook. These terms refer to the same object.

Overview

This repository implements the web-environment pipeline behind the paper. The central idea is to split agent behavior into two phases:

Native Evolution Phase: the agent explores a previously unseen website and compresses its observations into World Knowledge.
Knowledge-Enhanced Execution Phase: the agent solves downstream tasks with that World Knowledge provided as external context.

At a high level, the workflow is:

Target website URLs
    ->
URL crawl / clustering
    ->
World Knowledge prompt generation
    ->
World Knowledge generation
    ->
Test set construction
    ->
CK-pro inference
    ->
LLM-based judging
    ->
Accuracy / efficiency analysis

World Knowledge is intended to be:

compact enough to fit into the agent context window,
structured around website regions and URL prefixes,
actionable for navigation and question answering,
environment-specific, functioning as a mental map of one concrete website.

World Knowledge improves both downstream effectiveness and efficiency: large absolute gains on WebWalker and WebVoyager, fewer execution steps, and cross-model transferability.

Setup

1. Environment

The current installation notes live in requirement.sh. In practice, the project expects:

Python 3.12
openai, jsonlines, transformers, matplotlib, and related Python packages
browser / scraping dependencies for CK web agents
Node.js and npm
Playwright/browser runtime dependencies
one or more vLLM-compatible inference endpoints

You also need to replace placeholder values in the shell scripts, such as:

PYTHONPATH
BROWSERLESS_TARGET_HOST
BROWSERLESS_TOKEN
HF_TOKEN
hard-coded /path/to/... values

2. Web servers

The web agent depends on a pool of web servers:

bash run_web_server.sh

3. Inference endpoints

The pipeline assumes accessible inference backends, with the current scripts defaulting to ports such as 8080-8083.

Evaluation

Evaluation in this repository follows the paper's World-Knowledge-enhanced web-agent setting.

Input preparation

Step 1: Crawl URLs

Start from one or more seed websites:

python preprocess/crawl_urls.py preprocess/urls.txt

This produces same-domain URL lists.

Step 2: Cluster URLs

Convert crawled URLs into the clustered website representation used for World Knowledge generation:

python preprocess/cluster_urls.py data/conference/

This matches the paper's input-processing design: websites are transformed into clustered, structured representations before World Knowledge generation.

Run the pipeline

Three main entry scripts are provided:

Script	Purpose
`data_pipeline_train.sh`	Full pipeline: World Knowledge prompt generation, World Knowledge generation, test data construction, CK-pro inference, judging, and analysis.
`data_pipeline_train_gen_only.sh`	Generation-only pipeline: World Knowledge prompt generation, World Knowledge generation, and test data construction.
`data_pipeline_train_test_only.sh`	Test-only pipeline: test data construction, CK-pro inference, judging, and analysis. Assumes World Knowledge Markdown files already exist.

All three scripts support two input modes:

MODE=urls: use the hard-coded urls=(...) array inside the script.
MODE=domain: scan data/${domain}/*_clusters.txt and automatically collect all URLs for that domain.

Full pipeline

MODE=urls bash data_pipeline_train.sh 5

or

MODE=domain bash data_pipeline_train.sh 5

World Knowledge generation only

MODE=urls bash data_pipeline_train_gen_only.sh 5

or

MODE=domain bash data_pipeline_train_gen_only.sh 5

Test + judge only

Use this when World Knowledge Markdown files already exist:

MODE=urls bash data_pipeline_train_test_only.sh 5

or

MODE=domain bash data_pipeline_train_test_only.sh 5

Analysis scripts

The main evaluation utilities are:

File	Purpose
`test_accuracy.py`	Uses an LLM judge to score correctness and writes judged outputs.
`test_efficency.py`	Computes average step counts, including sub-agent steps.
`calculate_effectiveness.py`	Aggregates judged outputs and plots accuracy across settings.

These correspond to the two major metrics emphasized in the paper:

effectiveness: downstream task success rate,
efficiency: number of execution steps.

Repository Structure

world-knowledge/
├── data/
│   └── conference/
├── preprocess/
│   ├── crawl_urls.py
│   ├── cluster_urls.py
│   └── urls.txt
├── questions/
│   └── notebook_prompt/
├── queue_file/
├── test_data/
├── output_note/
├── output_ans/
├── results/
├── pipeline_log/
├── System/
├── Evaluation/
├── dpo/
├── notebook_prompt.py
├── notebook_prompt_short.py
├── problem_generation_with_notebook.py
├── test_accuracy.py
├── test_efficency.py
├── calculate_effectiveness.py
├── data_pipeline_train.sh
├── data_pipeline_train_gen_only.sh
├── data_pipeline_train_test_only.sh
├── run_web_server.sh
└── requirement.sh

Citation

@article{zhang2026training,
  title={Training LLM Agents for Spontaneous, Reward-Free Self-Evolution via World Knowledge Exploration},
  author={Zhang, Qifan and Ma, Dongyang and Fang, Tianqing and Li, Jia and Tang, Jing and Chen, Nuo and Mi, Haitao and Wang, Yan},
  journal={arXiv preprint arXiv:2604.18131},
  year={2026}
}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

World Knowledge

Overview

Setup

1. Environment

2. Web servers

3. Inference endpoints

Evaluation

Input preparation

Step 1: Crawl URLs

Step 2: Cluster URLs

Run the pipeline

Full pipeline

World Knowledge generation only

Test + judge only

Analysis scripts

Repository Structure

Citation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 16 Commits
System		System
data		data
preprocess		preprocess
questions		questions
.gitignore		.gitignore
README.md		README.md
calculate_effectiveness.py		calculate_effectiveness.py
data_pipeline_train.sh		data_pipeline_train.sh
data_pipeline_train_gen_only.sh		data_pipeline_train_gen_only.sh
data_pipeline_train_test_only.sh		data_pipeline_train_test_only.sh
notebook_prompt.py		notebook_prompt.py
notebook_prompt_short.py		notebook_prompt_short.py
overview.jpg		overview.jpg
problem_generation_with_notebook.py		problem_generation_with_notebook.py
requirement.sh		requirement.sh
run_web_server.sh		run_web_server.sh
test_accuracy.py		test_accuracy.py
test_efficency.py		test_efficency.py

Folders and files

Latest commit

History

Repository files navigation

World Knowledge

Overview

Setup

1. Environment

2. Web servers

3. Inference endpoints

Evaluation

Input preparation

Step 1: Crawl URLs

Step 2: Cluster URLs

Run the pipeline

Full pipeline

World Knowledge generation only

Test + judge only

Analysis scripts

Repository Structure

Citation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages