llm-crawler-benchmarks

Which web crawler is best for LLM/RAG pipelines? We tested 7 tools across 8 sites to find out.

Head-to-head benchmark suite comparing web crawlers on speed, extraction quality, retrieval quality, LLM answer quality, and cost at scale. Every benchmark is reproducible from a single command.

Why this exists

Most crawler benchmarks test one dimension (speed or extraction accuracy) in isolation. But in an LLM/RAG pipeline, the crawler is stage 1 -- everything downstream (chunking, embedding, retrieval, LLM generation) depends on what the crawler produces. A tool that is fast but outputs noisy Markdown will inflate your embedding costs and degrade retrieval quality.

This project measures the full pipeline: crawl, chunk, embed, retrieve, and generate an LLM answer -- then scores each stage independently so you can see where the differences actually matter.
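To make the dependency on stage 1 concrete, here is a toy sketch of the chunk, embed, and retrieve stages. It is not the benchmark's implementation: the word-window chunker, the bag-of-words "embedding", and the two sample pages are all assumptions chosen so the example runs without a model or network access.

```python
import math
import re
from collections import Counter


def chunk(text: str, max_words: int = 40) -> list[str]:
    """Split text into fixed-size word windows (stand-in for a real chunker)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text: str) -> Counter:
    """Toy embedding: lowercase bag-of-words counts instead of a model call."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], k: int = 3) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


# Hypothetical stage-1 output from two crawlers for the same page:
# one clean, one with navigation text left in front of the content.
clean_page = "FastAPI path parameters are declared with curly braces in the route."
noisy_page = "Home Docs Blog Login Sign up Search Menu " * 5 + clean_page

for name, page in [("clean", clean_page), ("noisy", noisy_page)]:
    chunks = chunk(page, max_words=12)
    top = retrieve("how to declare path parameters", chunks, k=1)[0]
    print(f"{name}: {len(chunks)} chunk(s) to embed; top hit: {top!r}")
```

In this sketch the noisy page produces more chunks to embed, and its best-matching chunk still carries navigation text alongside the answer.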

Key Findings

Dimension	Winner	Key metric	Runner-up
Speed	scrapy+md	5.0 pages/sec	markcrawl (2.7 p/s)
Extraction quality	markcrawl	99% content signal, 53 words preamble	scrapy+md (92%, 500 words)
Retrieval quality	crawlee	94% Hit@10, 0.765 MRR	crawl4ai-raw (95%, 0.763)
LLM answer quality	crawl4ai	4.72/5 overall score	crawl4ai-raw (4.70/5)
Cost at scale	markcrawl	$4,505/yr (100K pages, 1K q/day)	scrapy+md ($5,464/yr)
Pipeline timing	markcrawl	440.7s end-to-end, $0.40	scrapy+md (451.3s, $1.05)

Leaderboard (Benchmark v2.0)

All 7 tools, sorted by speed. 8 sites, 109 retrieval queries, scored on 5 dimensions.

Tool	Speed (p/s)	Content Signal	MRR	Answer (/5)	Cost (100K/yr)
scrapy+md	5.0	92%	0.176	3.68	$5,464
markcrawl	2.7	99%	0.341	3.77	$4,505
playwright	2.5	67%	0.758	4.48	$7,320
crawl4ai	1.4	82%	0.757	4.72	$6,960
crawl4ai-raw	1.4	83%	0.763	4.70	$6,961
crawlee	1.3	68%	0.765	4.68	$7,467
colly+md	1.0	70%	0.459	4.36	$7,213

Column definitions: Speed = pages/sec (median of 3 runs). Content Signal = (total words - preamble) / total words (higher = cleaner). MRR = Mean Reciprocal Rank, best retrieval mode per tool. Answer = LLM answer quality scored 1-5 by gpt-4o-mini. Cost = annual RAG pipeline cost at 100K pages, 1K queries/day.
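The metric definitions above can be sketched in a few lines. This is a minimal illustration rather than the code in quality_scorer.py or benchmark_retrieval.py; the function names, the `(ranked_ids, relevant_id)` input shape, and the sample values are assumptions.

```python
def content_signal(total_words: int, preamble_words: int) -> float:
    """(total words - preamble) / total words; higher means cleaner output."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return (total_words - preamble_words) / total_words


def reciprocal_rank(ranked_ids: list[str], relevant_id: str) -> float:
    """1 / rank of the relevant result, or 0.0 if it was not retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(results: list[tuple[list[str], str]]) -> float:
    """Average reciprocal rank over all queries."""
    return sum(reciprocal_rank(ranked, rel) for ranked, rel in results) / len(results)


def hit_at_k(results: list[tuple[list[str], str]], k: int = 10) -> float:
    """Fraction of queries whose relevant result appears in the top k."""
    return sum(rel in ranked[:k] for ranked, rel in results) / len(results)


# Illustrative values only, not benchmark data.
print(content_signal(total_words=1000, preamble_words=80))  # 0.92

results = [
    (["a", "b", "c"], "a"),  # relevant result at rank 1
    (["a", "b", "c"], "b"),  # relevant result at rank 2
    (["a", "b", "c"], "z"),  # relevant result not retrieved
]
print(mean_reciprocal_rank(results))  # 0.5
print(hit_at_k(results, k=2))
```

MRR rewards placing the relevant result near the top, while Hit@k only checks whether it appears at all, which is why two tools can swap order between the two metrics.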

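Bottom line: scrapy+md is the fastest at 5.0 pages/sec, with markcrawl second at 2.7. markcrawl wins on pipeline timing (440.7s end-to-end, $0.40), extraction quality (99% content signal), and cost at scale ($4,505/yr). crawl4ai narrowly leads answer quality at 4.72/5, ahead of crawl4ai-raw (4.70) and crawlee (4.68); the full range across tools is 3.68-4.72. Among the four browser-based tools, retrieval quality barely differs (MRR 0.757-0.765), and switching retrieval mode (e.g., to reranked) gains more than switching crawlers. The remaining three tools (scrapy+md, markcrawl, colly+md) trail on this metric (MRR 0.176-0.459).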

Tools Compared

Tool	Type	JS rendering	Notes
markcrawl	Python	Optional	Markdown-first, lowest preamble
scrapy+md	Python	No	Fastest raw HTTP crawler
crawl4ai	Python	Built-in	AI-native, browser-based
crawl4ai-raw	Python	Built-in	crawl4ai with raw HTML output
colly+md	Go	No	Fast compiled crawler
crawlee	Python	Built-in	Apify's browser crawler
playwright	Python	Built-in	Microsoft's browser automation

All tools output Markdown via the same html-to-markdown pipeline (except crawl4ai-raw). See METHODOLOGY.md for tool configurations and fairness decisions.

Sites Tested

Site	Pages	Type
quotes.toscrape.com	15	Simple paginated HTML
books.toscrape.com	60	E-commerce catalog
fastapi.tiangolo.com	153	API docs (code blocks, tutorials)
docs.python.org	500	Standard library reference
react.dev	500	SPA, JS-rendered
en.wikipedia.org	50	Tables, infoboxes, citations
docs.stripe.com	500	Tabbed content, code samples
github.blog	200	Blog articles, images

Reports

Report	Question it answers
Speed Comparison	Which crawler is fastest?
Quality Comparison	Which produces the cleanest Markdown?
Retrieval Comparison	Does cleaner Markdown improve retrieval?
Answer Quality	Does better retrieval improve LLM answers?
Cost at Scale	What does each crawler cost at 100K+ pages?
Pipeline Timing	How long does the full RAG pipeline take?
MarkCrawl Self-Benchmark	MarkCrawl standalone performance
Methodology	How were these benchmarks run?

Transparency

This benchmark is maintained by the creators of markcrawl, one of the tools tested. We designed the methodology to be fair (identical seed URLs, randomized execution order, published scripts), but readers should be aware of this relationship. All code and data are published so results can be independently verified. If you rerun and get different results, open an issue.

Limitations

Single machine, single location. All benchmarks run on one machine in one geographic location. Network latency to each site varies by location, so absolute pages/sec numbers will differ on your hardware.
Live sites introduce variance. These are real public websites, not frozen snapshots. Server load, CDN caching, and content changes cause run-to-run variance. We report medians across 3 iterations to reduce noise.
Markdown output only. We evaluate Markdown extraction quality. Tools that excel at structured data extraction (JSON, tables) may rank differently on those tasks.
7 tools, not all crawlers. We test 7 open-source crawlers commonly used in RAG pipelines. Others, such as Apify and ScrapingBee, are not included. See "Including a tool" below to add one.
LLM-judged quality. Answer quality is scored by gpt-4o-mini, not human reviewers. LLM judges have known biases (verbosity preference, position effects). We mitigate these with 4-dimension scoring, but the scores are not ground truth.
No anti-bot or authentication testing, and limited SPA coverage. All test sites are publicly accessible and crawler-friendly, and react.dev is the only site in the set listed as a JS-rendered SPA. Results do not apply to sites with bot detection, rate limiting, or login walls.
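As a small illustration of the median-of-3 reporting mentioned above, the sketch below shows how one slow run on a live site affects the median less than the mean. The timings are made up for the example, and `pages_per_second` is a hypothetical helper, not a function from this repository.

```python
from statistics import mean, median


def pages_per_second(pages: int, seconds: float) -> float:
    """Throughput for a single crawl run."""
    return pages / seconds


# Hypothetical (pages, seconds) for three runs of one tool on one site;
# the third run hit a slow server.
runs = [(60, 24.0), (60, 20.0), (60, 60.0)]
speeds = [pages_per_second(pages, seconds) for pages, seconds in runs]

reported = median(speeds)
print(f"median: {reported:.2f} p/s, mean: {mean(speeds):.2f} p/s")
```

With these inputs the median stays at the typical run's throughput, while the mean is pulled down by the outlier.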

Including a tool

We welcome contributions. To add a crawler to the benchmark:

Open an issue describing the tool, its license, and what makes it relevant for LLM/RAG pipelines.
Submit a PR with a runner script in runners/ that matches the interface of existing runners (accepts a URL list, outputs Markdown + JSONL index). See runners/README.md for the spec.
The tool must be open-source with a published package (pip, npm, go module, etc.).
We run all benchmarks on the same hardware with the same sites and queries. You don't need to provide benchmark results -- just the runner.
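For orientation, a runner that "accepts a URL list, outputs Markdown + JSONL index" might look roughly like the sketch below. The authoritative interface is the spec in runners/README.md; everything here (the `run` signature, the injected `fetch_markdown` callable, the `page_NNNN.md` file naming, and the index fields `url`, `file`, `words`) is an assumption made for illustration.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def run(urls: list[str], out_dir: Path, fetch_markdown: Callable[[str], str]) -> Path:
    """Write one Markdown file per URL plus a JSONL index; return the index path."""
    out_dir.mkdir(parents=True, exist_ok=True)
    index_path = out_dir / "index.jsonl"
    with index_path.open("w", encoding="utf-8") as index:
        for i, url in enumerate(urls):
            markdown = fetch_markdown(url)  # the crawler under test goes here
            md_path = out_dir / f"page_{i:04d}.md"
            md_path.write_text(markdown, encoding="utf-8")
            # One JSON object per line: hypothetical fields, see lead-in.
            record = {"url": url, "file": md_path.name, "words": len(markdown.split())}
            index.write(json.dumps(record) + "\n")
    return index_path


def fake_fetch(url: str) -> str:
    """Offline stand-in for a real crawler so the sketch runs anywhere."""
    return f"# Title\n\nBody for {url}"


with tempfile.TemporaryDirectory() as tmp:
    index_file = run(["https://example.com/a", "https://example.com/b"], Path(tmp) / "out", fake_fetch)
    demo_records = [json.loads(line) for line in index_file.read_text(encoding="utf-8").splitlines()]
    print(demo_records)
```

Passing the fetch function in as a parameter keeps the file and index handling identical across tools, so only the crawler call differs between runners.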

Reproducing these results

# Install dependencies
pip install -e ".[dev]"

# Preflight check (verifies all tools are installed)
python preflight.py

# Run all benchmarks (~3-5 hours)
python benchmark_all_tools.py

# Run individual benchmarks
python benchmark_quality.py
python benchmark_retrieval.py
python benchmark_answer_quality.py
python benchmark_pipeline.py
python benchmark_markcrawl.py

# Regenerate this README from report data
python generate_readme.py

Docker

docker build -t llm-crawler-benchmarks .
docker run --rm \
  -e OPENAI_API_KEY \
  -v "$(pwd)/reports:/app/reports" \
  -v "$(pwd)/runs:/app/runs" \
  llm-crawler-benchmarks

Benchmark version

v2.0 -- 2026-05-12

When benchmark methodology changes (new sites, different scoring, updated tool versions), we increment the version. Results from different versions are not directly comparable. See METHODOLOGY.md for the full test setup.

Related Work

Other projects benchmark parts of the web scraping pipeline:

Firecrawl scrape-evals -- 1,000-URL extraction quality benchmark (precision/recall). Single-page quality only; no speed, retrieval, or LLM answer evaluation.
WCXB -- 2,008-page content extraction leaderboard with word-level F1. Covers traditional tools (trafilatura, readability) but not LLM-era crawlers.
Spider.cloud benchmark -- 3-tool comparison (Firecrawl, Crawl4AI, Spider) on throughput, cost, and RAG retrieval accuracy.

This project differs by evaluating the full RAG pipeline -- from crawl through chunk, embed, retrieve, and LLM answer -- across 7 tools, 8 sites, and 5 dimensions including downstream answer quality and cost at scale.

Self-Improvement Framework

The self_improvement/ directory contains a 9-spec review framework for auditing benchmark quality. See self_improvement/MASTER.md.

License

MIT

