Head-to-head benchmark suite comparing web crawlers on speed, extraction quality, retrieval quality, LLM answer quality, and cost at scale. Every benchmark is reproducible from a single command.
Most crawler benchmarks test one dimension (speed or extraction accuracy) in isolation. But in an LLM/RAG pipeline, the crawler is stage 1 -- everything downstream (chunking, embedding, retrieval, LLM generation) depends on what the crawler produces. A tool that is fast but outputs noisy Markdown will inflate your embedding costs and degrade retrieval quality.
This project measures the full pipeline: crawl, chunk, embed, retrieve, and generate an LLM answer -- then scores each stage independently so you can see where the differences actually matter.
| Dimension | Winner | Key metric | Runner-up |
|---|---|---|---|
| Speed | scrapy+md | 5.0 pages/sec | markcrawl (2.7 p/s) |
| Extraction quality | markcrawl | 99% content signal, 53 words preamble | scrapy+md (92%, 500 words) |
| Retrieval quality | crawlee | 94% Hit@10, 0.765 MRR | crawl4ai-raw (95%, 0.763) |
| LLM answer quality | crawl4ai | 4.72/5 overall score | crawl4ai-raw (4.70/5) |
| Cost at scale | markcrawl | $4,505/yr (100K pages, 1K q/day) | scrapy+md ($5,464/yr) |
| Pipeline timing | markcrawl | 440.7s end-to-end, $0.40 | scrapy+md (451.3s, $1.05) |
All 7 tools, sorted by speed. 8 sites, 109 retrieval queries, scored on 5 dimensions.
| Tool | Speed (p/s) | Content Signal | MRR | Answer (/5) | Cost (100K/yr) |
|---|---|---|---|---|---|
| scrapy+md | 5.0 | 92% | 0.176 | 3.68 | $5,464 |
| markcrawl | 2.7 | 99% | 0.341 | 3.77 | $4,505 |
| playwright | 2.5 | 67% | 0.758 | 4.48 | $7,320 |
| crawl4ai | 1.4 | 82% | 0.757 | 4.72 | $6,960 |
| crawl4ai-raw | 1.4 | 83% | 0.763 | 4.70 | $6,961 |
| crawlee | 1.3 | 68% | 0.765 | 4.68 | $7,467 |
| colly+md | 1.0 | 70% | 0.459 | 4.36 | $7,213 |
Column definitions: Speed = pages/sec (median of 3 runs). Content Signal = (total words - preamble) / total words (higher = cleaner). MRR = Mean Reciprocal Rank, best retrieval mode per tool. Answer = LLM answer quality scored 1-5 by gpt-4o-mini. Cost = annual RAG pipeline cost at 100K pages, 1K queries/day.
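The Content Signal and MRR definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's own scoring code: the function names are hypothetical, and it assumes each query has exactly one relevant document, which the text does not confirm.

```python
from statistics import mean


def content_signal(total_words: int, preamble_words: int) -> float:
    """(total words - preamble) / total words; higher means cleaner output."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return (total_words - preamble_words) / total_words


def reciprocal_rank(ranked_ids, relevant_id) -> float:
    """1 / rank of the relevant document, or 0.0 if it was not retrieved."""
    for position, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == relevant_id:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(results) -> float:
    """results: list of (ranked_ids, relevant_id) pairs, one per query."""
    return mean(reciprocal_rank(ranked, rel) for ranked, rel in results)


def hit_at_k(results, k: int = 10) -> float:
    """Fraction of queries whose relevant document appears in the top k."""
    return mean(1.0 if rel in ranked[:k] else 0.0 for ranked, rel in results)


# Toy data (assumption: one relevant document per query).
toy_results = [
    (["a", "b", "c"], "a"),  # relevant doc at rank 1 -> 1.0
    (["a", "b", "c"], "b"),  # relevant doc at rank 2 -> 0.5
    (["a", "b", "c"], "z"),  # relevant doc missing   -> 0.0
]
print(content_signal(1000, 10))           # 0.99
print(mean_reciprocal_rank(toy_results))  # 0.5
```

Scoring MRR and Hit@10 separately is useful because a tool can retrieve the right chunk for most queries (high Hit@10) while still ranking it low (lower MRR).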
Bottom line: scrapy+md is the fastest at 5.0 pages/sec, with markcrawl second at 2.7. markcrawl wins on pipeline timing (440.7s end-to-end, $0.40) and extraction quality (99% content signal). Answer quality falls within about one point across all tools (3.68-4.72/5), with crawl4ai narrowly leading. Retrieval quality barely differs among the four browser-based tools (0.757-0.765 MRR) -- switching retrieval mode (e.g., to reranked) gains more than switching between those crawlers.
| Tool | Type | JS rendering | Notes |
|---|---|---|---|
| markcrawl | Python | Optional | Markdown-first, lowest preamble |
| scrapy+md | Python | No | Fastest raw HTTP crawler |
| crawl4ai | Python | Built-in | AI-native, browser-based |
| crawl4ai-raw | Python | Built-in | crawl4ai with raw HTML output |
| colly+md | Go | No | Fast compiled crawler |
| crawlee | Python | Built-in | Apify's browser crawler |
| playwright | Python | Built-in | Microsoft's browser automation |
All tools output Markdown via the same html-to-markdown pipeline (except crawl4ai-raw). See METHODOLOGY.md for tool configurations and fairness decisions.
| Site | Pages | Type |
|---|---|---|
| quotes.toscrape.com | 15 | Simple paginated HTML |
| books.toscrape.com | 60 | E-commerce catalog |
| fastapi.tiangolo.com | 153 | API docs (code blocks, tutorials) |
| docs.python.org | 500 | Standard library reference |
| react.dev | 500 | SPA, JS-rendered |
| en.wikipedia.org | 50 | Tables, infoboxes, citations |
| docs.stripe.com | 500 | Tabbed content, code samples |
| github.blog | 200 | Blog articles, images |
| Report | Question it answers |
|---|---|
| Speed Comparison | Which crawler is fastest? |
| Quality Comparison | Which produces the cleanest Markdown? |
| Retrieval Comparison | Does cleaner Markdown improve retrieval? |
| Answer Quality | Does better retrieval improve LLM answers? |
| Cost at Scale | What does each crawler cost at 100K+ pages? |
| Pipeline Timing | How long does the full RAG pipeline take? |
| MarkCrawl Self-Benchmark | MarkCrawl standalone performance |
| Methodology | How were these benchmarks run? |
This benchmark is maintained by the creators of markcrawl, one of the tools tested. We designed the methodology to be fair (identical seed URLs, randomized execution order, published scripts), but readers should be aware of this relationship. All code and data are published so results can be independently verified. If you rerun and get different results, open an issue.
- Single machine, single location. All benchmarks run on one machine in one geographic location. Network latency to each site varies by location, so absolute pages/sec numbers will differ on your hardware.
- Live sites introduce variance. These are real public websites, not frozen snapshots. Server load, CDN caching, and content changes cause run-to-run variance. We report medians across 3 iterations to reduce noise.
- Markdown output only. We evaluate Markdown extraction quality. Tools that excel at structured data extraction (JSON, tables) may rank differently on those tasks.
- 7 tools, not all crawlers. We test the most common open-source crawlers used in RAG pipelines. Tools like Apify, ScrapingBee, and others are not included. See "Including a tool" below to add one.
- LLM-judged quality. Answer quality is scored by gpt-4o-mini, not human reviewers. LLM judges have known biases (verbosity preference, position effects). We mitigate this with 4-dimension scoring, but the scores are not ground truth.
- No anti-bot, authentication, or JS-heavy SPA testing. All test sites are publicly accessible and crawler-friendly. Results do not apply to sites with bot detection, rate limiting, or login walls.
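The median-of-3 reporting mentioned in the limitations above could look roughly like this. It is a sketch under stated assumptions: the function names and the `(pages, seconds)` run format are hypothetical and do not describe the benchmark scripts themselves.

```python
from statistics import median


def pages_per_second(pages: int, seconds: float) -> float:
    """Throughput of a single crawl run."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return pages / seconds


def median_speed(runs) -> float:
    """runs: list of (pages_crawled, elapsed_seconds), one entry per iteration."""
    return median(pages_per_second(pages, secs) for pages, secs in runs)


# Three made-up iterations against the same site; the slow first run
# (2.5 p/s) does not drag the reported figure down the way a mean would.
toy_runs = [(100, 40.0), (100, 20.0), (100, 25.0)]
print(median_speed(toy_runs))  # 4.0
```

With only three iterations, the median simply discards the fastest and slowest run, which is why one unlucky run against a live site has limited effect on the reported number.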
We welcome contributions. To add a crawler to the benchmark:
- Open an issue describing the tool, its license, and what makes it relevant for LLM/RAG pipelines.
- Submit a PR with a runner script in `runners/` that matches the interface of existing runners (accepts a URL list, outputs Markdown + JSONL index). See `runners/README.md` for the spec.
- The tool must be open-source with a published package (pip, npm, go module, etc.).
- We run all benchmarks on the same hardware with the same sites and queries. You don't need to provide benchmark results -- just the runner.
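As a rough picture of what "outputs Markdown + JSONL index" might mean, here is a minimal sketch of the output side of a runner. Everything in it is an assumption for illustration: the function names, the `index.jsonl` filename, and the record fields (`url`, `path`, `words`) are hypothetical, and the actual contract is whatever `runners/README.md` specifies.

```python
import json
import re
from pathlib import Path


def slugify(url: str) -> str:
    """Turn a URL into a filesystem-safe file stem (hypothetical naming scheme)."""
    slug = re.sub(r"[^A-Za-z0-9]+", "-", url).strip("-").lower()
    return slug or "page"


def write_outputs(pages, out_dir):
    """Write one .md file per page plus a JSONL index with one record per page.

    pages: iterable of (url, markdown) pairs produced by the crawler under test.
    Returns the path to the index file.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    index_path = out / "index.jsonl"  # assumed filename, not from the spec
    with index_path.open("w", encoding="utf-8") as index:
        for url, markdown in pages:
            md_path = out / f"{slugify(url)}.md"
            md_path.write_text(markdown, encoding="utf-8")
            record = {
                "url": url,
                "path": md_path.name,
                "words": len(markdown.split()),
            }
            index.write(json.dumps(record) + "\n")
    return index_path
```

A real runner would call something like `write_outputs` after fetching each URL in the list with the tool being benchmarked. This sketch does not handle two URLs that slugify to the same name.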
```bash
# Install dependencies
pip install -e ".[dev]"

# Preflight check (verifies all tools are installed)
python preflight.py

# Run all benchmarks (~3-5 hours)
python benchmark_all_tools.py

# Run individual benchmarks
python benchmark_quality.py
python benchmark_retrieval.py
python benchmark_answer_quality.py
python benchmark_pipeline.py
python benchmark_markcrawl.py

# Regenerate this README from report data
python generate_readme.py
```

```bash
docker build -t llm-crawler-benchmarks .
docker run --rm \
  -e OPENAI_API_KEY \
  -v $(pwd)/reports:/app/reports \
  -v $(pwd)/runs:/app/runs \
  llm-crawler-benchmarks
```

v2.0 -- 2026-05-12
When benchmark methodology changes (new sites, different scoring, updated tool versions), we increment the version. Results from different versions are not directly comparable. See METHODOLOGY.md for the full test setup.
Other projects benchmark parts of the web scraping pipeline:
- Firecrawl scrape-evals -- 1,000-URL extraction quality benchmark (precision/recall). Single-page quality only; no speed, retrieval, or LLM answer evaluation.
- WCXB -- 2,008-page content extraction leaderboard with word-level F1. Covers traditional tools (trafilatura, readability) but not LLM-era crawlers.
- Spider.cloud benchmark -- 3-tool comparison (Firecrawl, Crawl4AI, Spider) on throughput, cost, and RAG retrieval accuracy.
This project differs by evaluating the full RAG pipeline -- from crawl through chunk, embed, retrieve, and LLM answer -- across 7 tools, 8 sites, and 5 dimensions including downstream answer quality and cost at scale.
The self_improvement/ directory contains a 9-spec review framework for
auditing benchmark quality. See self_improvement/MASTER.md.
MIT