AIMLPM/llm-crawler-benchmarks

llm-crawler-benchmarks

Which web crawler is best for LLM/RAG pipelines? We tested 7 tools across 8 sites to find out.

Head-to-head benchmark suite comparing web crawlers on speed, extraction quality, retrieval quality, LLM answer quality, and cost at scale. Every benchmark is reproducible from a single command.

Why this exists

Most crawler benchmarks test one dimension (speed or extraction accuracy) in isolation. But in an LLM/RAG pipeline, the crawler is stage 1 -- everything downstream (chunking, embedding, retrieval, LLM generation) depends on what the crawler produces. A tool that is fast but outputs noisy Markdown will inflate your embedding costs and degrade retrieval quality.

This project measures the full pipeline: crawl, chunk, embed, retrieve, and generate an LLM answer -- then scores each stage independently so you can see where the differences actually matter.

Key Findings

| Dimension | Winner | Key metric | Runner-up |
| --- | --- | --- | --- |
| Speed | scrapy+md | 5.0 pages/sec | markcrawl (2.7 p/s) |
| Extraction quality | markcrawl | 99% content signal, 53 words preamble | scrapy+md (92%, 500 words) |
| Retrieval quality | crawlee | 94% Hit@10, 0.765 MRR | crawl4ai-raw (95%, 0.763) |
| LLM answer quality | crawl4ai | 4.72/5 overall score | crawl4ai-raw (4.70/5) |
| Cost at scale | markcrawl | $4,505/yr (100K pages, 1K q/day) | scrapy+md ($5,464/yr) |
| Pipeline timing | markcrawl | 440.7s end-to-end, $0.40 | scrapy+md (451.3s, $1.05) |
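
For readers unfamiliar with the retrieval metrics in the table, here is a minimal sketch of how Hit@10 and MRR are conventionally computed. It assumes one known relevant document per query, which may not match how this benchmark defines relevance; the function names and document IDs are illustrative and are not taken from the benchmark scripts.

```python
from typing import Sequence


def reciprocal_rank(ranked_ids: Sequence[str], relevant_id: str) -> float:
    """Return 1/rank of the relevant document, or 0.0 if it was not retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0


def hit_at_k(ranked_ids: Sequence[str], relevant_id: str, k: int = 10) -> bool:
    """True if the relevant document appears in the top k results."""
    return relevant_id in ranked_ids[:k]


def summarize(results: Sequence[tuple[Sequence[str], str]]) -> dict[str, float]:
    """Average both metrics over (ranked_ids, relevant_id) pairs, one per query."""
    n = len(results)
    mrr = sum(reciprocal_rank(ranked, rel) for ranked, rel in results) / n
    hits = sum(hit_at_k(ranked, rel) for ranked, rel in results) / n
    return {"mrr": mrr, "hit@10": hits}


# Hypothetical queries: relevant doc at rank 1, at rank 2, and missing.
example = [
    (["a", "b"], "a"),
    (["a", "b"], "b"),
    (["a", "b"], "z"),
]
print(summarize(example))
```

A tool can lead on one metric and trail on the other: Hit@10 only asks whether the right document is somewhere in the top 10, while MRR rewards placing it near the top.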

Leaderboard (Benchmark v2.0)

All 7 tools, sorted by speed. 8 sites, 109 retrieval queries, scored on 5 dimensions.

| Tool | Speed (p/s) | Content Signal | MRR | Answer (/5) | Cost (100K/yr) |
| --- | --- | --- | --- | --- | --- |
| scrapy+md | 5.0 | 92% | 0.176 | 3.68 | $5,464 |
| markcrawl | 2.7 | 99% | 0.341 | 3.77 | $4,505 |
| playwright | 2.5 | 67% | 0.758 | 4.48 | $7,320 |
| crawl4ai | 1.4 | 82% | 0.757 | 4.72 | $6,960 |
| crawl4ai-raw | 1.4 | 83% | 0.763 | 4.70 | $6,961 |
| crawlee | 1.3 | 68% | 0.765 | 4.68 | $7,467 |
| colly+md | 1.0 | 70% | 0.459 | 4.36 | $7,213 |

Column definitions: Speed = pages/sec (median of 3 runs). Content Signal = (total words - preamble) / total words (higher = cleaner). MRR = Mean Reciprocal Rank, best retrieval mode per tool. Answer = LLM answer quality scored 1-5 by gpt-4o-mini. Cost = annual RAG pipeline cost at 100K pages, 1K queries/day.
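
The Content Signal formula above can be sketched in a few lines. The second helper estimates how many extra words a larger preamble adds across a corpus; it assumes the preamble figures in Key Findings (53 and 500 words) are per-page averages, which the README does not state explicitly. Both function names are illustrative.

```python
def content_signal(total_words: int, preamble_words: int) -> float:
    """Content Signal = (total words - preamble) / total words."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return (total_words - preamble_words) / total_words


def extra_preamble_words(preamble_a: int, preamble_b: int, pages: int) -> int:
    """Extra words tool A feeds downstream versus tool B.

    Assumes both preamble counts are per-page averages (an assumption).
    """
    return (preamble_a - preamble_b) * pages


# Hypothetical 1,000-word page carrying a 53-word preamble.
print(content_signal(1000, 53))

# Preamble gap between 500 and 53 words per page over 100K pages.
print(extra_preamble_words(500, 53, 100_000))
```

Under that per-page assumption, the gap comes to 44.7 million extra words that would be chunked and embedded without adding any answerable content.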

Bottom line: scrapy+md is the fastest at 5.0 pages/sec, with markcrawl second at 2.7. markcrawl wins on pipeline timing (440.7s and $0.40 end-to-end), extraction quality (99% content signal), and cost at scale ($4,505/yr). Answer quality falls in a 3.68-4.72/5 range across all tools, with crawl4ai narrowly leading. Among the four browser-based tools, retrieval quality barely differs (MRR 0.757-0.765) -- switching retrieval mode (e.g., to reranked) gains more than switching crawlers.

Tools Compared

| Tool | Type | JS rendering | Notes |
| --- | --- | --- | --- |
| markcrawl | Python | Optional | Markdown-first, lowest preamble |
| scrapy+md | Python | No | Fastest raw HTTP crawler |
| crawl4ai | Python | Built-in | AI-native, browser-based |
| crawl4ai-raw | Python | Built-in | crawl4ai with raw HTML output |
| colly+md | Go | No | Fast compiled crawler |
| crawlee | Python | Built-in | Apify's browser crawler |
| playwright | Python | Built-in | Microsoft's browser automation |

All tools output Markdown via the same html-to-markdown pipeline (except crawl4ai-raw). See METHODOLOGY.md for tool configurations and fairness decisions.

Sites Tested

| Site | Pages | Type |
| --- | --- | --- |
| quotes.toscrape.com | 15 | Simple paginated HTML |
| books.toscrape.com | 60 | E-commerce catalog |
| fastapi.tiangolo.com | 153 | API docs (code blocks, tutorials) |
| docs.python.org | 500 | Standard library reference |
| react.dev | 500 | SPA, JS-rendered |
| en.wikipedia.org | 50 | Tables, infoboxes, citations |
| docs.stripe.com | 500 | Tabbed content, code samples |
| github.blog | 200 | Blog articles, images |

Reports

| Report | Question it answers |
| --- | --- |
| Speed Comparison | Which crawler is fastest? |
| Quality Comparison | Which produces the cleanest Markdown? |
| Retrieval Comparison | Does cleaner Markdown improve retrieval? |
| Answer Quality | Does better retrieval improve LLM answers? |
| Cost at Scale | What does each crawler cost at 100K+ pages? |
| Pipeline Timing | How long does the full RAG pipeline take? |
| MarkCrawl Self-Benchmark | MarkCrawl standalone performance |
| Methodology | How were these benchmarks run? |

Transparency

This benchmark is maintained by the creators of markcrawl, one of the tools tested. We designed the methodology to be fair (identical seed URLs, randomized execution order, published scripts), but readers should be aware of this relationship. All code and data are published so results can be independently verified. If you rerun and get different results, open an issue.
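
As an illustration of what randomized execution order can look like, the sketch below shuffles the tool list with a seeded generator so each iteration gets a different but reproducible order. The seeding scheme and the `execution_order` function are assumptions made for this example and are not the repository's actual scripts.

```python
import random

# Tool names as listed in this README.
TOOLS = [
    "markcrawl", "scrapy+md", "crawl4ai", "crawl4ai-raw",
    "colly+md", "crawlee", "playwright",
]


def execution_order(tools: list[str], iteration: int, base_seed: int = 0) -> list[str]:
    """Return a shuffled copy of tools, reproducible for a given iteration.

    Seeding with base_seed + iteration is a hypothetical scheme.
    """
    rng = random.Random(base_seed + iteration)
    order = list(tools)  # copy so the caller's list is left untouched
    rng.shuffle(order)
    return order


for i in range(3):
    print(i, execution_order(TOOLS, iteration=i))
```

Shuffling per iteration keeps any one tool from always running first (cold caches) or last (warmed CDN), while a fixed seed lets a rerun reproduce the same order.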

Limitations

  • Single machine, single location. All benchmarks run on one machine in one geographic location. Network latency to each site varies by location, so absolute pages/sec numbers will differ on your hardware.
  • Live sites introduce variance. These are real public websites, not frozen snapshots. Server load, CDN caching, and content changes cause run-to-run variance. We report medians across 3 iterations to reduce noise.
  • Markdown output only. We evaluate Markdown extraction quality. Tools that excel at structured data extraction (JSON, tables) may rank differently on those tasks.
  • 7 tools, not all crawlers. We test the most common open-source crawlers used in RAG pipelines. Tools like Apify, ScrapingBee, and others are not included. See "Including a tool" below to add one.
  • LLM-judged quality. Answer quality is scored by gpt-4o-mini, not human reviewers. LLM judges have known biases (verbosity preference, position effects). We mitigate with 4-dimension scoring but the scores are not ground truth.
  • No anti-bot, authentication, or JS-heavy SPA testing. All test sites are publicly accessible and crawler-friendly. Results do not apply to sites with bot detection, rate limiting, or login walls.
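
To make the median-of-3 reporting mentioned above concrete, here is a minimal sketch of reducing several timed runs to a single pages/sec figure. The run data is hypothetical and the function names are illustrative.

```python
from statistics import median


def pages_per_second(pages: int, elapsed_seconds: float) -> float:
    """Throughput of a single crawl run."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return pages / elapsed_seconds


def median_speed(runs: list[tuple[int, float]]) -> float:
    """Median pages/sec over (pages, elapsed_seconds) runs."""
    return median(pages_per_second(p, t) for p, t in runs)


# Hypothetical iterations: the third hit a slow server response.
runs = [(100, 20.0), (100, 25.0), (100, 50.0)]
print(median_speed(runs))  # 4.0
```

The median ignores the single slow outlier here, whereas the mean of the same three runs would be pulled down to about 3.67 pages/sec.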

Including a tool

We welcome contributions. To add a crawler to the benchmark:

  1. Open an issue describing the tool, its license, and what makes it relevant for LLM/RAG pipelines.
  2. Submit a PR with a runner script in runners/ that matches the interface of existing runners (accepts a URL list, outputs Markdown + JSONL index). See runners/README.md for the spec.
  3. The tool must be open-source with a published package (pip, npm, go module, etc.).
  4. We run all benchmarks on the same hardware with the same sites and queries. You don't need to provide benchmark results -- just the runner.
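
As a rough idea of the output side of a runner (Markdown files plus a JSONL index), the sketch below writes one `.md` file per page and one index record per line. The authoritative interface is in runners/README.md; the file naming, the `index.jsonl` name, and the record fields here are hypothetical and chosen only for illustration.

```python
import hashlib
import json
from pathlib import Path
from typing import Iterable


def write_outputs(pages: Iterable[tuple[str, str]], out_dir: str) -> Path:
    """Write one Markdown file per (url, markdown) pair plus a JSONL index.

    Record fields (url, file, words) are hypothetical, not the real spec.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    index_path = out / "index.jsonl"
    with index_path.open("w", encoding="utf-8") as index:
        for url, markdown in pages:
            # Derive a stable, filesystem-safe file name from the URL.
            name = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16] + ".md"
            (out / name).write_text(markdown, encoding="utf-8")
            record = {"url": url, "file": name, "words": len(markdown.split())}
            index.write(json.dumps(record) + "\n")
    return index_path
```

A real runner would obtain each page's Markdown from the crawler under test; this helper takes already-converted pages so the output format can be checked without network access.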

Reproducing these results

# Install dependencies
pip install -e ".[dev]"

# Preflight check (verifies all tools are installed)
python preflight.py

# Run all benchmarks (~3-5 hours)
python benchmark_all_tools.py

# Run individual benchmarks
python benchmark_quality.py
python benchmark_retrieval.py
python benchmark_answer_quality.py
python benchmark_pipeline.py
python benchmark_markcrawl.py

# Regenerate this README from report data
python generate_readme.py

Docker

docker build -t llm-crawler-benchmarks .
docker run --rm \
  -e OPENAI_API_KEY \
  -v "$(pwd)/reports:/app/reports" \
  -v "$(pwd)/runs:/app/runs" \
  llm-crawler-benchmarks

Benchmark version

v2.0 -- 2026-05-12

When benchmark methodology changes (new sites, different scoring, updated tool versions), we increment the version. Results from different versions are not directly comparable. See METHODOLOGY.md for the full test setup.

Related Work

Other projects benchmark parts of the web scraping pipeline:

  • Firecrawl scrape-evals -- 1,000-URL extraction quality benchmark (precision/recall). Single-page quality only; no speed, retrieval, or LLM answer evaluation.
  • WCXB -- 2,008-page content extraction leaderboard with word-level F1. Covers traditional tools (trafilatura, readability) but not LLM-era crawlers.
  • Spider.cloud benchmark -- 3-tool comparison (Firecrawl, Crawl4AI, Spider) on throughput, cost, and RAG retrieval accuracy.

This project differs by evaluating the full RAG pipeline -- from crawl through chunk, embed, retrieve, and LLM answer -- across 7 tools, 8 sites, and 5 dimensions including downstream answer quality and cost at scale.

Self-Improvement Framework

The self_improvement/ directory contains a 9-spec review framework for auditing benchmark quality. See self_improvement/MASTER.md.

License

MIT
