A recursive documentation site crawler that saves pages as Markdown files, preserving the original directory structure. Built with Crawl4AI.
Features:

- Recursively crawls documentation sites
- Preserves original directory structure
- Saves pages as clean Markdown files
- Configurable concurrency and rate limiting
- Language filtering (excludes non-English docs by default)
- Skips assets, images, and non-documentation resources
Setup:

```bash
python3 -m venv env
source env/bin/activate  # On Windows: env\Scripts\activate
pip install -r requirements.txt
playwright install chromium
```

Basic usage:

```bash
python crawl.py <url>

# Crawl Ultralytics docs
python crawl.py https://docs.ultralytics.com
# Crawl with custom output directory
python crawl.py https://docs.python.org -o python-docs
# Limit to 50 pages (useful for testing)
python crawl.py https://docs.example.com --max-pages 50
# Include all language versions
python crawl.py https://docs.example.com --include-all-langs
# Custom language exclusions
python crawl.py https://docs.example.com --exclude-langs zh,ko,ja
# Faster crawling (more concurrent requests, less delay)
python crawl.py https://docs.example.com --max-concurrent 10 --delay 0.5
# Quiet mode (suppress progress output)
python crawl.py https://docs.example.com -q
```

Options:

| Option | Default | Description |
|---|---|---|
| `url` | (required) | Base URL of the documentation site |
| `-o, --output` | auto | Output directory (derived from domain if not specified) |
| `--max-pages` | unlimited | Maximum number of pages to crawl |
| `--max-concurrent` | 5 | Maximum concurrent requests |
| `--delay` | 1.0 | Delay between batches in seconds |
| `--exclude-langs` | zh,ko,ja,... | Comma-separated language codes to exclude |
| `--include-all-langs` | false | Include all language versions |
| `-q, --quiet` | false | Suppress progress output |
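These options correspond to a conventional argparse interface. The sketch below is an assumption about how crawl.py might declare them, not its actual source:

```python
# Hypothetical sketch of crawl.py's CLI surface; the real parser may differ.
import argparse

parser = argparse.ArgumentParser(description="Recursive documentation crawler")
parser.add_argument("url", help="Base URL of the documentation site")
parser.add_argument("-o", "--output", default=None,
                    help="Output directory (derived from domain if omitted)")
parser.add_argument("--max-pages", type=int, default=None,
                    help="Maximum number of pages to crawl")
parser.add_argument("--max-concurrent", type=int, default=5,
                    help="Maximum concurrent requests")
parser.add_argument("--delay", type=float, default=1.0,
                    help="Delay between batches in seconds")
parser.add_argument("--exclude-langs",
                    default="zh,ko,ja,de,es,fr,pt,ru,hi,ar,it,nl,tr,vi",
                    help="Comma-separated language codes to exclude")
parser.add_argument("--include-all-langs", action="store_true",
                    help="Include all language versions")
parser.add_argument("-q", "--quiet", action="store_true",
                    help="Suppress progress output")
args = parser.parse_args()
```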
The crawler creates a directory structure mirroring the original site:
```
python-docs/
├── index.md
├── quickstart.md
├── tutorial/
│   ├── index.md
│   └── basics.md
├── reference/
│   ├── api.md
│   └── functions.md
└── _crawl_summary.txt
```
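The URL-to-path mapping behind this layout can be sketched as follows (helper names are illustrative, not the crawler's actual code):

```python
from pathlib import Path
from urllib.parse import urlparse

def url_to_path(url: str, out_dir: str) -> Path:
    """Map a page URL to a local Markdown path (illustrative)."""
    path = urlparse(url).path
    if not path or path.endswith("/"):
        path += "index"                              # directory URLs become index.md
    return Path(out_dir) / (path.lstrip("/") + ".md")

def save_page(url: str, markdown: str, out_dir: str) -> None:
    """Write the page, creating parent directories to mirror the site."""
    dest = url_to_path(url, out_dir)
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_text(f"<!-- Source: {url} -->\n\n{markdown}")

# save_page("https://docs.python.org/tutorial/basics", "# Tutorial Basics\n...",
#           "python-docs")  ->  python-docs/tutorial/basics.md
```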
Each Markdown file includes a source URL comment:
```markdown
<!-- Source: https://docs.python.org/tutorial/basics -->

# Tutorial Basics
...
```

After crawling, you can generate an intelligent index with AI-powered summaries and keywords using `generate_index_llm.py`.
Install the additional dependency:

```bash
pip install langchain-openai
```

Environment variables:

| Variable | Required | Description |
|---|---|---|
| `LLM_GATEWAY_API_KEY` | Yes | API key for the LLM provider |
| `LLM_GATEWAY_BASE_URL` | No | Base URL for an OpenAI-compatible API (default: https://api.openai.com/v1) |
| `LLM_MODEL` | No | Model to use (default: gpt-4o-mini) |
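Internally, the script can build its client from these variables. A rough sketch using langchain-openai (the exact wiring inside generate_index_llm.py is an assumption):

```python
import os
from langchain_openai import ChatOpenAI

# Required key fails fast; the optional variables fall back to the defaults above.
llm = ChatOpenAI(
    api_key=os.environ["LLM_GATEWAY_API_KEY"],
    base_url=os.getenv("LLM_GATEWAY_BASE_URL", "https://api.openai.com/v1"),
    model=os.getenv("LLM_MODEL", "gpt-4o-mini"),
)
# llm.invoke("Summarize this page in one sentence: ...")
```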
```bash
# Set your API key
export LLM_GATEWAY_API_KEY="your-api-key"
# Generate index for crawled docs
python generate_index_llm.py --docs-dir python-docs
# Custom output file
python generate_index_llm.py --docs-dir python-docs --output python-docs/INDEX.md
# Use a different LLM provider/model (e.g., Anthropic via OpenRouter)
export LLM_GATEWAY_BASE_URL="https://openrouter.ai/api/v1"
export LLM_MODEL="anthropic/claude-3-haiku"
python generate_index_llm.py --docs-dir python-docs
# Start fresh (ignore cache)
python generate_index_llm.py --docs-dir python-docs --no-cache
```

Options:

| Option | Default | Description |
|---|---|---|
| `--docs-dir` | docs | Directory containing crawled Markdown files |
| `--output` | docs/INDEX_LLM.md | Output index file path |
| `--cache` | docs/.index_cache.json | Cache file for resuming interrupted runs |
| `--no-cache` | false | Ignore existing cache and start fresh |
| `--api-key` | env var | LLM Gateway API key |
| `--base-url` | env var | LLM API base URL |
| `--model` | env var | LLM model to use |
The index generator is designed to be cheap to run and safe to interrupt:

- Smart truncation: sends only the first ~1000 tokens of each doc to the LLM
- JSON caching: saves results after each file, so you can resume if interrupted
- Batch writing: writes the index every 10 files for progress visibility
- Fallback handling: degrades gracefully when an LLM call fails
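A minimal sketch of the truncation-plus-cache pattern described above (file layout and helper names are illustrative, not the script's actual code):

```python
import json
from pathlib import Path

CACHE = Path("python-docs/.index_cache.json")
MAX_CHARS = 4000  # ~1000 tokens at a rough 4-chars-per-token heuristic

def summarize_all(md_files, summarize):
    """Summarize each file once, persisting results after every file."""
    cache = json.loads(CACHE.read_text()) if CACHE.exists() else {}
    for path in md_files:
        key = str(path)
        if key in cache:                               # resume: skip finished files
            continue
        text = path.read_text()[:MAX_CHARS]           # smart truncation
        try:
            cache[key] = summarize(text)
        except Exception as exc:                       # fallback on LLM failure
            cache[key] = {"summary": "", "error": str(exc)}
        CACHE.write_text(json.dumps(cache, indent=2))  # save after each file
    return cache
```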
Run the test suite:
```bash
source env/bin/activate
# Run all unit tests (fast)
python tests/run_tests.py
# Include integration tests (slower, requires network)
python tests/run_tests.py --integration
# Run individual test files
python tests/test_crawl.py # Crawler tests only
python tests/test_generate_index.py # Index generator tests only
```

The crawler works as follows:

- Starts at the provided base URL
- Extracts all internal links from each page
- Filters links based on domain, language, and resource type
- Crawls pages concurrently in batches
- Saves each page as Markdown, preserving directory structure
- Repeats until all pages are crawled or limit is reached
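In code, this is a breadth-first crawl over batches. A hedged sketch of the loop, where fetch_page, is_allowed, and save_markdown are placeholders for the project's actual functions:

```python
import asyncio

async def crawl(base_url, fetch_page, is_allowed, save_markdown,
                max_concurrent=5, delay=1.0, max_pages=None):
    """Breadth-first crawl in batches. fetch_page(url) -> (markdown, links)."""
    seen, frontier, crawled = {base_url}, [base_url], 0
    while frontier:
        batch, frontier = frontier[:max_concurrent], frontier[max_concurrent:]
        results = await asyncio.gather(*(fetch_page(u) for u in batch))
        for url, (markdown, links) in zip(batch, results):
            save_markdown(url, markdown)              # mirrors URL path on disk
            crawled += 1
            if max_pages and crawled >= max_pages:    # stop at the page limit
                return crawled
            for link in links:                        # enqueue unseen, allowed links
                if is_allowed(link) and link not in seen:
                    seen.add(link)
                    frontier.append(link)
        await asyncio.sleep(delay)                    # rate limit between batches
    return crawled
```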
By default, the crawler skips:

- Languages: zh, ko, ja, de, es, fr, pt, ru, hi, ar, it, nl, tr, vi
- Path segments: search, assets, javascripts, stylesheets, _static, _images, static, js, css
- File extensions: .png, .jpg, .jpeg, .gif, .svg, .pdf, .zip, .whl, .tar.gz, .ico, .webp
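A sketch of how these defaults might combine into a single link filter (function and constant names are illustrative, not the crawler's actual code):

```python
from urllib.parse import urlparse

EXCLUDED_LANGS = {"zh", "ko", "ja", "de", "es", "fr", "pt", "ru",
                  "hi", "ar", "it", "nl", "tr", "vi"}
EXCLUDED_SEGMENTS = {"search", "assets", "javascripts", "stylesheets",
                     "_static", "_images", "static", "js", "css"}
EXCLUDED_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".svg", ".pdf",
                       ".zip", ".whl", ".tar.gz", ".ico", ".webp")

def is_allowed(url: str, base_domain: str) -> bool:
    """Illustrative filter: same domain, no excluded language/segment/extension."""
    parsed = urlparse(url)
    if parsed.netloc != base_domain:
        return False
    segments = [s for s in parsed.path.split("/") if s]
    if segments and segments[0] in EXCLUDED_LANGS:     # e.g. /zh/quickstart
        return False
    if any(s in EXCLUDED_SEGMENTS for s in segments):  # e.g. /assets/logo
        return False
    return not parsed.path.lower().endswith(EXCLUDED_EXTENSIONS)
```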
License: MIT