Recursively crawl any documentation site and save pages as Markdown files. Preserves directory structure. Built with Crawl4AI + Playwright
Docs Crawler

A recursive documentation site crawler that saves pages as Markdown files, preserving the original directory structure. Built with Crawl4AI.

Features

  • Recursively crawls documentation sites
  • Preserves original directory structure
  • Saves pages as clean Markdown files
  • Configurable concurrency and rate limiting
  • Language filtering (excludes non-English docs by default)
  • Skips assets, images, and non-documentation resources

Installation

1. Create virtual environment

python3 -m venv env
source env/bin/activate  # On Windows: env\Scripts\activate

2. Install dependencies

pip install -r requirements.txt

3. Install browser

playwright install chromium

Usage

Basic usage

python crawl.py <url>

Examples

# Crawl Ultralytics docs
python crawl.py https://docs.ultralytics.com

# Crawl with custom output directory
python crawl.py https://docs.python.org -o python-docs

# Limit to 50 pages (useful for testing)
python crawl.py https://docs.example.com --max-pages 50

# Include all language versions
python crawl.py https://docs.example.com --include-all-langs

# Custom language exclusions
python crawl.py https://docs.example.com --exclude-langs zh,ko,ja

# Faster crawling (more concurrent requests, less delay)
python crawl.py https://docs.example.com --max-concurrent 10 --delay 0.5

# Quiet mode (suppress progress output)
python crawl.py https://docs.example.com -q

Command-line options

| Option | Default | Description |
|--------|---------|-------------|
| url | (required) | Base URL of the documentation site |
| -o, --output | auto | Output directory (derived from the domain if not specified) |
| --max-pages | unlimited | Maximum number of pages to crawl |
| --max-concurrent | 5 | Maximum concurrent requests |
| --delay | 1.0 | Delay between batches, in seconds |
| --exclude-langs | zh,ko,ja,... | Comma-separated language codes to exclude |
| --include-all-langs | false | Include all language versions |
| -q, --quiet | false | Suppress progress output |

Output

The crawler creates a directory structure mirroring the original site:

python-docs/
├── index.md
├── quickstart.md
├── tutorial/
│   ├── index.md
│   └── basics.md
├── reference/
│   ├── api.md
│   └── functions.md
└── _crawl_summary.txt
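The mapping from page URLs to file paths can be sketched roughly as follows. This is a simplified illustration, not the crawler's actual code; url_to_path is a hypothetical helper name:

```python
from urllib.parse import urlparse


def url_to_path(url: str, output_dir: str = "python-docs") -> str:
    """Map a page URL to a Markdown file path, mirroring the site layout."""
    path = urlparse(url).path.strip("/")
    if not path:              # site root -> index.md
        path = "index"
    elif url.endswith("/"):   # directory-style URL -> <dir>/index.md
        path += "/index"
    return f"{output_dir}/{path}.md"


print(url_to_path("https://docs.python.org/"))                 # python-docs/index.md
print(url_to_path("https://docs.python.org/tutorial/basics"))  # python-docs/tutorial/basics.md
```

Directory-style URLs (those ending in a slash) map to an index.md inside the corresponding folder, which is how the tutorial/index.md entry above arises.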

Each Markdown file includes a source URL comment:

<!-- Source: https://docs.python.org/tutorial/basics -->

# Tutorial Basics
...
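Writing a page with this header might look like the following sketch (save_page is a hypothetical helper, not the project's implementation):

```python
from pathlib import Path


def save_page(dest: Path, source_url: str, markdown: str) -> None:
    """Write a crawled page to disk, prefixed with a source-URL comment."""
    dest.parent.mkdir(parents=True, exist_ok=True)  # create tutorial/ etc. as needed
    dest.write_text(f"<!-- Source: {source_url} -->\n\n{markdown}", encoding="utf-8")


save_page(Path("python-docs/tutorial/basics.md"),
          "https://docs.python.org/tutorial/basics",
          "# Tutorial Basics\n...")
```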

LLM Index Generator

After crawling, you can generate an intelligent index with AI-powered summaries and keywords using generate_index_llm.py.

Setup

pip install langchain-openai

Environment Variables

| Variable | Required | Description |
|----------|----------|-------------|
| LLM_GATEWAY_API_KEY | Yes | API key for the LLM provider |
| LLM_GATEWAY_BASE_URL | No | Base URL for an OpenAI-compatible API (default: https://api.openai.com/v1) |
| LLM_MODEL | No | Model to use (default: gpt-4o-mini) |
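Resolving these variables with their documented defaults could look like this sketch (resolve_llm_config is a hypothetical helper, not the script's actual code):

```python
def resolve_llm_config(env: dict) -> dict:
    """Resolve LLM settings from an environment mapping, applying the documented defaults."""
    if "LLM_GATEWAY_API_KEY" not in env:
        raise KeyError("LLM_GATEWAY_API_KEY is required")
    return {
        "api_key": env["LLM_GATEWAY_API_KEY"],
        "base_url": env.get("LLM_GATEWAY_BASE_URL", "https://api.openai.com/v1"),
        "model": env.get("LLM_MODEL", "gpt-4o-mini"),
    }


cfg = resolve_llm_config({"LLM_GATEWAY_API_KEY": "your-api-key"})
print(cfg["model"])  # gpt-4o-mini
```

In the real script you would pass os.environ rather than a literal dict.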

Usage

# Set your API key
export LLM_GATEWAY_API_KEY="your-api-key"

# Generate index for crawled docs
python generate_index_llm.py --docs-dir python-docs

# Custom output file
python generate_index_llm.py --docs-dir python-docs --output python-docs/INDEX.md

# Use a different LLM provider/model (e.g., Anthropic via OpenRouter)
export LLM_GATEWAY_BASE_URL="https://openrouter.ai/api/v1"
export LLM_MODEL="anthropic/claude-3-haiku"
python generate_index_llm.py --docs-dir python-docs

# Start fresh (ignore cache)
python generate_index_llm.py --docs-dir python-docs --no-cache

Command-line Options

| Option | Default | Description |
|--------|---------|-------------|
| --docs-dir | docs | Directory containing crawled Markdown files |
| --output | docs/INDEX_LLM.md | Output index file path |
| --cache | docs/.index_cache.json | Cache file for resuming interrupted runs |
| --no-cache | false | Ignore the existing cache and start fresh |
| --api-key | (env var) | LLM Gateway API key |
| --base-url | (env var) | LLM API base URL |
| --model | (env var) | LLM model to use |

Features

  • Smart truncation: Only sends first ~1000 tokens of each doc to the LLM
  • JSON caching: Saves results after each file, so you can resume if interrupted
  • Batch writing: Writes index every 10 files for progress visibility
  • Fallback handling: Gracefully handles LLM failures
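The cache-and-resume behaviour can be sketched as follows. This is a simplified stand-in for the script's actual caching logic; summarize_all and the cache layout are illustrative only:

```python
import json
from pathlib import Path


def load_cache(cache_path: Path) -> dict:
    """Load previously generated summaries so an interrupted run can resume."""
    if cache_path.exists():
        return json.loads(cache_path.read_text(encoding="utf-8"))
    return {}


def summarize_all(files, cache_path: Path, summarize) -> dict:
    cache = load_cache(cache_path)
    for name in files:
        if name in cache:  # already summarized on a previous run -> skip
            continue
        cache[name] = summarize(name)
        # Persist after every file so a crash loses at most one result.
        cache_path.write_text(json.dumps(cache), encoding="utf-8")
    return cache


results = summarize_all(["a.md", "b.md"], Path(".index_cache.json"),
                        lambda name: f"summary of {name}")
print(results["a.md"])  # summary of a.md
```

Running the same call again skips both files, since their entries are already in the cache.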

Testing

Run the test suite:

source env/bin/activate

# Run all unit tests (fast)
python tests/run_tests.py

# Include integration tests (slower, requires network)
python tests/run_tests.py --integration

# Run individual test files
python tests/test_crawl.py          # Crawler tests only
python tests/test_generate_index.py  # Index generator tests only

How It Works

  1. Starts at the provided base URL
  2. Extracts all internal links from each page
  3. Filters links based on domain, language, and resource type
  4. Crawls pages concurrently in batches
  5. Saves each page as Markdown, preserving directory structure
  6. Repeats until all pages are crawled or limit is reached
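The loop above can be illustrated with a toy in-memory link graph, so no network is involved. SITE and crawl are illustrative only, not the crawler's real data structures:

```python
from collections import deque

# Toy link graph standing in for fetched pages: url -> outgoing internal links.
SITE = {
    "/": ["/quickstart", "/tutorial/"],
    "/quickstart": ["/"],
    "/tutorial/": ["/tutorial/basics", "/assets/logo.png"],
    "/tutorial/basics": ["/tutorial/"],
}


def crawl(start, max_pages=None):
    """Breadth-first crawl: visit each page once, skipping asset URLs."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        if max_pages is not None and len(order) >= max_pages:
            break
        url = queue.popleft()
        order.append(url)
        for link in SITE.get(url, []):
            if link in seen or link.startswith("/assets/"):
                continue  # already queued, or filtered as a non-doc resource
            seen.add(link)
            queue.append(link)
    return order


print(crawl("/"))  # ['/', '/quickstart', '/tutorial/', '/tutorial/basics']
```

In the real crawler, "fetching" a page also converts it to Markdown and saves it; here the graph lookup stands in for that step.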

Configuration

Default excluded languages

zh, ko, ja, de, es, fr, pt, ru, hi, ar, it, nl, tr, vi

Default skipped paths

search, assets, javascripts, stylesheets, _static, _images, static, js, css

Default skipped extensions

.png, .jpg, .jpeg, .gif, .svg, .pdf, .zip, .whl, .tar.gz, .ico, .webp
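Taken together, these defaults amount to a URL filter along the following lines. The constants are copied from the lists above; should_crawl is a hypothetical name, not necessarily how the crawler structures this check:

```python
from urllib.parse import urlparse

EXCLUDED_LANGS = {"zh", "ko", "ja", "de", "es", "fr", "pt", "ru",
                  "hi", "ar", "it", "nl", "tr", "vi"}
SKIPPED_PATHS = {"search", "assets", "javascripts", "stylesheets",
                 "_static", "_images", "static", "js", "css"}
SKIPPED_EXTS = (".png", ".jpg", ".jpeg", ".gif", ".svg", ".pdf",
                ".zip", ".whl", ".tar.gz", ".ico", ".webp")


def should_crawl(url: str) -> bool:
    """Apply the default filters: language prefix, skipped paths, skipped extensions."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if parts and parts[0] in EXCLUDED_LANGS:  # e.g. /zh/... -> skip
        return False
    if any(p in SKIPPED_PATHS for p in parts):
        return False
    return not url.lower().endswith(SKIPPED_EXTS)


print(should_crawl("https://docs.example.com/tutorial/basics"))  # True
print(should_crawl("https://docs.example.com/zh/tutorial"))      # False
print(should_crawl("https://docs.example.com/assets/logo.png"))  # False
```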

License

MIT
