Recursively crawl any documentation site and save pages as Markdown files. Preserves directory structure. Built with Crawl4AI + Playwright
Docs Crawler

A recursive documentation site crawler that saves pages as Markdown files, preserving the original directory structure. Built with Crawl4AI.

Features

  • Recursively crawls documentation sites
  • Preserves original directory structure
  • Saves pages as clean Markdown files
  • Configurable concurrency and rate limiting
  • Language filtering (excludes non-English docs by default)
  • Skips assets, images, and non-documentation resources

Installation

1. Create virtual environment

python3 -m venv env
source env/bin/activate  # On Windows: env\Scripts\activate

2. Install dependencies

pip install -r requirements.txt

3. Install browser

playwright install chromium

Usage

Basic usage

python crawl.py <url>

Examples

# Crawl Ultralytics docs
python crawl.py https://docs.ultralytics.com

# Crawl with custom output directory
python crawl.py https://docs.python.org -o python-docs

# Limit to 50 pages (useful for testing)
python crawl.py https://docs.example.com --max-pages 50

# Include all language versions
python crawl.py https://docs.example.com --include-all-langs

# Custom language exclusions
python crawl.py https://docs.example.com --exclude-langs zh,ko,ja

# Faster crawling (more concurrent requests, less delay)
python crawl.py https://docs.example.com --max-concurrent 10 --delay 0.5

# Quiet mode (suppress progress output)
python crawl.py https://docs.example.com -q

Command-line options

| Option | Default | Description |
|--------|---------|-------------|
| url | (required) | Base URL of the documentation site |
| -o, --output | auto | Output directory (derived from the domain if not specified) |
| --max-pages | unlimited | Maximum number of pages to crawl |
| --max-concurrent | 5 | Maximum concurrent requests |
| --delay | 1.0 | Delay between batches, in seconds |
| --exclude-langs | zh,ko,ja,... | Comma-separated language codes to exclude |
| --include-all-langs | false | Include all language versions |
| -q, --quiet | false | Suppress progress output |

Output

The crawler creates a directory structure mirroring the original site:

python-docs/
├── index.md
├── quickstart.md
├── tutorial/
│   ├── index.md
│   └── basics.md
├── reference/
│   ├── api.md
│   └── functions.md
└── _crawl_summary.txt
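The mapping from page URLs to file paths can be sketched roughly as follows. This is a simplified illustration, not the crawler's actual code; url_to_path is a hypothetical helper name:

```python
from urllib.parse import urlparse


def url_to_path(url: str, output_dir: str = "python-docs") -> str:
    """Map a page URL to a Markdown file path, mirroring the site layout."""
    path = urlparse(url).path.strip("/")
    if not path:              # site root -> index.md
        path = "index"
    elif url.endswith("/"):   # directory-style URL -> <dir>/index.md
        path += "/index"
    return f"{output_dir}/{path}.md"


print(url_to_path("https://docs.python.org/"))                 # python-docs/index.md
print(url_to_path("https://docs.python.org/tutorial/basics"))  # python-docs/tutorial/basics.md
```

Directory-style URLs (those ending in a slash) map to an index.md inside the corresponding folder, which is how the tutorial/index.md entry above arises.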

Each Markdown file includes a source URL comment:

<!-- Source: https://docs.python.org/tutorial/basics -->

# Tutorial Basics
...
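Writing a page with this header might look like the following sketch (save_page is a hypothetical helper, not the project's implementation):

```python
from pathlib import Path


def save_page(dest: Path, source_url: str, markdown: str) -> None:
    """Write a crawled page to disk, prefixed with a source-URL comment."""
    dest.parent.mkdir(parents=True, exist_ok=True)  # create tutorial/ etc. as needed
    dest.write_text(f"<!-- Source: {source_url} -->\n\n{markdown}", encoding="utf-8")


save_page(Path("python-docs/tutorial/basics.md"),
          "https://docs.python.org/tutorial/basics",
          "# Tutorial Basics\n...")
```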

LLM Index Generator

After crawling, you can generate an intelligent index with AI-powered summaries and keywords using generate_index_llm.py.

Setup

pip install langchain-openai

Environment Variables

| Variable | Required | Description |
|----------|----------|-------------|
| LLM_GATEWAY_API_KEY | Yes | API key for the LLM provider |
| LLM_GATEWAY_BASE_URL | No | Base URL for an OpenAI-compatible API (default: https://api.openai.com/v1) |
| LLM_MODEL | No | Model to use (default: gpt-4o-mini) |
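Resolving these variables with their documented defaults could look like this sketch (resolve_llm_config is a hypothetical helper, not the script's actual code):

```python
def resolve_llm_config(env: dict) -> dict:
    """Resolve LLM settings from an environment mapping, applying the documented defaults."""
    if "LLM_GATEWAY_API_KEY" not in env:
        raise KeyError("LLM_GATEWAY_API_KEY is required")
    return {
        "api_key": env["LLM_GATEWAY_API_KEY"],
        "base_url": env.get("LLM_GATEWAY_BASE_URL", "https://api.openai.com/v1"),
        "model": env.get("LLM_MODEL", "gpt-4o-mini"),
    }


cfg = resolve_llm_config({"LLM_GATEWAY_API_KEY": "your-api-key"})
print(cfg["model"])  # gpt-4o-mini
```

In the real script you would pass os.environ rather than a literal dict.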

Usage

# Set your API key
export LLM_GATEWAY_API_KEY="your-api-key"

# Generate index for crawled docs
python generate_index_llm.py --docs-dir python-docs

# Custom output file
python generate_index_llm.py --docs-dir python-docs --output python-docs/INDEX.md

# Use a different LLM provider/model (e.g., Anthropic via OpenRouter)
export LLM_GATEWAY_BASE_URL="https://openrouter.ai/api/v1"
export LLM_MODEL="anthropic/claude-3-haiku"
python generate_index_llm.py --docs-dir python-docs

# Start fresh (ignore cache)
python generate_index_llm.py --docs-dir python-docs --no-cache

Command-line Options

| Option | Default | Description |
|--------|---------|-------------|
| --docs-dir | docs | Directory containing crawled Markdown files |
| --output | docs/INDEX_LLM.md | Output index file path |
| --cache | docs/.index_cache.json | Cache file for resuming interrupted runs |
| --no-cache | false | Ignore the existing cache and start fresh |
| --api-key | (env var) | LLM Gateway API key |
| --base-url | (env var) | LLM API base URL |
| --model | (env var) | LLM model to use |

Features

  • Smart truncation: Only sends first ~1000 tokens of each doc to the LLM
  • JSON caching: Saves results after each file, so you can resume if interrupted
  • Batch writing: Writes index every 10 files for progress visibility
  • Fallback handling: Gracefully handles LLM failures
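The cache-and-resume behaviour can be sketched as follows. This is a simplified stand-in for the script's actual caching logic; summarize_all and the cache layout are illustrative only:

```python
import json
from pathlib import Path


def load_cache(cache_path: Path) -> dict:
    """Load previously generated summaries so an interrupted run can resume."""
    if cache_path.exists():
        return json.loads(cache_path.read_text(encoding="utf-8"))
    return {}


def summarize_all(files, cache_path: Path, summarize) -> dict:
    cache = load_cache(cache_path)
    for name in files:
        if name in cache:  # already summarized on a previous run -> skip
            continue
        cache[name] = summarize(name)
        # Persist after every file so a crash loses at most one result.
        cache_path.write_text(json.dumps(cache), encoding="utf-8")
    return cache


results = summarize_all(["a.md", "b.md"], Path(".index_cache.json"),
                        lambda name: f"summary of {name}")
print(results["a.md"])  # summary of a.md
```

Running the same call again skips both files, since their entries are already in the cache.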

Testing

Run the test suite:

source env/bin/activate

# Run all unit tests (fast)
python tests/run_tests.py

# Include integration tests (slower, requires network)
python tests/run_tests.py --integration

# Run individual test files
python tests/test_crawl.py          # Crawler tests only
python tests/test_generate_index.py  # Index generator tests only

How It Works

  1. Starts at the provided base URL
  2. Extracts all internal links from each page
  3. Filters links based on domain, language, and resource type
  4. Crawls pages concurrently in batches
  5. Saves each page as Markdown, preserving directory structure
  6. Repeats until all pages are crawled or limit is reached
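The loop above can be illustrated with a toy in-memory link graph, so no network is involved. SITE and crawl are illustrative only, not the crawler's real data structures:

```python
from collections import deque

# Toy link graph standing in for fetched pages: url -> outgoing internal links.
SITE = {
    "/": ["/quickstart", "/tutorial/"],
    "/quickstart": ["/"],
    "/tutorial/": ["/tutorial/basics", "/assets/logo.png"],
    "/tutorial/basics": ["/tutorial/"],
}


def crawl(start, max_pages=None):
    """Breadth-first crawl: visit each page once, skipping asset URLs."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        if max_pages is not None and len(order) >= max_pages:
            break
        url = queue.popleft()
        order.append(url)
        for link in SITE.get(url, []):
            if link in seen or link.startswith("/assets/"):
                continue  # already queued, or filtered as a non-doc resource
            seen.add(link)
            queue.append(link)
    return order


print(crawl("/"))  # ['/', '/quickstart', '/tutorial/', '/tutorial/basics']
```

In the real crawler, "fetching" a page also converts it to Markdown and saves it; here the graph lookup stands in for that step.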

Configuration

Default excluded languages

zh, ko, ja, de, es, fr, pt, ru, hi, ar, it, nl, tr, vi

Default skipped paths

search, assets, javascripts, stylesheets, _static, _images, static, js, css

Default skipped extensions

.png, .jpg, .jpeg, .gif, .svg, .pdf, .zip, .whl, .tar.gz, .ico, .webp
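Taken together, these defaults amount to a URL filter along the following lines. The constants are copied from the lists above; should_crawl is a hypothetical name, not necessarily how the crawler structures this check:

```python
from urllib.parse import urlparse

EXCLUDED_LANGS = {"zh", "ko", "ja", "de", "es", "fr", "pt", "ru",
                  "hi", "ar", "it", "nl", "tr", "vi"}
SKIPPED_PATHS = {"search", "assets", "javascripts", "stylesheets",
                 "_static", "_images", "static", "js", "css"}
SKIPPED_EXTS = (".png", ".jpg", ".jpeg", ".gif", ".svg", ".pdf",
                ".zip", ".whl", ".tar.gz", ".ico", ".webp")


def should_crawl(url: str) -> bool:
    """Apply the default filters: language prefix, skipped paths, skipped extensions."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if parts and parts[0] in EXCLUDED_LANGS:  # e.g. /zh/... -> skip
        return False
    if any(p in SKIPPED_PATHS for p in parts):
        return False
    return not url.lower().endswith(SKIPPED_EXTS)


print(should_crawl("https://docs.example.com/tutorial/basics"))  # True
print(should_crawl("https://docs.example.com/zh/tutorial"))      # False
print(should_crawl("https://docs.example.com/assets/logo.png"))  # False
```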

License

MIT
