A fast and cross-platform CLI tool for scraping websites and downloading media files. Crawl intelligently handles both static HTML pages and JavaScript-rendered Single Page Applications (SPAs) with automatic detection and hybrid mode support.
- Hybrid Scraping Engine: Automatically detects and switches between static HTTP requests and headless Chrome for SPAs
- Smart Content Extraction: Extract specific content using CSS selectors
- Media Downloader: Download images, videos, and audio files with concurrent workers
- Multiple Output Formats: Save results as JSON, CSV, HTML, Markdown, or plain text
- Flexible Configuration: Configure via CLI flags, YAML config files, or environment variables
- Proxy Support: HTTP and SOCKS5 proxy support for anonymity
- Custom Headers: Add authentication tokens, user agents, and custom headers
- Rate Limiting: Built-in rate limiting to respect server resources
- Retry Logic: Automatic retry with exponential backoff for failed requests
- Progress Tracking: Real-time progress bars for long-running operations
- Caching: Intelligent caching to avoid redundant requests
Download the latest release for your platform from the releases page.
Requires Go 1.23 or later:
```bash
git clone https://github.com/zulfikawr/crawl.git
cd crawl
go build -o crawl ./cmd/crawl
```

Or install directly with Go:

```bash
go install github.com/zulfikawr/crawl/cmd/crawl@latest
```

Basic scraping:

```bash
# Scrape a web page (auto-detects static vs SPA)
crawl get https://example.com

# Extract specific content with a CSS selector
crawl get https://example.com --selector=".article-title"

# Save output to a file
crawl get https://example.com --output=data.json

# Force static mode for faster scraping
crawl get https://example.com --mode=static

# Force SPA mode for JavaScript-heavy sites
crawl get https://news.ycombinator.com --mode=spa
```

Downloading media:

```bash
# Download all images from a page
crawl media https://example.com --type=image

# Download videos with custom concurrency
crawl media https://example.com --type=video --concurrency=10

# Download all media types to a specific directory
crawl media https://example.com --type=all --output=./my-downloads

# Handle JavaScript-rendered media galleries
crawl media https://spa-gallery.com --mode=spa --type=image
```

Advanced options:

```bash
# Add custom headers (e.g., authentication)
crawl get https://api.example.com -H "Authorization: Bearer token123"

# Use a proxy
crawl get https://example.com --proxy=http://proxy.example.com:8080

# Set a custom timeout
crawl get https://slow-site.com --timeout=60s

# Wait for dynamic content to load
crawl get https://dynamic-site.com --wait=5

# Extract CSV data with custom fields
crawl get https://example.com --output=data.csv --fields="title=.title,price=.price"

# Enable verbose logging for debugging
crawl get https://example.com --verbose
```

**`crawl get`**: Retrieve and extract content from a URL.
Usage:

```bash
crawl get <url> [flags]
```

Flags:

- `-m, --mode string`: Force engine mode: `auto`, `static`, or `spa` (default `"auto"`)
- `-s, --selector string`: CSS selector to extract (default `"body"`)
- `-o, --output string`: File path to save output (supports `.json`, `.txt`, `.html`, `.csv`, `.md`)
- `-H, --header stringArray`: Custom headers (can be specified multiple times)
- `--wait int`: Seconds to wait after the page loads before scraping
- `--fields string`: Comma-separated fields for CSV export (e.g., `"name=.name,price=.price"`)
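The `--fields` value pairs each CSV column name with a CSS selector. A minimal parser for that `name=selector,name=selector` syntax might look like the sketch below; the `parseFields` name and error handling are illustrative, not Crawl's internals:

```go
package main

import (
	"fmt"
	"strings"
)

// fieldSpec pairs one CSV column with the CSS selector that fills it.
type fieldSpec struct {
	Column   string
	Selector string
}

// parseFields turns "title=.title,price=.price" into an ordered
// list of (column, selector) pairs for CSV export.
func parseFields(spec string) ([]fieldSpec, error) {
	var fields []fieldSpec
	for _, part := range strings.Split(spec, ",") {
		col, sel, ok := strings.Cut(strings.TrimSpace(part), "=")
		if !ok || col == "" || sel == "" {
			return nil, fmt.Errorf("invalid field spec %q", part)
		}
		fields = append(fields, fieldSpec{Column: col, Selector: sel})
	}
	return fields, nil
}

func main() {
	fields, err := parseFields("title=.title,price=.price")
	if err != nil {
		panic(err)
	}
	for _, f := range fields {
		fmt.Printf("%s -> %s\n", f.Column, f.Selector)
	}
}
```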
**`crawl media`**: Download media files from a URL.

Usage:

```bash
crawl media <url> [flags]
```

Flags:

- `-t, --type string`: Media type: `image`, `video`, `audio`, or `all` (default `"all"`)
- `-o, --output string`: Directory to save files (default `"./downloads"`)
- `-c, --concurrency int`: Number of concurrent workers (1-50) (default `5`)
- `-m, --mode string`: Scraper mode: `auto`, `static`, or `spa` (default `"auto"`)
- `-H, --header stringArray`: Custom headers
- `--wait int`: Seconds to wait after the page loads
Global flags:

- `--config string`: Path to YAML configuration file
- `--proxy string`: HTTP/SOCKS5 proxy URL
- `--timeout string`: Hard timeout for requests (default `"30s"`)
- `--user-agent string`: Custom user agent string
- `-v, --verbose`: Enable debug logging
- `-q, --quiet`: Suppress all output except errors
- `--json`: Output in JSON format only
- `-h, --help`: Show help
- `--version`: Show version
Create a `crawl.yaml` file in your working directory or specify one with `--config`:

```yaml
# Logging
log_level: info
json_log: false

# HTTP settings
http_timeout: 30s
user_agent: "Crawl/1.0 (https://github.com/zulfikawr/crawl)"

# Cache settings
cache_ttl: 5m
cache_max_size_bytes: 104857600 # 100 MB

# Browser settings (for SPA mode)
browser_pool_size: 3
browser_headless: true
chrome_path: "" # Auto-detect if empty
```

See `internal/config/examples/crawl.yaml` for a complete example.
Configuration values can also be set via environment variables:

```bash
export CRAWL_LOG_LEVEL=debug
export CRAWL_HTTP_TIMEOUT=60s
export CRAWL_PROXY=http://proxy.example.com:8080
```

Crawl uses a modular architecture with three scraping engines:
- Static Engine: Fast HTTP-based scraping for traditional HTML pages
- Dynamic Engine: Headless Chrome for JavaScript-rendered SPAs
- Hybrid Engine: Automatically detects and switches between modes
```
crawl/
├── cmd/crawl/          # CLI entry point
├── internal/
│   ├── app/            # Application initialization
│   ├── cache/          # HTTP caching layer
│   ├── cli/            # Command definitions
│   ├── config/         # Configuration management
│   ├── downloader/     # Media download logic
│   ├── engine/         # Scraping engines
│   │   ├── static/     # Static HTML scraper
│   │   ├── dynamic/    # Headless browser scraper
│   │   ├── hybrid/     # Auto-detection logic
│   │   ├── batch/      # Batch processing
│   │   └── metadata/   # Metadata extraction
│   ├── proxy/          # Proxy pool management
│   ├── ratelimit/      # Rate limiting
│   ├── retry/          # Retry logic
│   ├── ui/             # Terminal UI components
│   └── utils/          # Utility functions
└── pkg/models/         # Shared data models
```
- Go: 1.23 or later (for building from source)
- Chrome/Chromium: Required for SPA mode (automatically detected, or specify with `chrome_path`)
- Operating Systems: Linux, macOS, Windows
```bash
# Run all tests
go test ./...

# Run tests with coverage
go test -cover ./...

# Run specific package tests
go test ./internal/engine/static/
```

```bash
# Build for current platform
go build -o crawl ./cmd/crawl

# Build for all platforms
GOOS=linux GOARCH=amd64 go build -o crawl-linux-amd64 ./cmd/crawl
GOOS=darwin GOARCH=amd64 go build -o crawl-darwin-amd64 ./cmd/crawl
GOOS=windows GOARCH=amd64 go build -o crawl-windows-amd64.exe ./cmd/crawl
```

- Price Monitoring: Extract product prices and availability
- Content Aggregation: Collect articles, blog posts, or news
- Data Mining: Extract structured data from websites
- Media Archival: Download images and videos from galleries
- Web Testing: Validate page content and structure
- Research: Collect data for analysis and research projects
- Use Static Mode: Where possible, use `--mode=static` for 10-100x faster scraping
- Increase Concurrency: For media downloads, raise `--concurrency` based on your bandwidth
- Enable Caching: Configure cache settings to avoid redundant requests
- Use Specific Selectors: Prefer precise CSS selectors over `body` to reduce data processing
- Adjust Timeouts: Set appropriate timeouts based on site speed
If SPA mode fails with a Chrome error:

```yaml
# Specify the Chrome path in config
chrome_path: "/usr/bin/chromium-browser"
```

```bash
# Or use an environment variable
export CRAWL_CHROME_PATH=/usr/bin/google-chrome
```

If requests are being rate-limited or blocked:

```bash
# Use proxy rotation
crawl get https://example.com --proxy=http://proxy1.example.com:8080
```

```yaml
# Add delays between requests (via config)
rate_limit_requests: 10
rate_limit_duration: 1s
```

If dynamic content is missing from results:

```bash
# Increase wait time
crawl get https://example.com --wait=10

# Force SPA mode
crawl get https://example.com --mode=spa
```

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, open an issue first to discuss what you would like to change.
- Fork the repository
- Create your feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add some amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Built with Cobra for the CLI framework
- Uses chromedp for headless browsing
- Uses goquery for HTML parsing
- Powered by zerolog for structured logging
- Issues: GitHub Issues
- Discussions: GitHub Discussions