A robust, ethically-minded web scraper for collecting educational/religious content (e.g., Bible sites). It includes:
- Respectful crawling with delays and robots.txt checks
- 403 bypass attempts (archive cache, session simulation, UA rotation, bot UA, path variations)
- Content extraction with heuristics and optional site-specific selectors
- Safe file downloading with size/type checks and resume support
- Optional conversions: HTML → PDF (WeasyPrint) and HTML → DOCX (python-docx)
- Detailed logging and a comprehensive JSON report
Built for and tested with Windows PowerShell, with Windows-specific guidance (GTK3 for WeasyPrint).
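The robots.txt check and polite delay from the feature list can be sketched with the standard library alone. This is a minimal illustration, not the code in scraper.py: the `USER_AGENT` string, the helper names, and the sample robots.txt are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "EducationalScraper/1.0"  # hypothetical UA string, not taken from scraper.py


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_delay(parser: RobotFileParser, default_delay: float = 1.0) -> float:
    """Use the site's Crawl-delay when it is longer than our own default."""
    site_delay = parser.crawl_delay(USER_AGENT)
    if site_delay is None:
        return default_delay
    return max(default_delay, float(site_delay))


# Sample rules used only for this illustration.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 3
"""

parser = build_parser(SAMPLE_ROBOTS)
allowed = parser.can_fetch(USER_AGENT, "https://example.com/bible/genesis")
blocked = parser.can_fetch(USER_AGENT, "https://example.com/private/admin")
delay = polite_delay(parser)
```

In a real crawl the robots.txt text would be fetched once per host and the resulting delay applied between requests to that host.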
scraper.py # Main scraper CLI and library
test_403_bypass.py # Standalone 403 bypass technique tester
weasyprint_installer.py # Windows helper for installing WeasyPrint dependencies
- Stealth session: realistic headers, cookies, UA rotation
- Proxy rotation (optional)
- Security scanning for HTML and downloaded files
- Content de-duplication via hashing
- Threaded scraping option
- Automatic reporting:
scraping_report.json
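Content de-duplication via hashing could look roughly like the sketch below. The normalization steps (collapsing whitespace, lowercasing) and the choice of SHA-256 are assumptions for illustration; the scraper's actual normalization may differ.

```python
import hashlib
import re


def content_hash(text: str) -> str:
    """Hash text after collapsing whitespace and lowercasing (assumed normalization)."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class Deduplicator:
    """Remember hashes of pages already saved during this run."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def is_new(self, text: str) -> bool:
        """Return True the first time a piece of content is seen, False afterwards."""
        digest = content_hash(text)
        if digest in self._seen:
            return False
        self._seen.add(digest)
        return True
```

Two pages whose text differs only in spacing or letter case then count as one, so only the first would be saved.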
- Python 3.10+ (Windows)
- Recommended packages:
  - requests
  - beautifulsoup4
  - urllib3
  - python-docx (optional; for DOCX conversion)
  - weasyprint (optional; for PDF conversion)
If WeasyPrint is used on Windows, GTK3 is required.
- Create and activate a virtual environment (optional but recommended):
python -m venv .venv
.\.venv\Scripts\Activate.ps1
- Install dependencies:
# Option A: install from requirements file
pip install -r .\requirements.txt
# Option B: install packages individually
pip install requests beautifulsoup4 urllib3 python-docx weasyprint
- If you don’t need conversions, you can skip python-docx and weasyprint.
- For WeasyPrint on Windows, run the helper script below.
- Run the scraper with defaults (ethical preset included in fallback):
python .\scraper.py "https://www.ethiopianorthodox.org/" --dest-folder "amharic_bible_files"
- Threaded scraping example:
python .\scraper.py "https://www.ethiopianorthodox.org/" --workers 2 --max-depth 2 --delay 2
- Site-specific content selectors:
python .\scraper.py "https://example.com/bible" --custom-selectors ".bible-text" ".content-body" ".main-content"
Common flags:
- --dest-folder: Output directory (default: scraped_content)
- --max-depth: Crawl depth (default: 2)
- --delay: Seconds between requests (default: 1.0)
- --workers: Thread pool size (default: 1)
- --custom-selectors: One or more CSS selectors to target main content
- --no-robots: Ignore robots.txt
- --force-scrape: Enable ethical force scrape (longer delays, limited workers, educational UA)
- --no-dedup: Disable content de-duplication
- Security/stealth:
  - --use-proxies: Enable proxies
  - --proxy-list: List of proxy URLs (e.g. http://host:port)
  - --no-ua-rotation: Disable UA rotation
  - --no-security-scan: Disable security scan
  - --max-file-size: Max bytes per file (default: 52428800)
- Conversion:
  - --no-pdf: Disable HTML→PDF
  - --no-docx: Disable HTML→DOCX
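To make the flag shapes concrete, here is a hedged argparse sketch covering a subset of the flags listed above. It mirrors only the documented names and defaults. How scraper.py actually declares its arguments is not shown in this README, so the structure and the `build_arg_parser` name are assumptions.

```python
import argparse


def build_arg_parser() -> argparse.ArgumentParser:
    """Declare a subset of the documented flags (illustrative, not the real CLI)."""
    p = argparse.ArgumentParser(description="Illustrative subset of the scraper flags")
    p.add_argument("url", help="Start URL to crawl")
    p.add_argument("--dest-folder", default="scraped_content")
    p.add_argument("--max-depth", type=int, default=2)
    p.add_argument("--delay", type=float, default=1.0)
    p.add_argument("--workers", type=int, default=1)
    # nargs="+" lets one flag accept several selectors, as in the README example.
    p.add_argument("--custom-selectors", nargs="+", default=[])
    p.add_argument("--no-robots", action="store_true")
    p.add_argument("--max-file-size", type=int, default=52_428_800)
    return p


args = build_arg_parser().parse_args(
    [
        "https://example.com/bible",
        "--custom-selectors", ".bible-text", ".content-body",
        "--workers", "2",
    ]
)
```

The default for --max-file-size, 52428800 bytes, is exactly 50 MiB (50 × 1024 × 1024).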
- Saved HTML files: name_depth.html
- Optional converted files: name_depth.pdf, name_depth.docx
- Extracted text: name_depth.txt, including URL, timestamp, depth, content hash, and security status
- Logs: scraper.log under the destination folder
- Report: scraping_report.json with stats and file counts
WeasyPrint requires GTK3. Use the helper to check and install prerequisites:
python .\weasyprint_installer.py
It will:
- Check WeasyPrint import
- Check for GTK3 DLL path
- Check VC++ Redistributable
- Offer to open installers in your browser
If WeasyPrint isn’t available, the scraper will skip PDF conversion and log guidance.
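The skip-and-log behaviour described above is commonly implemented as a guarded import. The sketch below assumes that pattern. The helper names and log message are hypothetical, and `OSError` is caught as well because WeasyPrint can fail at import time when its native libraries cannot be loaded.

```python
import importlib
import logging

logger = logging.getLogger("scraper")


def load_weasyprint(importer=importlib.import_module):
    """Return the weasyprint module, or None if it (or GTK3) is unavailable."""
    try:
        return importer("weasyprint")
    except (ImportError, OSError) as exc:
        logger.warning(
            "WeasyPrint not available (%s); skipping PDF conversion. "
            "Run weasyprint_installer.py and make sure GTK3 is on PATH.",
            exc,
        )
        return None


def html_to_pdf(html: str, out_path: str, weasyprint=None) -> bool:
    """Write a PDF if WeasyPrint was loaded; otherwise report that nothing was written."""
    if weasyprint is None:
        return False
    weasyprint.HTML(string=html).write_pdf(out_path)
    return True
```

A caller would run `load_weasyprint()` once at startup and pass the result to `html_to_pdf`, so a missing dependency costs one warning rather than one failure per page.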
To experiment with bypass strategies independently:
python .\test_403_bypass.py
It tries multiple techniques and saves the first successful HTML response to bypass_success_<timestamp>.html.
- Respect robots.txt unless explicitly using --no-robots or --force-scrape for bona fide educational use.
- Increase delays and limit threads when accessing sensitive or rate-limited sites.
- Do not scrape personal data or violate terms of service. Use responsibly.
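One way to combine a small thread pool with a delay that holds across all threads is a shared rate limiter. This is a sketch under assumptions: `RateLimiter`, `crawl`, and the `fetch` callable are hypothetical names, and the real scraper may throttle differently.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class RateLimiter:
    """Space out request starts by at least `min_interval` seconds across threads."""

    def __init__(self, min_interval: float) -> None:
        self._min_interval = min_interval
        self._lock = threading.Lock()
        self._next_allowed = 0.0

    def wait(self) -> None:
        # Reserve the next time slot under the lock, then sleep outside it
        # so other threads can reserve their own slots meanwhile.
        with self._lock:
            now = time.monotonic()
            wait_for = max(0.0, self._next_allowed - now)
            self._next_allowed = max(now, self._next_allowed) + self._min_interval
        if wait_for > 0:
            time.sleep(wait_for)


def crawl(urls, fetch, workers: int = 2, delay: float = 2.0):
    """Fetch URLs with a thread pool while honouring one global delay."""
    limiter = RateLimiter(delay)

    def task(url):
        limiter.wait()
        return fetch(url)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(task, urls))  # results keep the input order
```

With this design, raising the worker count does not increase the request rate against the site; it only overlaps parsing and disk work with network waits.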
- "WeasyPrint not available" → Run weasyprint_installer.py and ensure GTK3 is installed and in PATH, then restart PowerShell/VS Code.
- "Expected file but got HTML" when downloading → The link is likely a page, not a direct file URL.
- Many 403s → Try --force-scrape, add custom selectors, experiment with test_403_bypass.py, or use proxies.
- SSL warnings → These are suppressed in bypass tests; for production scraping you can enable verification.
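The "Expected file but got HTML" case and the size limit can both be checked from response headers before any bytes are saved. The sketch below is an assumption about how such a check might look: `check_download` and the `ALLOWED_TYPES` allow-list are hypothetical, and only the 52428800-byte default comes from the flag list.

```python
from typing import Mapping

MAX_FILE_SIZE = 52_428_800  # documented default for --max-file-size (50 MiB)
ALLOWED_TYPES = {"application/pdf", "application/zip"}  # hypothetical allow-list


def check_download(
    headers: Mapping[str, str], max_bytes: int = MAX_FILE_SIZE
) -> tuple[bool, str]:
    """Decide from response headers whether a download should proceed.

    Note: a plain dict is case-sensitive; requests' response.headers is not.
    """
    content_type = headers.get("Content-Type", "").split(";")[0].strip().lower()
    if content_type in {"text/html", "application/xhtml+xml"}:
        return False, "Expected file but got HTML"
    if content_type not in ALLOWED_TYPES:
        return False, f"Unsupported content type: {content_type or 'unknown'}"
    length = headers.get("Content-Length")
    if length is not None and length.isdigit() and int(length) > max_bytes:
        return False, "File exceeds size limit"
    return True, "ok"
```

Servers can omit or misreport Content-Length, so a streaming download would still need to count bytes as they arrive.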
- The scraper uses requests and BeautifulSoup. Threading is via ThreadPoolExecutor.
- Content de-duplication uses normalized text hashing.
- Security validation checks headers and file signatures for PDFs and ZIPs.
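A file-signature check for PDFs and ZIPs compares the first bytes of a download with known magic numbers. The sketch below uses the standard signatures for both formats. The `matches_signature` helper and its treatment of unknown extensions are assumptions, not the scraper's own code.

```python
# Leading "magic" bytes for the two formats mentioned above.
SIGNATURES = {
    "pdf": (b"%PDF-",),
    # Regular, empty, and spanned ZIP archives respectively.
    "zip": (b"PK\x03\x04", b"PK\x05\x06", b"PK\x07\x08"),
}


def matches_signature(data: bytes, extension: str) -> bool:
    """Return True if `data` starts with a known signature for `extension`.

    Unknown extensions return False, i.e. they are treated as unverified.
    """
    prefixes = SIGNATURES.get(extension.lower().lstrip("."))
    if prefixes is None:
        return False
    return data.startswith(prefixes)
```

For example, an HTML error page saved under a .pdf name fails this check because it does not begin with `%PDF-`.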
MIT License. See LICENSE for full terms.
This project is intended for educational purposes. Please also respect third-party site terms of use and robots.txt when scraping.
- WeasyPrint: https://weasyprint.org/
- GTK for Windows Runtime: https://github.com/tschoonj/GTK-for-Windows-Runtime-Environment-Installer/releases
- BeautifulSoup (bs4): https://www.crummy.com/software/BeautifulSoup/