Advanced Web Scraper (Windows-friendly)

A robust, ethically minded web scraper for collecting educational/religious content (e.g., Bible sites). It includes:

Respectful crawling with delays and robots.txt checks
403 bypass attempts (archive cache, session simulation, UA rotation, bot UA, path variations)
Content extraction with heuristics and optional site-specific selectors
Safe file downloading with size/type checks and resume support
Optional conversions: HTML → PDF (WeasyPrint) and HTML → DOCX (python-docx)
Detailed logging and a comprehensive JSON report

Built for Windows PowerShell users, with Windows-specific setup guidance (GTK3 for WeasyPrint).

Folder Structure

scraper.py               # Main scraper CLI and library
test_403_bypass.py       # Standalone 403 bypass technique tester
weasyprint_installer.py  # Windows helper for installing WeasyPrint dependencies
requirements.txt         # Dependency list used in Quick Start

Features

Stealth session: realistic headers, cookies, UA rotation
Proxy rotation (optional)
Security scanning for HTML and downloaded files
Content de-duplication via hashing
Threaded scraping option
Automatic reporting: scraping_report.json
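
As an illustration of content de-duplication via hashing, here is a minimal sketch. It is not taken from scraper.py: the normalization rules (collapse whitespace, lowercase), the choice of SHA-256, and the names content_hash and Deduplicator are all assumptions made for this example.

```python
import hashlib
import re


def content_hash(text: str) -> str:
    """Hash text after normalizing it (assumed rules: collapse whitespace, lowercase)."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class Deduplicator:
    """Remembers hashes of text already seen (hypothetical helper, not from scraper.py)."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def is_duplicate(self, text: str) -> bool:
        """Return True if equivalent text was seen before; otherwise record it."""
        digest = content_hash(text)
        if digest in self._seen:
            return True
        self._seen.add(digest)
        return False
```

With this approach, two pages whose text differs only in spacing or letter case hash to the same value, so the second one can be skipped.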

Requirements

Python 3.10+ (Windows)
Recommended packages:
- requests
- beautifulsoup4
- urllib3
- python-docx (optional; for DOCX conversion)
- weasyprint (optional; for PDF conversion)

If WeasyPrint is used on Windows, GTK3 is required.

Quick Start (PowerShell)

Create and activate a virtual environment (optional but recommended):

python -m venv .venv
.\.venv\Scripts\Activate.ps1

Install dependencies:

# Option A: install from requirements file
pip install -r .\requirements.txt

# Option B: install packages individually
pip install requests beautifulsoup4 urllib3 python-docx weasyprint

If you don’t need conversions, you can skip python-docx and weasyprint.
For WeasyPrint on Windows, run the helper script below.

Run the scraper with default settings (the fallback configuration includes the ethical preset):

python .\scraper.py "https://www.ethiopianorthodox.org/" --dest-folder "amharic_bible_files"

Threaded scraping example:

python .\scraper.py "https://www.ethiopianorthodox.org/" --workers 2 --max-depth 2 --delay 2

Site-specific content selectors:

python .\scraper.py "https://example.com/bible" --custom-selectors ".bible-text" ".content-body" ".main-content"

Common flags:

--dest-folder: Output directory (default: scraped_content)
--max-depth: Crawl depth (default: 2)
--delay: Seconds between requests (default: 1.0)
--workers: Thread pool size (default: 1)
--custom-selectors: One or more CSS selectors that target the main content
--no-robots: Ignore robots.txt
--force-scrape: Enable ethical force scrape (longer delays, limited workers, educational UA)
--no-dedup: Disable content de-duplication

Security/stealth:
- --use-proxies: Enable proxies
- --proxy-list: List of proxy URLs (e.g. http://host:port)
- --no-ua-rotation: Disable UA rotation
- --no-security-scan: Disable security scan
- --max-file-size: Max bytes per file (default: 52428800, i.e. 50 MiB)

Conversion:
- --no-pdf: Disable HTML→PDF
- --no-docx: Disable HTML→DOCX

Output

Saved HTML files: name_depth.html
Optional converted files: name_depth.pdf, name_depth.docx
Extracted text: name_depth.txt, which includes the URL, timestamp, depth, content hash, and security status
Logs: scraper.log under the destination folder
Report: scraping_report.json with stats and file counts

WeasyPrint on Windows (PDF Conversion)

WeasyPrint requires GTK3. Use the helper to check and install prerequisites:

python .\weasyprint_installer.py

It will:

Check WeasyPrint import
Check for GTK3 DLL path
Check VC++ Redistributable
Offer to open installers in your browser

If WeasyPrint isn’t available, the scraper will skip PDF conversion and log guidance.

403 Bypass Techniques (Standalone Tester)

To experiment with bypass strategies independently:

python .\test_403_bypass.py

It tries multiple techniques and saves the first successful HTML response to bypass_success_<timestamp>.html.

Ethical Use

Respect robots.txt unless explicitly using --no-robots or --force-scrape for bona fide educational use.
Increase delays and limit threads when accessing sensitive or rate-limited sites.
Do not scrape personal data or violate terms of service. Use responsibly.
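
The robots.txt and delay guidance above can be sketched with the standard library's urllib.robotparser. This is an illustrative example only, not the logic in scraper.py: the helper names make_robot_parser and polite_delay and the "EduBot" user agent are hypothetical. The rules are parsed from a string so the example runs offline; against a live site you would typically call set_url() and read() on the parser instead.

```python
from urllib.robotparser import RobotFileParser


def make_robot_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_delay(parser: RobotFileParser, user_agent: str, default_delay: float) -> float:
    """Use the site's Crawl-delay when it is longer than our own default delay."""
    crawl_delay = parser.crawl_delay(user_agent)  # None when the site sets no delay
    if crawl_delay is None:
        return default_delay
    return max(default_delay, float(crawl_delay))


# Example rules, invented for illustration.
EXAMPLE_ROBOTS = "User-agent: *\nDisallow: /private/\nCrawl-delay: 5\n"

parser = make_robot_parser(EXAMPLE_ROBOTS)
allowed = parser.can_fetch("EduBot", "https://example.com/bible/genesis")
blocked = parser.can_fetch("EduBot", "https://example.com/private/notes")
wait_seconds = polite_delay(parser, "EduBot", default_delay=1.0)
```

A crawler built this way would skip any URL for which can_fetch returns False and sleep for wait_seconds between requests.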

Troubleshooting

"WeasyPrint not available" → Run weasyprint_installer.py, ensure GTK3 is installed and on PATH, then restart PowerShell/VS Code.
"Expected file but got HTML" when downloading → The link is likely a page, not a direct file URL.
Many 403s → Try --force-scrape, add custom selectors, experiment with test_403_bypass.py, or use proxies.
SSL warnings → These are suppressed in the bypass tests; for production scraping, enable certificate verification.
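
To make the "Expected file but got HTML" case concrete, here is a rough sketch of how the first bytes of a response can distinguish a real PDF or ZIP from an HTML page. The function name sniff_file_type is hypothetical and the HTML heuristic is an assumption; the actual checks in scraper.py may differ.

```python
PDF_MAGIC = b"%PDF-"
# Local file header, empty archive, and spanned archive markers.
ZIP_MAGICS = (b"PK\x03\x04", b"PK\x05\x06", b"PK\x07\x08")


def sniff_file_type(head: bytes) -> str:
    """Classify a download from its leading bytes: 'pdf', 'zip', 'html', or 'unknown'."""
    if head.startswith(PDF_MAGIC):
        return "pdf"
    if head.startswith(ZIP_MAGICS):
        return "zip"
    # Assumed heuristic: HTML pages usually open with a doctype or <html> tag.
    lowered = head.lstrip()[:64].lower()
    if lowered.startswith((b"<!doctype html", b"<html")):
        return "html"
    return "unknown"
```

If a link that should point to a PDF yields "html", the URL most likely leads to a landing page rather than the file itself.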

Development Notes

The scraper uses requests and BeautifulSoup. Threading uses concurrent.futures.ThreadPoolExecutor.
Content de-duplication uses normalized text hashing.
Security validation checks headers and file signatures for PDFs and ZIPs.
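
For readers unfamiliar with ThreadPoolExecutor, the threaded mode might look roughly like the sketch below. It is an assumption-laden illustration, not code from scraper.py: scrape_all is a hypothetical name, and the fetch callable is passed in so the example stays independent of any HTTP library.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable


def scrape_all(
    urls: list[str],
    fetch: Callable[[str], str],
    workers: int = 2,
) -> dict[str, str]:
    """Fetch every URL on a small thread pool and collect results by URL."""
    results: dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Map each future back to its URL so failures can be attributed.
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                # One failed page should not stop the remaining pages.
                results[url] = f"error: {exc}"
    return results
```

Keeping workers small, as the --workers default of 1 suggests, limits the load placed on the target site.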

License

MIT License. See LICENSE for full terms.

This project is intended for educational purposes. Please also respect third-party site terms of use and robots.txt when scraping.

Acknowledgements

WeasyPrint: https://weasyprint.org/
GTK for Windows Runtime: https://github.com/tschoonj/GTK-for-Windows-Runtime-Environment-Installer/releases
BeautifulSoup (bs4): https://www.crummy.com/software/BeautifulSoup/
