bezhaile/ethical-scraper

Advanced Web Scraper (Windows-friendly)

A robust, ethically minded web scraper for collecting educational/religious content (e.g., Bible sites). It includes:

  • Respectful crawling with delays and robots.txt checks
  • 403 bypass attempts (archive cache, session simulation, UA rotation, bot UA, path variations)
  • Content extraction with heuristics and optional site-specific selectors
  • Safe file downloading with size/type checks and resume support
  • Optional conversions: HTML → PDF (WeasyPrint) and HTML → DOCX (python-docx)
  • Detailed logging and a comprehensive JSON report

Built for and tested with Windows PowerShell, with Windows-specific setup guidance (for example, GTK3 for WeasyPrint).


Folder Structure

scraper.py               # Main scraper CLI and library
test_403_bypass.py       # Standalone 403 bypass technique tester
weasyprint_installer.py  # Windows helper for installing WeasyPrint dependencies

Features

  • Stealth session: realistic headers, cookies, UA rotation
  • Proxy rotation (optional)
  • Security scanning for HTML and downloaded files
  • Content de-duplication via hashing
  • Threaded scraping option
  • Automatic reporting: scraping_report.json
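As an illustration of de-duplication via hashing, here is a minimal sketch. It assumes that "normalizing" means collapsing whitespace and lowercasing before hashing with SHA-256; the names content_fingerprint and Deduplicator are hypothetical and are not taken from scraper.py, which may normalize differently.

```python
import hashlib
import re


def content_fingerprint(text: str) -> str:
    """Return a SHA-256 hex digest of whitespace- and case-normalized text."""
    # Assumption: collapse runs of whitespace and ignore case before hashing.
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class Deduplicator:
    """Remembers fingerprints of text that has already been seen."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def is_duplicate(self, text: str) -> bool:
        """Return True if equivalent text was seen before; otherwise record it."""
        digest = content_fingerprint(text)
        if digest in self._seen:
            return True
        self._seen.add(digest)
        return False
```

With this approach, two pages whose text differs only in spacing or letter case map to the same fingerprint, so the second one would be skipped.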

Requirements

  • Python 3.10+ (Windows)
  • Recommended packages:
    • requests
    • beautifulsoup4
    • urllib3
    • python-docx (optional; for DOCX conversion)
    • weasyprint (optional; for PDF conversion)

If WeasyPrint is used on Windows, GTK3 is required.


Quick Start (PowerShell)

  1. Create and activate a virtual environment (optional but recommended):
python -m venv .venv
.\.venv\Scripts\Activate.ps1
  2. Install dependencies:
# Option A: install from requirements file
pip install -r .\requirements.txt

# Option B: install packages individually
pip install requests beautifulsoup4 urllib3 python-docx weasyprint
  • If you don’t need conversions, you can skip python-docx and weasyprint.
  • For WeasyPrint on Windows, run the helper script below.
  3. Run the scraper with defaults (ethical preset included in fallback):
python .\scraper.py "https://www.ethiopianorthodox.org/" --dest-folder "amharic_bible_files"
  4. Threaded scraping example:
python .\scraper.py "https://www.ethiopianorthodox.org/" --workers 2 --max-depth 2 --delay 2
  5. Site-specific content selectors:
python .\scraper.py "https://example.com/bible" --custom-selectors ".bible-text" ".content-body" ".main-content"

Common flags:

  • --dest-folder Output directory (default: scraped_content)
  • --max-depth Crawl depth (default: 2)
  • --delay Seconds between requests (default: 1.0)
  • --workers Thread pool size (default: 1)
  • --custom-selectors One or more CSS selectors to target main content
  • --no-robots Ignore robots.txt
  • --force-scrape Enable ethical force scrape (longer delays, limited workers, educational UA)
  • --no-dedup Disable content de-duplication
  • Security/stealth:
    • --use-proxies Enable proxies
    • --proxy-list List of proxy URLs (e.g. http://host:port)
    • --no-ua-rotation Disable UA rotation
    • --no-security-scan Disable security scan
    • --max-file-size Max bytes per file (default: 52428800, i.e. 50 MiB)
  • Conversion:
    • --no-pdf Disable HTML→PDF
    • --no-docx Disable HTML→DOCX
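For readers who want to see how the flags above fit together, the following is a hypothetical argparse reconstruction based only on this list. The flag names and defaults come from the list; the argument types, the nargs choices, and the help strings are assumptions, and the parser in scraper.py may be defined differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented flags (illustrative only)."""
    p = argparse.ArgumentParser(description="Ethically minded web scraper (sketch)")
    p.add_argument("url", help="Start URL to crawl")
    p.add_argument("--dest-folder", default="scraped_content", help="Output directory")
    p.add_argument("--max-depth", type=int, default=2, help="Crawl depth")
    p.add_argument("--delay", type=float, default=1.0, help="Seconds between requests")
    p.add_argument("--workers", type=int, default=1, help="Thread pool size")
    # Assumption: selectors and proxies are passed as space-separated values.
    p.add_argument("--custom-selectors", nargs="+", default=[], help="CSS selectors")
    p.add_argument("--no-robots", action="store_true", help="Ignore robots.txt")
    p.add_argument("--force-scrape", action="store_true", help="Ethical force scrape")
    p.add_argument("--no-dedup", action="store_true", help="Disable de-duplication")
    p.add_argument("--use-proxies", action="store_true", help="Enable proxies")
    p.add_argument("--proxy-list", nargs="+", default=[], help="Proxy URLs")
    p.add_argument("--no-ua-rotation", action="store_true", help="Disable UA rotation")
    p.add_argument("--no-security-scan", action="store_true", help="Disable security scan")
    p.add_argument("--max-file-size", type=int, default=52428800, help="Max bytes per file")
    p.add_argument("--no-pdf", action="store_true", help="Disable HTML to PDF")
    p.add_argument("--no-docx", action="store_true", help="Disable HTML to DOCX")
    return p


if __name__ == "__main__":
    print(build_parser().parse_args())
```

Run with --help, this sketch prints the same set of options as the list above, which can be a quick way to compare it against the real CLI.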

Output

  • Saved HTML files: name_depth.html
  • Optional converted files: name_depth.pdf, name_depth.docx
  • Extracted text: name_depth.txt including URL, timestamp, depth, content hash, and security status
  • Logs: scraper.log under the destination folder
  • Report: scraping_report.json with stats and file counts

WeasyPrint on Windows (PDF Conversion)

WeasyPrint requires GTK3. Use the helper to check and install prerequisites:

python .\weasyprint_installer.py

It will:

  • Check WeasyPrint import
  • Check for GTK3 DLL path
  • Check VC++ Redistributable
  • Offer to open installers in your browser

If WeasyPrint isn’t available, the scraper will skip PDF conversion and log guidance.
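The skip-and-log behavior described above can be sketched as an optional import. This is not the code from scraper.py: the function name html_to_pdf and the log wording are hypothetical. The sketch also assumes that a missing native library surfaces as OSError rather than ImportError, which is why both are caught.

```python
import logging
from pathlib import Path

logger = logging.getLogger("scraper")

try:
    # Assumption: missing GTK libraries tend to raise OSError at import time.
    from weasyprint import HTML
    WEASYPRINT_AVAILABLE = True
except (ImportError, OSError):
    HTML = None
    WEASYPRINT_AVAILABLE = False


def html_to_pdf(html_text: str, pdf_path: Path) -> bool:
    """Write html_text to pdf_path; return False and log guidance if unavailable."""
    if not WEASYPRINT_AVAILABLE:
        logger.warning(
            "WeasyPrint not available; skipping PDF conversion. "
            "Run weasyprint_installer.py and make sure GTK3 is on PATH."
        )
        return False
    HTML(string=html_text).write_pdf(str(pdf_path))
    return True
```

Returning a boolean instead of raising lets the caller continue saving the HTML and text outputs even when PDF conversion is not possible.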


403 Bypass Techniques (Standalone Tester)

To experiment with bypass strategies independently:

python .\test_403_bypass.py

It tries multiple techniques and saves the first successful HTML response to bypass_success_<timestamp>.html.
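The "try each technique, keep the first success" flow can be sketched as below. This is a hedged illustration, not the contents of test_403_bypass.py: the function first_successful, the (status, body) return shape of each technique, and the timestamp format are all assumptions. The techniques themselves are passed in as callables and are not implemented here.

```python
from datetime import datetime
from pathlib import Path
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical shape: a technique takes a URL and returns (HTTP status, body).
Technique = Callable[[str], Tuple[int, str]]


def first_successful(
    url: str,
    techniques: Iterable[Tuple[str, Technique]],
    out_dir: Path,
) -> Optional[Path]:
    """Run techniques in order; save and return the first non-empty 200 response."""
    for name, technique in techniques:
        try:
            status, body = technique(url)
        except Exception as exc:  # a failing technique should not stop the run
            print(f"{name}: failed ({exc})")
            continue
        if status == 200 and body.strip():
            # Assumption: the timestamp format is illustrative only.
            stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
            out_path = out_dir / f"bypass_success_{stamp}.html"
            out_path.write_text(body, encoding="utf-8")
            return out_path
        print(f"{name}: HTTP {status}")
    return None
```

Later techniques are not attempted once one succeeds, which keeps the number of requests sent to the target site to a minimum.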


Ethical Use

  • Respect robots.txt unless explicitly using --no-robots or --force-scrape for bona fide educational use.
  • Increase delays and limit threads when accessing sensitive or rate-limited sites.
  • Do not scrape personal data or violate terms of service. Use responsibly.
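A robots.txt check combined with a fixed delay can be sketched with the standard library alone. The class name PoliteGate and the user agent string used in the usage note are hypothetical, and scraper.py may implement these checks differently. The sketch parses robots.txt text that has already been downloaded, so it makes no network calls itself.

```python
import time
from typing import Optional
from urllib.robotparser import RobotFileParser


def build_robots(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class PoliteGate:
    """Combines a robots.txt permission check with a minimum delay."""

    def __init__(self, robots: RobotFileParser, user_agent: str, delay: float = 1.0):
        self.robots = robots
        self.user_agent = user_agent
        self.delay = delay
        self._last_request: Optional[float] = None

    def allowed(self, url: str) -> bool:
        """Return True if robots.txt permits this user agent to fetch url."""
        return self.robots.can_fetch(self.user_agent, url)

    def wait(self) -> None:
        """Sleep until at least `delay` seconds have passed since the last call."""
        if self._last_request is not None:
            remaining = self.delay - (time.monotonic() - self._last_request)
            if remaining > 0:
                time.sleep(remaining)
        self._last_request = time.monotonic()
```

A caller would check gate.allowed(url) first, then call gate.wait() immediately before each request, for example with gate = PoliteGate(build_robots(text), "EducationalScraper", delay=2.0).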

Troubleshooting

  • "WeasyPrint not available" → Run weasyprint_installer.py and ensure GTK3 is installed and in PATH, then restart PowerShell/VS Code.
  • "Expected file but got HTML" when downloading → The link is likely a page, not a direct file URL.
  • Many 403s → Try --force-scrape, add custom selectors, experiment with test_403_bypass.py, or use proxies.
  • SSL warnings → The bypass tests suppress these warnings; for production scraping, enable certificate verification.

Development Notes

  • The scraper uses requests and BeautifulSoup. Threading is via ThreadPoolExecutor.
  • Content de-duplication uses normalized text hashing.
  • Security validation checks headers and file signatures for PDFs and ZIPs.
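A file-signature check for PDFs and ZIPs can be sketched as follows. The magic bytes are the standard ones ("%PDF-" for PDF, "PK" followed by a two-byte record marker for ZIP); the function names, the extension mapping, and the crude HTML guess are assumptions made for illustration and may not match the checks in scraper.py.

```python
PDF_MAGIC = b"%PDF-"
# Local file header, empty archive, and spanned archive markers.
ZIP_MAGICS = (b"PK\x03\x04", b"PK\x05\x06", b"PK\x07\x08")


def sniff_type(head: bytes) -> str:
    """Guess a file type from its first bytes: 'pdf', 'zip', 'html', or 'unknown'."""
    if head.startswith(PDF_MAGIC):
        return "pdf"
    if head.startswith(ZIP_MAGICS):
        return "zip"
    # Heuristic: markup usually begins with '<' after optional whitespace.
    if head.lstrip()[:1] == b"<":
        return "html"
    return "unknown"


def signature_matches(head: bytes, extension: str) -> bool:
    """Return True if the sniffed type equals the expected extension (pdf or zip)."""
    return sniff_type(head) == extension.lower().lstrip(".")
```

A download that claims to be a .pdf but sniffs as "html" corresponds to the "Expected file but got HTML" case listed under Troubleshooting.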

License

MIT License. See LICENSE for full terms.

This project is intended for educational purposes. Please also respect third-party site terms of use and robots.txt when scraping.

