Ar1sto/RavenHunter

RavenHunter: CIA Document Scraping Tool

RavenHunter is a tool designed to scrape all available documents published by the CIA under the Freedom of Information Act (FOIA). This guide covers installation, usage, and additional notes for advanced users, including special considerations when scraping `.gov` websites.

=== ATTENTION ===

The CIA actively blocks scraping traffic from the Tor network, and connections through VPN providers fail with timeout errors. I haven't had time to look into alternatives such as proxy rotation yet. If you want to support the project, please send me your criticism and suggestions.

Table of Contents

  1. Installation
  2. Usage
  3. Advanced Usage
  4. Notes on Scraping .gov Websites
  5. License

Installation

Requirements

Before you begin, ensure you have Python 3.6+ installed. Then, install the required dependencies by running:

pip install -r requirements.txt

Dependencies

  • requests: HTTP requests library.
  • tqdm: Progress bar library for Python.
  • colorama: Cross-platform support for colored terminal text.
  • playwright: For browser automation and scraping.
  • beautifulsoup4: HTML parsing library for scraping and navigating web pages.

If you haven't already installed the necessary browsers for Playwright, run:

playwright install

This downloads the browser binaries Playwright needs, including Chromium. Playwright drives a headless browser to render pages for scraping.


Usage

Running the Script

Run the script from the command line. To list the available options:

python RavenHunter.py --help

Example Usage

To begin scraping and storing found links, run:

python RavenHunter.py

For specific tasks, use the following flags:

  • Export found links:

    • Export links to a CSV or JSON file.
    • Example:
    python RavenHunter.py --export json
  • Download found documents:

    • Download all documents found in the scraping process.
    • Example:
    python RavenHunter.py -dl
  • Resume scraping:

    • Resume the scraping from the last page saved.
    • Example:
    python RavenHunter.py --resume

Command Line Options

Option      Description
-dl         Download found documents.
--timeout   Set the timeout in seconds between page scrapes. (Default: 10)
--resume    Resume scraping from the last scraped page.
--export    Export found links to either CSV or JSON.
--verbose   Display links while scraping.
--tor       Use the Tor proxy (127.0.0.1:9050).
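
As a rough illustration of how the options above fit together, the following sketch builds an equivalent parser with `argparse`. It is not taken from RavenHunter.py: the function name `build_parser`, the `download` attribute, and the `csv`/`json` choices list are assumptions derived from the table and the examples in this README.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options (illustrative sketch only)."""
    parser = argparse.ArgumentParser(prog="RavenHunter.py")
    # "-dl" is a single-dash, multi-letter flag; dest name is an assumption.
    parser.add_argument("-dl", action="store_true", dest="download",
                        help="Download found documents.")
    parser.add_argument("--timeout", type=int, default=10,
                        help="Seconds to wait between page scrapes.")
    parser.add_argument("--resume", action="store_true",
                        help="Resume scraping from the last scraped page.")
    parser.add_argument("--export", choices=["csv", "json"],
                        help="Export found links to CSV or JSON.")
    parser.add_argument("--verbose", action="store_true",
                        help="Display links while scraping.")
    parser.add_argument("--tor", action="store_true",
                        help="Use the Tor proxy at 127.0.0.1:9050.")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch is repeatable.
example_args = build_parser().parse_args(["-dl", "--timeout", "30"])
print(example_args.download, example_args.timeout)
```

Passing an explicit list to `parse_args` makes it easy to try flag combinations without invoking the real scraper.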

Advanced Usage

Scraping Multiple Pages

The script is designed to scrape multiple pages of documents. It uses Playwright to automate browser interaction and BeautifulSoup for parsing the HTML content. If you wish to scrape from a specific page or control the process, you can modify the resume_page variable in the script.
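
The README does not describe how the last scraped page is stored between runs. One plausible approach, sketched below, keeps the page number in a small JSON file. The helper names and the `resume_page` key inside the file are hypothetical and only mirror the variable name mentioned above.

```python
import json
from pathlib import Path


def save_resume_page(state_file: Path, page: int) -> None:
    """Persist the last scraped page number (hypothetical helper)."""
    state_file.write_text(json.dumps({"resume_page": page}), encoding="utf-8")


def load_resume_page(state_file: Path, default: int = 0) -> int:
    """Return the stored page number, or `default` if no usable state exists."""
    try:
        data = json.loads(state_file.read_text(encoding="utf-8"))
        return int(data["resume_page"])
    except (FileNotFoundError, KeyError, TypeError, ValueError):
        # Missing file, malformed JSON, or unexpected content: start over.
        return default
```

Falling back to a default instead of raising means a corrupted state file only costs a restart from the first page.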

🧅 Using RavenHunter with Tor (--tor)

If you want to route scraping through the Tor network, you can enable the --tor option (note that, as described under ATTENTION above, the CIA site currently blocks Tor traffic):

python RavenHunter.py --tor

This will use a SOCKS5 proxy at 127.0.0.1:9050 (default Tor config).

Important:

  • Make sure the Tor service is running locally before using this option.

  • Linux: Install and start Tor via:

sudo apt install tor
sudo systemctl start tor
  • Windows: Download and run the Tor Expert Bundle. Ensure that tor.exe is active (as a background process or service).

If the Tor service is not running, RavenHunter will exit with a clear error message and instructions.
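
A check like the one described above can be approximated by testing whether anything accepts TCP connections on the SOCKS port. This is a minimal sketch, not RavenHunter's actual check: the function name is hypothetical, and a successful connection only shows that the port is open, not that Tor itself is behind it. The `socks5h://` proxy mapping is the form the `requests` library accepts when its SOCKS extra is installed; whether RavenHunter configures its proxy this way is an assumption.

```python
import socket

# Default Tor SOCKS endpoint named in this README.
TOR_HOST = "127.0.0.1"
TOR_PORT = 9050

# Assumed proxy mapping for requests (needs the requests[socks] extra);
# "socks5h" asks the proxy to resolve host names as well.
TOR_PROXIES = {
    "http": f"socks5h://{TOR_HOST}:{TOR_PORT}",
    "https": f"socks5h://{TOR_HOST}:{TOR_PORT}",
}


def tor_proxy_reachable(host: str = TOR_HOST, port: int = TOR_PORT,
                        timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the SOCKS port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False
```

Calling `tor_proxy_reachable()` before starting a scrape lets the script fail early with a helpful message instead of timing out on the first request.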

Exporting Found Links

If you wish to export the found links without scraping new data, you can use the --export flag. This will create a CSV or JSON file with all the links that have been previously scraped.

Example Export Command

python RavenHunter.py --export csv

This command will generate a CSV file with all found links in the format:

Generated by *RavenHunter*

URL: https://www.cia.gov/readingroom/docs/CIA-RDP80B01676R000200020004-6.pdf
URL: https://www.cia.gov/readingroom/docs/CIA-RDP80B01676R000100150049-4.pdf
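
To make the layout above concrete, the sketch below renders a list of links in the same header-plus-`URL:` form. The function names are hypothetical, and the JSON layout (a plain array of URLs) is an assumption, since this README only shows the CSV output.

```python
import json
from typing import Iterable


def render_links_text(links: Iterable[str]) -> str:
    """Render links as a header line, a blank line, then one 'URL:' line each."""
    lines = ["Generated by *RavenHunter*", ""]
    lines.extend(f"URL: {link}" for link in links)
    return "\n".join(lines) + "\n"


def render_links_json(links: Iterable[str]) -> str:
    """Render links as a JSON array (assumed layout, not shown in the README)."""
    return json.dumps(list(links), indent=2)


sample = ["https://www.cia.gov/readingroom/docs/CIA-RDP80B01676R000200020004-6.pdf"]
print(render_links_text(sample), end="")
```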

Notes on Scraping .gov Websites

When scraping .gov websites, you must be mindful of the following:

  1. Respectful Usage:

    • Avoid excessive scraping that can overload the server. Always use a reasonable timeout between requests (e.g., 30 seconds).
    • Check the site’s robots.txt file for any scraping restrictions.
  2. Legal Considerations:

    • Ensure that your activities comply with local laws and government regulations when scraping publicly available data.
    • The Freedom of Information Act (FOIA) is often the basis for accessing documents from .gov domains, so make sure to follow the correct procedures.
  3. IP Blocking and Rate Limiting:

    • Some government sites may implement rate-limiting mechanisms (e.g., CAPTCHAs) or even block IP addresses for suspicious activity. Be prepared to handle these scenarios and implement workarounds if needed.
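
The points above can be sketched with the standard library alone: `urllib.robotparser` evaluates robots.txt rules, and a capped exponential backoff spaces out retries after a block or rate limit. The rules in the example are made up for illustration and are not the CIA site's actual robots.txt; the function names and the 30-second base delay are assumptions, the latter taken from the timeout suggested above.

```python
from typing import Iterable, List
from urllib.robotparser import RobotFileParser


def is_allowed(robots_lines: Iterable[str], user_agent: str, url: str) -> bool:
    """Check a URL against robots.txt rules that have already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


def backoff_delays(base: float = 30.0, factor: float = 2.0,
                   retries: int = 4, cap: float = 600.0) -> List[float]:
    """Return the wait in seconds before each retry, growing by `factor` up to `cap`."""
    return [min(base * factor ** attempt, cap) for attempt in range(retries)]


# Illustrative rules only; fetch the real file before scraping a live site.
example_rules = [
    "User-agent: *",
    "Disallow: /private/",
]
print(is_allowed(example_rules, "RavenHunter", "https://example.gov/private/x"))
print(backoff_delays())
```

For a live site, `RobotFileParser.set_url()` followed by `read()` fetches and parses the real file in place of a hard-coded rule list.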

License

RavenHunter is licensed under the MIT License. See the LICENSE file for more details.


Note: This tool is for educational and research purposes only. Always ensure you are adhering to all applicable laws and regulations when scraping or interacting with websites.
