RavenHunter: CIA Document Scraping Tool

RavenHunter is a tool designed to scrape every document the CIA has published under the Freedom of Information Act (FOIA). This guide covers installation, usage, and notes for advanced users, including special considerations when scraping `.gov` websites.

=== ATTENTION ===

The CIA website actively blocks scraping traffic that arrives from the Tor network, and requests sent through VPN providers fail with timeout errors. I haven't had time to look into alternatives such as proxy rotation yet. If you want to support me, please send me your criticism and suggestions.

Installation

Requirements

Before you begin, ensure you have Python 3.6+ installed. Then, install the required dependencies by running:

pip install -r requirements.txt

Dependencies

requests: HTTP requests library.
tqdm: Progress bar library for Python.
colorama: Cross-platform support for colored terminal text.
playwright: For browser automation and scraping.
beautifulsoup4: HTML parsing library for scraping and navigating web pages.

If you haven't already installed the necessary browsers for Playwright, run:

playwright install

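This downloads the browser binaries Playwright needs, including the Chromium build that RavenHunter runs in headless mode to render pages for scraping. To download Chromium only, run `playwright install chromium` instead.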

Usage

Running the Script

Run the script from the command line. To list the available options:

python RavenHunter.py --help

Example Usage

To begin scraping and storing found links, run:

python RavenHunter.py

For specific tasks, use the following flags:

Export found links:
- Export links to a CSV or JSON file.
- Example:
```
python RavenHunter.py --export json
```
Download found documents:
- Download all documents found in the scraping process.
- Example:
```
python RavenHunter.py -dl
```
Resume scraping:
- Resume the scraping from the last page saved.
- Example:
```
python RavenHunter.py --resume
```
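
As a rough illustration of how resuming can work, the sketch below saves the last scraped page number to a small JSON file and reads it back on the next run. The file name `ravenhunter_state.json` and both helper functions are hypothetical; RavenHunter itself may store its progress differently.

```python
import json
from pathlib import Path

# Hypothetical state file; the real script may persist progress elsewhere.
STATE_FILE = Path("ravenhunter_state.json")


def save_resume_page(page: int, path: Path = STATE_FILE) -> None:
    """Record the last page that was scraped successfully."""
    path.write_text(json.dumps({"resume_page": page}), encoding="utf-8")


def load_resume_page(path: Path = STATE_FILE) -> int:
    """Return the saved page number, or 0 when no usable state exists."""
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
        return int(data["resume_page"])
    except (FileNotFoundError, KeyError, ValueError, TypeError):
        # Missing file, malformed JSON, or unexpected layout: start over.
        return 0
```

Writing the state after every page keeps the amount of repeated work small if the run is interrupted.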

Command Line Options

| Option | Description |
| --- | --- |
| `-dl` | Downloads found documents. |
| `--timeout` | Set timeout in seconds between page scrapes. (Default: 10) |
| `--resume` | Resumes scraping from the last scraped page. |
| `--export` | Export found links to either CSV or JSON. |
| `--verbose` | Display links while scraping. |
| `--tor` | Use Tor proxy (127.0.0.1:9050). |
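
For readers who want to see how such options are typically wired up, here is a minimal `argparse` sketch that mirrors the options listed above. It is an assumption about the shape of the interface, not a copy of the parser in `RavenHunter.py`; the `build_parser` function and the `download` attribute name are made up for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(prog="RavenHunter.py")
    parser.add_argument("-dl", dest="download", action="store_true",
                        help="Download found documents.")
    parser.add_argument("--timeout", type=int, default=10,
                        help="Seconds to wait between page scrapes.")
    parser.add_argument("--resume", action="store_true",
                        help="Resume from the last scraped page.")
    parser.add_argument("--export", choices=["csv", "json"],
                        help="Export found links to CSV or JSON.")
    parser.add_argument("--verbose", action="store_true",
                        help="Display links while scraping.")
    parser.add_argument("--tor", action="store_true",
                        help="Use the Tor proxy at 127.0.0.1:9050.")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```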

Advanced Usage

Scraping Multiple Pages

The script scrapes multiple pages of documents, using Playwright to automate the browser and BeautifulSoup to parse the HTML. To start from a specific page or otherwise control the process, modify the `resume_page` variable in the script.
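
The parsing step can be pictured with the sketch below, which collects PDF links from a page's HTML. To stay dependency-free it uses the standard library's `html.parser` as a stand-in for BeautifulSoup, and it assumes the HTML has already been rendered (for example by Playwright). The class and function names are hypothetical and are not taken from `RavenHunter.py`.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PdfLinkParser(HTMLParser):
    """Collect absolute URLs of <a> tags whose href ends in .pdf."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.lower().endswith(".pdf"):
            # Resolve relative links against the page URL.
            url = urljoin(self.base_url, href)
            if url not in self.links:  # skip duplicates, keep order
                self.links.append(url)


def extract_pdf_links(html: str, base_url: str) -> list[str]:
    """Return the unique PDF links found in html, in document order."""
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    return parser.links
```

With BeautifulSoup the same idea would be a loop over `soup.find_all("a", href=True)` with the same `.pdf` filter.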

🧅 Using RavenHunter with Tor (--tor)

To scrape CIA documents anonymously through the Tor network, enable the --tor option. As noted under ATTENTION, the CIA currently blocks Tor traffic, so requests may fail:

python RavenHunter.py --tor

This will use a SOCKS5 proxy at 127.0.0.1:9050 (default Tor config).

❗ Important:

Make sure the Tor service is running locally before using this option.
Linux (Debian/Ubuntu): Install and start Tor via:

sudo apt install tor
sudo systemctl start tor

Windows: Download and run the Tor Expert Bundle. Ensure that tor.exe is active (as a background process or service).

If the Tor service is not running, RavenHunter will exit with a clear error message and instructions.
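
A pre-flight check of this kind could look like the sketch below: it tries to open a TCP connection to the SOCKS port and builds a proxy mapping in the format the `requests` library accepts. The function names are hypothetical, and whether RavenHunter performs its check this way is an assumption. Note that `requests` only supports SOCKS proxies when installed with the `requests[socks]` extra.

```python
import socket

TOR_HOST = "127.0.0.1"
TOR_PORT = 9050  # default SOCKS port of a local Tor service


def tor_is_reachable(host: str = TOR_HOST, port: int = TOR_PORT,
                     timeout: float = 3.0) -> bool:
    """Return True if something accepts TCP connections on the Tor port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def tor_proxies(host: str = TOR_HOST, port: int = TOR_PORT) -> dict:
    """Build a proxies mapping for requests."""
    # socks5h lets the proxy resolve hostnames, so DNS lookups also use Tor.
    proxy = f"socks5h://{host}:{port}"
    return {"http": proxy, "https": proxy}


if __name__ == "__main__":
    if not tor_is_reachable():
        raise SystemExit("Tor is not running on 127.0.0.1:9050. Start it first.")
    print(tor_proxies())
```

An open port only shows that some service is listening there; it does not prove that the service is Tor.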

Exporting Found Links

If you wish to export the found links without scraping new data, you can use the --export flag. This will create a CSV or JSON file with all the links that have been previously scraped.

Example Export Command

python RavenHunter.py --export csv

This command will generate a CSV file with all found links in the format:

Generated by *RavenHunter*

URL: https://www.cia.gov/readingroom/docs/CIA-RDP80B01676R000200020004-6.pdf
URL: https://www.cia.gov/readingroom/docs/CIA-RDP80B01676R000100150049-4.pdf
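
As a generic illustration of exporting a list of links, the sketch below serializes them to CSV or JSON with the standard library. The layout it produces (a single `url` column, a `links` key) is an assumption chosen for simplicity and differs from the sample output above; `export_links` is a hypothetical helper, not a function from `RavenHunter.py`.

```python
import csv
import io
import json


def export_links(links: list[str], fmt: str) -> str:
    """Serialize links as CSV or JSON text (illustrative layout)."""
    if fmt == "json":
        return json.dumps({"links": list(links)}, indent=2)
    if fmt == "csv":
        buffer = io.StringIO()
        writer = csv.writer(buffer, lineterminator="\n")
        writer.writerow(["url"])  # header row
        writer.writerows([link] for link in links)
        return buffer.getvalue()
    raise ValueError(f"unsupported export format: {fmt!r}")
```

Returning a string instead of writing a file directly keeps the function easy to test; the caller decides where the output goes.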

Notes on Scraping `.gov` Websites

When scraping .gov websites, you must be mindful of the following:

Respectful Usage:
- Avoid excessive scraping that can overload the server. Always use a reasonable delay between requests (e.g., 30 seconds, set with `--timeout`).
- Check the site’s robots.txt file for any scraping restrictions.
Legal Considerations:
- Ensure that your activities comply with local laws and government regulations when scraping publicly available data.
- The Freedom of Information Act (FOIA) is often the basis for accessing documents from .gov domains, so make sure to follow the correct procedures.
IP Blocking and Rate Limiting:
- Some government sites rate-limit requests, present CAPTCHAs, or block IP addresses that show suspicious activity. Be prepared to handle these scenarios, for example by slowing down and retrying later.
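
The points above can be sketched in code. The example below checks a URL against robots.txt rules with the standard library's `urllib.robotparser` and computes a growing delay for retries after a rate-limit response. Both helper names and the default values (30 second base, 600 second cap) are assumptions for illustration, not RavenHunter behavior.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Check a URL against the text of an already downloaded robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def backoff_delay(attempt: int, base: float = 30.0, cap: float = 600.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based), doubling each time."""
    return min(cap, base * (2 ** attempt))
```

A scraper would call `is_allowed` before each request and sleep for `backoff_delay(attempt)` seconds whenever the server signals that it is being hit too often.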

License

RavenHunter is licensed under the MIT License. See the LICENSE file for more details.

Note: This tool is for educational and research purposes only. Always ensure you are adhering to all applicable laws and regulations when scraping or interacting with websites.
