...is a tool designed to scrape all available documents ever published by the CIA under the Freedom of Information Act (FOIA). This guide covers installation, usage, and additional notes for advanced users, including special considerations when scraping `.gov` websites.
The CIA actively blocks scraping traffic from the Tor network, and connections through VPN providers tend to fail with timeout errors. I haven't yet had the time to look into other options, such as implementing proxy rotation. If you want to support me, please send me your criticism and suggestions.
Before you begin, ensure you have Python 3.6+ installed. Then, install the required dependencies by running:

    pip install -r requirements.txt

The dependencies are:

- requests: HTTP requests library.
- tqdm: Progress bar library for Python.
- colorama: Cross-platform support for colored terminal text.
- playwright: For browser automation and scraping.
- beautifulsoup4: HTML parsing library for scraping and navigating web pages.
If you haven't already installed the necessary browsers for Playwright, run:

    playwright install

This downloads the browser binaries Playwright needs, including Chromium. Playwright uses a headless browser to render pages for scraping.
You can start the script using the command line. Here are the available options:

    python RavenHunter.py --help

To begin scraping and storing found links, run:

    python RavenHunter.py

For specific tasks, use the following flags:
- Export found links:
  - Export links to a CSV or JSON file.
  - Example: `python RavenHunter.py --export json`
- Download found documents:
  - Download all documents found in the scraping process.
  - Example: `python RavenHunter.py -dl`
- Resume scraping:
  - Resume the scraping from the last page saved.
  - Example: `python RavenHunter.py --resume`

| Option | Description |
|---|---|
| `-dl` | Downloads found documents. |
| `--timeout` | Sets the timeout in seconds between page scrapes. (Default: 10) |
| `--resume` | Resumes scraping from the last scraped page. |
| `--export` | Exports found links to either CSV or JSON. |
| `--verbose` | Displays links while scraping. |
| `--tor` | Uses the Tor proxy (127.0.0.1:9050). |
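
For readers who want to see how these flags fit together, here is a minimal sketch of how such options could be declared with `argparse`. It is not taken from RavenHunter's source: the `build_parser` function and the destination names are hypothetical, and only the flag names, the `csv`/`json` choices, and the default timeout of 10 come from the table above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the options table (not RavenHunter's actual code)."""
    parser = argparse.ArgumentParser(prog="RavenHunter.py")
    parser.add_argument("-dl", action="store_true",
                        help="Download found documents.")
    parser.add_argument("--timeout", type=int, default=10,
                        help="Seconds to wait between page scrapes.")
    parser.add_argument("--resume", action="store_true",
                        help="Resume scraping from the last scraped page.")
    parser.add_argument("--export", choices=["csv", "json"],
                        help="Export found links to CSV or JSON.")
    parser.add_argument("--verbose", action="store_true",
                        help="Display links while scraping.")
    parser.add_argument("--tor", action="store_true",
                        help="Use the Tor proxy at 127.0.0.1:9050.")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["--export", "json", "--timeout", "30"])
print(args.export, args.timeout, args.dl)
```

Flags can be combined, for example `python RavenHunter.py --resume --timeout 30 --verbose`, assuming the script accepts them together.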
The script is designed to scrape multiple pages of documents. It uses Playwright to automate browser interaction and BeautifulSoup for parsing the HTML content. If you wish to scrape from a specific page or control the process, you can modify the `resume_page` variable in the script.
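
To illustrate the parsing step, the sketch below pulls PDF links out of a page's HTML. It is an approximation and not RavenHunter's code: it swaps BeautifulSoup for the standard library's `html.parser` so it runs without extra dependencies, it works on a hard-coded HTML string instead of a page rendered by Playwright, and it assumes that documents appear as plain `<a href="....pdf">` links. The names `PdfLinkParser` and `extract_pdf_links` are hypothetical.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin


class PdfLinkParser(HTMLParser):
    """Collects absolute URLs of anchor tags whose href ends in .pdf."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.lower().endswith(".pdf"):
            # Resolve relative links against the page URL.
            self.links.append(urljoin(self.base_url, href))


def extract_pdf_links(html: str, base_url: str) -> List[str]:
    """Return every PDF link found in the given HTML, in document order."""
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    return parser.links


# Hand-written sample markup; the real page structure may differ.
sample_html = (
    '<a href="/readingroom/docs/CIA-RDP80B01676R000200020004-6.pdf">Document</a>'
    '<a href="/about">About</a>'
)
print(extract_pdf_links(sample_html, "https://www.cia.gov/readingroom/"))
```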
If you want to scrape CIA documents anonymously through the Tor network, you can enable the `--tor` option:

    python RavenHunter.py --tor

This will use a SOCKS5 proxy at 127.0.0.1:9050 (default Tor config).
❗ Important:

- Make sure the Tor service is running locally before using this option.
- Linux: Install and start Tor via:

      sudo apt install tor
      sudo systemctl start tor

- Windows: Download and run the Tor Expert Bundle. Ensure that `tor.exe` is active (as a background process or service).

If the Tor service is not running, RavenHunter will exit with a clear error message and instructions.
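
If you want to verify this yourself before starting a long run, a check along these lines may help. It is only a sketch, not the check RavenHunter performs: the function name `tor_proxy_reachable` is hypothetical, and it only tests whether something accepts TCP connections on the SOCKS port, not whether that process is really Tor.

```python
import socket


def tor_proxy_reachable(host: str = "127.0.0.1", port: int = 9050,
                        timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the given host and port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if tor_proxy_reachable():
    print("A service is listening on 127.0.0.1:9050.")
else:
    print("Nothing is listening on 127.0.0.1:9050. Is the Tor service running?")
```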
If you wish to export the found links without scraping new data, you can use the `--export` flag. This will create a CSV or JSON file with all the links that have been previously scraped.

    python RavenHunter.py --export csv

This command will generate a CSV file with all found links in the format:

    Generated by *RavenHunter*
    URL: https://www.cia.gov/readingroom/docs/CIA-RDP80B01676R000200020004-6.pdf
    URL: https://www.cia.gov/readingroom/docs/CIA-RDP80B01676R000100150049-4.pdf
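
If you want to post-process such an export, the following sketch reads the URLs back out of it. It assumes the file looks exactly like the sample above (a header line followed by `URL: ` lines); the function name `read_exported_links` is hypothetical and not part of RavenHunter.

```python
from typing import List


def read_exported_links(text: str) -> List[str]:
    """Return the URLs from lines of the form 'URL: <link>', skipping all other lines."""
    prefix = "URL: "
    return [
        line[len(prefix):].strip()
        for line in text.splitlines()
        if line.startswith(prefix)
    ]


sample = """Generated by *RavenHunter*
URL: https://www.cia.gov/readingroom/docs/CIA-RDP80B01676R000200020004-6.pdf
URL: https://www.cia.gov/readingroom/docs/CIA-RDP80B01676R000100150049-4.pdf
"""
links = read_exported_links(sample)
print(len(links))  # → 2
```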
When scraping `.gov` websites, you must be mindful of the following:

- Respectful Usage:
  - Avoid excessive scraping that can overload the server. Always use a reasonable `timeout` between requests (e.g., `30` seconds).
  - Check the site's `robots.txt` file for any scraping restrictions.
- Legal Considerations:
  - Ensure that your activities comply with local laws and government regulations when scraping publicly available data.
  - The Freedom of Information Act (FOIA) is often the basis for accessing documents from `.gov` domains, so make sure to follow the correct procedures.
- IP Blocking and Rate Limiting:
  - Some government sites may implement rate-limiting mechanisms (e.g., CAPTCHAs) or even block IP addresses for suspicious activity. Be prepared to handle these scenarios and implement workarounds if needed.
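
As a rough illustration of the first and last points, the sketch below checks a URL against `robots.txt` rules and computes a growing delay after repeated failures. None of this is part of RavenHunter: the helper names `is_allowed` and `backoff_delay`, the example rules, and the `example.gov` URLs are made up for the demonstration, and exponential backoff is just one common way to react to rate limiting.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Check a URL against robots.txt rules that are supplied as text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def backoff_delay(base_seconds: float, attempt: int,
                  cap_seconds: float = 600.0) -> float:
    """Exponential backoff: base * 2**attempt, never more than the cap."""
    return min(cap_seconds, base_seconds * (2 ** attempt))


# Hypothetical rules; fetch the real robots.txt of the site you scrape.
example_rules = """User-agent: *
Disallow: /private/
"""

print(is_allowed(example_rules, "https://example.gov/docs/a.pdf"))
print(is_allowed(example_rules, "https://example.gov/private/a.pdf"))
print(backoff_delay(30, 2))
```

With a base of 30 seconds, the wait doubles after each blocked or failed attempt until it reaches the cap, which keeps retries from adding load to a server that is already pushing back.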
RavenHunter is licensed under the MIT License. See the LICENSE file for more details.
Note: This tool is for educational and research purposes only. Always ensure you are adhering to all applicable laws and regulations when scraping or interacting with websites.