A simple, Archive.org-style browser for documents released by Congress.
- Installation Guide - Complete setup instructions for running your own mirror
- API Documentation - Full API reference with examples
- Usage Guide - How to use the document browser
- Features Overview - Key features and capabilities
- Official Context - Congressional sources and official document releases
- Go to the Google Drive folder: https://drive.google.com/drive/folders/1TrGxDGQLDLZu1vvvZDBAh-e7wN3y6Hoz
- Select all files (Ctrl+A or Cmd+A)
- Right-click and choose "Download" - this will create a ZIP file (~71 GB)
- Extract the ZIP file to a folder named `data` in your project directory
- Verify the structure: you should have `data/Prod 01_20250822/VOL00001/IMAGES/` with 12 IMAGES subdirectories
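The structure check in the last step can also be scripted. The following is a minimal sketch, not part of the project: the helper name `count_image_dirs` is hypothetical, and it assumes only the directory layout described above.

```python
from pathlib import Path


def count_image_dirs(images_root: Path) -> int:
    """Count subdirectories of images_root whose name starts with 'IMAGES'."""
    if not images_root.is_dir():
        return 0
    return sum(
        1
        for entry in images_root.iterdir()
        if entry.is_dir() and entry.name.startswith("IMAGES")
    )


if __name__ == "__main__":
    # Path taken from the extraction step above; run from the project directory.
    root = Path("data") / "Prod 01_20250822" / "VOL00001" / "IMAGES"
    found = count_image_dirs(root)
    print(f"{found} IMAGES subdirectories found (expected 12)")
```

If the count is not 12, the ZIP was probably extracted one level too deep or too shallow.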
# Create virtual environment
python -m venv venv
# Activate virtual environment
# On Windows:
venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt

# Run the image indexer (this will scan all 33,572 images)
python index_images.py
# You should see output like:
# Epstein Documents Image Indexer
# ==================================================
# Indexed 33,572 images...
# Indexing complete!

# If you have PDF documents to process, use the PDF explosion tool
python helpers/explode_pdfs.py data/pdf-documents/ data/images/processed/
# Run OCR processing on the new images
python ocr_processor.py --input-dir data/images/processed/
# Index the processed images into the database
python index_images.py --input-dir data/images/processed/

# Start the Flask web server
python app.py
# You should see output like:
# Starting Epstein Documents Browser...
# Browse: http://localhost:8080
# Stats: http://localhost:8080/api/stats
# Press Ctrl+C to stop the server

- Homepage: http://localhost:8080
- API Stats: http://localhost:8080/api/stats
- Document Viewer: http://localhost:8080/view/1
- 33,572 images total
- 30 images with OCR (0.1% processed)
- 1 volume (VOL00001)
- 12 IMAGES subdirectories (IMAGES001 through IMAGES012)
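The "0.1% processed" figure is the OCR count divided by the image total, rounded to one decimal place. As a small worked sketch (the function name `ocr_progress` is hypothetical and not taken from the project code):

```python
def ocr_progress(processed: int, total: int) -> float:
    """Return the share of processed images as a percentage, rounded to one decimal."""
    if total <= 0:
        return 0.0
    return round(100 * processed / total, 1)


# 30 of 33,572 images is about 0.089%, which rounds to 0.1%.
print(ocr_progress(30, 33_572))
```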
- Clean interface - Dark theme with document focus
- Simple navigation - Previous/Next buttons and keyboard shortcuts
- Progress tracking - Visual progress bar showing position in document set
- Zoom controls - Click to zoom, or use +/- buttons
- Fullscreen mode - Press F11 or click fullscreen button
- PDF Explosion - Convert multi-page PDFs to individual images
- Sequential Numbering - Maintain consistent DOJ-OGR naming format
- Quality Assessment - Automatic detection of poor scan quality
- Batch Processing - Handle large document collections efficiently
- Error Detection - Identify and flag pages needing manual review
- Multi-Engine Support - EasyOCR + Tesseract for optimal results
- Quality Scoring - Automatic assessment of OCR accuracy
- Rescan Logic - Automatic reprocessing of poor quality pages
- Context-Aware Search - Full-text search with text excerpts
- Export Options - Multiple output formats for processed text
- Arrow Keys - Navigate between documents
- Home/End - Jump to first/last document
- +/- - Zoom in/out
- Escape - Exit fullscreen
- Filename search - Find documents by name
- Content search - Full-text search across OCR results
- Context excerpts - See search matches with surrounding text
- Quick navigation - Jump to first, middle, or last document
- Random document - Browse randomly
- Statistics - Real-time OCR progress tracking
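To illustrate what a context excerpt is, here is a minimal sketch of extracting one from OCR text. It is an assumption about the general technique, not the application's actual search code; `context_excerpt` and its `width` parameter are hypothetical names.

```python
import re
from typing import Optional


def context_excerpt(text: str, query: str, width: int = 40) -> Optional[str]:
    """Return the first case-insensitive match of query with up to
    `width` characters of surrounding text on each side, or None."""
    if not query:
        return None
    match = re.search(re.escape(query), text, re.IGNORECASE)
    if match is None:
        return None
    start = max(0, match.start() - width)
    end = min(len(text), match.end() + width)
    # Mark truncation so the reader can tell the excerpt is partial.
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return f"{prefix}{text[start:end]}{suffix}"


if __name__ == "__main__":
    page_text = "The quick brown fox jumps over the lazy dog"
    print(context_excerpt(page_text, "FOX", width=6))
```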
The system now includes comprehensive PDF processing capabilities for handling new document dumps:
# Convert PDFs to individual images with proper naming
python helpers/explode_pdfs.py data/new-pdf-dump/ data/images/processed/
# Options:
# --dpi 600 # Higher DPI for better quality
# --start-id 50000 # Custom starting ID

# Process images with OCR
python ocr_processor.py --input-dir data/images/processed/
# Options:
# --quality-check # Enable quality assessment
# --rescan-poor # Automatically rescan poor quality pages

# Index processed images into searchable database
python index_images.py --input-dir data/images/processed/
# Options:
# --source "9-8-25-release" # Tag the source
# --quality-threshold 30 # Set quality threshold

- Sequential Numbering: Maintains DOJ-OGR-00000001.jpg format
- Quality Assessment: Automatic detection of poor scan quality
- Batch Processing: Handle large document collections efficiently
- Error Detection: Identify pages needing manual review
- Mapping Files: Track PDF-to-image conversion for reference
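The sequential naming format can be sketched as follows. This is an illustration of the `DOJ-OGR-00000001.jpg` pattern (prefix plus an eight-digit, zero-padded ID), not the code in `helpers/explode_pdfs.py`; the function names are hypothetical.

```python
import re
from typing import List

# Prefix and eight-digit zero padding as seen in names like DOJ-OGR-00033296.jpg.
FILENAME_PATTERN = re.compile(r"DOJ-OGR-(\d{8})\.jpg")


def page_filename(page_id: int) -> str:
    """Format a numeric page ID as a DOJ-OGR filename."""
    if not 1 <= page_id <= 99_999_999:
        raise ValueError(f"page_id out of range: {page_id}")
    return f"DOJ-OGR-{page_id:08d}.jpg"


def parse_page_id(filename: str) -> int:
    """Extract the numeric page ID from a DOJ-OGR filename."""
    match = FILENAME_PATTERN.fullmatch(filename)
    if match is None:
        raise ValueError(f"not a DOJ-OGR filename: {filename!r}")
    return int(match.group(1))


def assign_filenames(start_id: int, page_count: int) -> List[str]:
    """Assign consecutive filenames to page_count pages, beginning at start_id."""
    return [page_filename(start_id + offset) for offset in range(page_count)]


if __name__ == "__main__":
    names = assign_filenames(start_id=50000, page_count=3)
    print(names)
```

Keeping the ID width fixed means that sorting filenames alphabetically also sorts them numerically.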
epstein-release/
├── data/                              # Document images
│   ├── Prod 01_20250822/              # Original congressional documents
│   │   └── VOL00001/
│   │       └── IMAGES/
│   │           ├── IMAGES001/         # ~3,173 images
│   │           ├── IMAGES002/         # ~3,014 images
│   │           └── ...                # 12 total directories
│   ├── 9-8-25-release/                # New Epstein Estate documents
│   │   ├── Request No. 1.pdf          # 238 pages
│   │   ├── Request No. 2.pdf          # 10 pages
│   │   ├── Request No. 4.pdf          # 9 pages
│   │   └── Request No. 8.pdf          # 99 pages
│   └── images/                        # Processed images
│       └── 9-8-25-release/            # Converted PDF pages
│           ├── DOJ-OGR-00033296.jpg
│           ├── DOJ-OGR-00033297.jpg
│           └── ...                    # 356 total images
├── helpers/                           # Utility scripts
│   ├── explode_pdfs.py                # PDF to images converter
│   ├── venice_integration.py          # AI/LLM integration
│   └── venice_sdk/                    # Venice AI SDK
├── templates/                         # HTML templates
│   ├── base.html                      # Base template
│   ├── index.html                     # Homepage
│   └── viewer.html                    # Document viewer
├── tests/                             # Test suite
│   ├── unit/                          # Unit tests
│   ├── integration/                   # Integration tests
│   └── e2e/                           # End-to-end tests
├── docs/                              # Documentation
│   └── ERROR_DETECTION_RESCAN_PLAN.md
├── app.py                             # Flask web application
├── index_images.py                    # Database indexer
├── ocr_processor.py                   # OCR processing
├── images.db                          # SQLite database
└── README.md                          # This file
- images table - All document metadata (33,572 records)
- directories table - Directory structure (21 directories)
- Indexes - Fast queries by path, volume, type
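To confirm that the indexer populated the database, you can count rows per table without knowing the column layout. The sketch below relies only on SQLite's built-in `sqlite_master` catalog; the helper name `table_row_counts` is hypothetical, and the table names it reports depend on what `index_images.py` actually created.

```python
import os
import sqlite3
from typing import Dict


def table_row_counts(db_path: str) -> Dict[str, int]:
    """Return {table_name: row_count} for every user table in a SQLite file."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
        counts = {}
        for (name,) in rows:
            if name.startswith("sqlite_"):
                continue  # skip SQLite's internal bookkeeping tables
            quoted = '"' + name.replace('"', '""') + '"'
            counts[name] = conn.execute(f"SELECT COUNT(*) FROM {quoted}").fetchone()[0]
        return counts
    finally:
        conn.close()


if __name__ == "__main__":
    # sqlite3.connect() would create a missing file, so check first.
    if os.path.exists("images.db"):
        for table, count in table_row_counts("images.db").items():
            print(f"{table}: {count} rows")
    else:
        print("images.db not found; run index_images.py first")
```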
- Flask - Python web framework
- Bootstrap 5 - Responsive UI
- Font Awesome - Icons
- SQLite - Local database
- Pillow (PIL) - TIF to JPEG conversion
- TIF files - Automatically converted to JPEG for browser compatibility
- JPG files - Served directly
- Quality - 85% JPEG quality for optimal file size vs. readability
Images not displaying:
- Make sure you've run `python index_images.py` first
- Check that the `data` folder exists and contains the documents
- Verify the database file `images.db` was created
TIF files showing as broken:
- This is now fixed! TIF files are automatically converted to JPEG
- If you still see issues, restart the Flask app: `python app.py`
Port 8080 already in use:
- Change the port in `app.py` line 218: `port=8080` to `port=8081`
- Or stop the other service using port 8080
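Before changing the port, it can help to confirm that something is really listening on it. The following standard-library sketch is not part of the project; `port_in_use` is a hypothetical helper that simply attempts a TCP connection.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    for candidate in (8080, 8081):
        state = "in use" if port_in_use(candidate) else "free"
        print(f"Port {candidate} is {state}")
```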
Virtual environment issues:
- Make sure you've activated the venv: `venv\Scripts\activate` (Windows) or `source venv/bin/activate` (macOS/Linux)
- Install dependencies: `pip install -r requirements.txt`
Database errors:
- Delete `images.db` and run `python index_images.py` again
- Make sure you have write permissions in the project directory
- Code: AGPLv3
- Content: CC-BY-SA-4.0
- Original Documents: Public Domain (Congressional Records)
- OCR Processing - Run TrOCR on all 33,572 images
- Text Search - Full-text search across OCR results
- VPS Deployment - 24/7 processing and hosting
- Analysis Tools - Redaction analysis and document categorization
Original documents: https://drive.google.com/drive/folders/1TrGxDGQLDLZu1vvvZDBAh-e7wN3y6Hoz
Note: This is a simple, functional document browser. No complex features, no over-engineering - just clean, fast document browsing like Archive.org.
This application uses environment variables for configuration. Copy .env.example to .env and customize the values for your environment.
- Copy the example environment file: `cp .env.example .env`
- Edit `.env` with your production values: `nano .env`
- Important: Never commit `.env` to git - it contains sensitive information!
- `FLASK_ENV`: Set to `production` for production deployment
- `SECRET_KEY`: Strong secret key for Flask sessions (generate with `openssl rand -hex 32`)
- `DATABASE_PATH`: Path to the SQLite database file
- `DATA_DIR`: Directory containing the document images
- `HOST`: Server host (use `127.0.0.1` for nginx proxy)
- `PORT`: Server port
- `DEBUG`: Enable/disable debug mode
- `TESTING`: Enable/disable testing mode
- The `.env` file is automatically ignored by git
- Use strong, unique secret keys in production
- Never share your `.env` file or commit it to version control
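As a rough illustration of how these variables might be consumed, the sketch below reads them from the process environment. It is not the code in `app.py`: the `load_config` and `as_bool` helpers are hypothetical, the default values are assumptions chosen to match paths and ports mentioned elsewhere in this README, and it assumes the variables have already been exported (it does not parse the `.env` file itself).

```python
import os
import secrets
from typing import Any, Dict, Mapping


def as_bool(value: str) -> bool:
    """Interpret common truthy strings ('1', 'true', 'yes', 'on') as True."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_config(env: Mapping[str, str] = os.environ) -> Dict[str, Any]:
    """Build a settings dict from environment variables (defaults are assumed)."""
    return {
        "FLASK_ENV": env.get("FLASK_ENV", "production"),
        # Fallback mirrors `openssl rand -hex 32`: 32 random bytes as 64 hex chars.
        # A generated key changes on every restart, so set SECRET_KEY in production.
        "SECRET_KEY": env.get("SECRET_KEY") or secrets.token_hex(32),
        "DATABASE_PATH": env.get("DATABASE_PATH", "images.db"),
        "DATA_DIR": env.get("DATA_DIR", "data"),
        "HOST": env.get("HOST", "127.0.0.1"),
        "PORT": int(env.get("PORT", "8080")),
        "DEBUG": as_bool(env.get("DEBUG", "false")),
        "TESTING": as_bool(env.get("TESTING", "false")),
    }


if __name__ == "__main__":
    config = load_config()
    # Avoid printing the secret itself.
    printable = {k: v for k, v in config.items() if k != "SECRET_KEY"}
    print(printable)
```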