Simple, reliable PDF preprocessing for accessibility compliance
A streamlined toolkit that transforms academic and historical PDFs into accessibility-ready documents. It prepares PDFs for manual tagging in Adobe Acrobat Pro.
This project provides two essential components for PDF accessibility workflows:
- Processor - prepares PDFs for accessibility tagging:
  - OCR text layer generation (invisible text for screen readers)
  - Orphan tag stripping (critical for Acrobat compatibility)
  - PDF/UA-1 metadata compliance
  - Batch processing with folder preservation
- Scraper - downloads academic papers from digital repositories:
  - Purdue ETD collection support
  - Automated bulk downloading
  - Metadata preservation
System Requirements:
- Python 3.9+
- Tesseract OCR
Install:

```shell
# macOS
brew install python tesseract

# Linux (Ubuntu/Debian)
sudo apt-get install python3 python3-pip tesseract-ocr

# Windows
choco install python tesseract
```

Setup:

```shell
# Clone repository
git clone https://github.itap.purdue.edu/schipp0/ePubs_Main.git
cd ePubs_Main

# Create virtual environment
python3 -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Verify
python3 -m src.processor.main_simple version
```

Single file:
```shell
python3 -m src.processor.main_simple process \
  input.pdf output.pdf \
  --title "Document Title" \
  --author "Author Name" \
  --language en-US
```

Batch processing:
```shell
python3 -m src.processor.main_simple batch-process \
  data/input/ data/output/ \
  --workers 2
```

Common options:

- `--skip-ocr` - Skip OCR if text already exists
- `--force-ocr` - Force OCR even if text exists
- `--verbose` - Show detailed logging
- `--workers N` - Use N parallel workers for batch runs
- `--dpi 400` - Higher DPI for degraded scans
- `--ocr-lang fra` - Non-English OCR (French example)
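The `--skip-ocr` / `--force-ocr` decision defaults to a simple heuristic: OCR runs only when pages carry too little extractable text (about 100 characters per page, tunable with `--text-threshold`). A sketch of that check, with illustrative names; the real tool extracts the per-page counts with PyMuPDF:

```python
def needs_ocr(chars_per_page, threshold=100):
    """Text-layer heuristic (sketch): OCR is needed when the average
    number of extracted characters per page falls below the threshold.
    The function name is illustrative, not the project's actual API."""
    if not chars_per_page:
        return True  # no pages or no extractable text at all
    avg = sum(chars_per_page) / len(chars_per_page)
    return avg < threshold

print(needs_ocr([0, 0, 12]))     # → True (scanned pages, almost no text)
print(needs_ocr([1800, 2100]))   # → False (born-digital text layer)
```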
Single collection:

```shell
python3 -m src.scraper.purdue_collection_downloader download \
  --url "https://docs.lib.purdue.edu/roadschool/" \
  --output ~/Downloads
```

Multiple collections from file:

```shell
python3 -m src.scraper.purdue_collection_downloader download \
  --url-file src/scraper/urls.txt
```

Common options:

- `--url` - Single collection URL (can be used multiple times)
- `--url-file` - Text file with URLs (one per line)
- `--output` - Output directory (defaults to the config.ini setting)
- `--username` - Purdue username for authentication
- `--verbose` - Show detailed logging
- `--dry-run` - Preview what would be downloaded
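The `--url-file` format is plain text with one URL per line. A sketch of how such a file might be parsed; the function name is illustrative, and skipping blank lines and `#` comments is an assumption, not documented behavior of the downloader:

```python
def load_urls(lines):
    """Parse urls.txt-style input: one collection URL per line.
    Blank lines are skipped; treating '#' lines as comments is an
    assumption made for this sketch."""
    urls = []
    for line in lines:
        line = line.strip()
        if line and not line.startswith("#"):
            urls.append(line)
    return urls

print(load_urls([
    "https://docs.lib.purdue.edu/roadschool/",
    "",
    "# next collection",
    "https://docs.lib.purdue.edu/icec/",
]))
# → ['https://docs.lib.purdue.edu/roadschool/', 'https://docs.lib.purdue.edu/icec/']
```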
Process existing PDFs:

```shell
# 1. Organize PDFs
mkdir -p data/input
cp ~/Documents/papers/*.pdf data/input/

# 2. Batch process
python3 -m src.processor.main_simple batch-process \
  data/input/ data/output/ \
  --workers 2 --verbose

# 3. Check results
cat data/output/batch_processing_summary.json

# 4. Open in Acrobat Pro for manual tagging
open -a "Adobe Acrobat Pro" data/output/paper1.pdf
```

Download, then process:

```shell
# 1. Download papers
python3 -m src.scraper.purdue_collection_downloader download \
  --url-file src/scraper/urls.txt \
  --output data/input/

# 2. Process
python3 -m src.processor.main_simple batch-process \
  data/input/ data/output/ \
  --workers 4

# 3. Tag in Acrobat
```

Large batches with logging:

```shell
# Process in batches with logging
python3 -m src.processor.main_simple batch-process \
  data/input/batch1/ data/output/batch1/ \
  --workers 4 --verbose 2>&1 | tee logs/batch1.log

# Check failures
jq '.failed_files' data/output/batch1/batch_processing_summary.json
```

Old scanned PDFs with no text layer:
```shell
python3 -m src.processor.main_simple batch-process \
  scanned_docs/ enhanced_docs/ \
  --force-ocr --dpi 400 --workers 2
```

PDFs with text needing PDF/UA compliance only:

```shell
python3 -m src.processor.main_simple batch-process \
  modern_pdfs/ compliant_pdfs/ \
  --skip-ocr --workers 4
```

Automatic detection (100 chars/page threshold):

```shell
python3 -m src.processor.main_simple batch-process \
  mixed_pdfs/ output_pdfs/ \
  --text-threshold 100 --workers 2
```

French papers example:

```shell
# Install French language data first:
# macOS: brew install tesseract-lang
# Linux: sudo apt-get install tesseract-ocr-fra
python3 -m src.processor.main_simple batch-process \
  french_papers/ output/ \
  --ocr-lang fra --language fr-FR
```

Limited RAM with large PDFs:

```shell
python3 -m src.processor.main_simple batch-process \
  large_pdfs/ output/ \
  --workers 1
```

- ✅ OCR text layer (invisible - for screen readers only)
- ✅ Orphan tag stripping (removes malformed marked content)
- ✅ PDF/UA-1 metadata
- ✅ Document properties (title, language, author)
- ✅ Clean structure for Acrobat manual tagging
- ✅ Marked content "hooks" for Acrobat to link text to tags
- ❌ No auto-generated accessibility tags (do this in Acrobat)
After OCR with ocrmypdf's hocr renderer:
- Preserves: Marked content operators (BDC/EMC) - these are "hooks" for Acrobat
- Removes: Incomplete StructTreeRoot that interferes with manual tagging
- Removes: StructParents (orphan page references)
- Result: Clean PDF structure where Acrobat can link invisible text to tags
For non-OCR PDFs with orphan tags:
- Removes: All malformed marked content operators
- Removes: StructParents without valid structure tree
- Result: Clean slate for Acrobat's tagging system
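The cleanup described above amounts to two deletions: the incomplete structure tree on the document catalog, and the orphan back-references on each page. A minimal sketch over plain dictionaries standing in for PDF objects (the real tool operates on pikepdf objects, and leaves BDC/EMC marked-content operators in the page streams untouched):

```python
def strip_orphan_structure(catalog, pages):
    """Remove the incomplete structure tree and orphan per-page
    references. `catalog` and `pages` are plain dicts standing in
    for pikepdf objects in this sketch."""
    catalog.pop("/StructTreeRoot", None)  # incomplete tree interferes with Acrobat
    for page in pages:
        page.pop("/StructParents", None)  # orphan back-reference into the removed tree

catalog = {"/Type": "/Catalog", "/StructTreeRoot": {}}
pages = [{"/Type": "/Page", "/StructParents": 0}]
strip_orphan_structure(catalog, pages)
print(catalog)  # → {'/Type': '/Catalog'}
print(pages)    # → [{'/Type': '/Page'}]
```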
```
ePubs_Main/
├── README.md                          # This file
├── requirements.txt                   # Consolidated dependencies (12 packages)
├── pyproject.toml                     # Package configuration
│
├── src/
│   ├── processor/                     # PDF enhancement toolkit (6 files)
│   │   ├── main_simple.py             # CLI (234 lines)
│   │   ├── simple_ocr_enhancer.py     # OCR processing
│   │   ├── simple_pdfua_enhancer.py   # PDF/UA compliance
│   │   ├── simple_enhancement_service.py  # Orchestration
│   │   └── simple_batch_processor.py  # Batch processing
│   │
│   └── scraper/                       # Document acquisition
│       ├── purdue_downloader.py
│       ├── optimized_downloader.py
│       └── config.ini
│
├── data/
│   ├── input/                         # Source PDFs
│   └── output/                        # Enhanced PDFs
│
├── logs/                              # Processing logs
├── tests/                             # Integration tests
│
└── archive/phase2b/                   # Archived Phase 2B code (~9,000 lines)
    └── README.md
```
Stage 2 & 3 Refactoring (2025-10-29):
- Reduced CLI from 1,092 → 234 lines (78% reduction)
- Reduced dependencies from 33+ → 12 packages (67% reduction)
- Archived ~9,000 lines of Phase 2B complexity
- Focus on two core components: processor + scraper
Pipeline:
Input PDF → OCR Enhancement → PDF/UA-1 Enhancement → Output PDF
Simple, sequential processing. No ML, no caching, no complex orchestration.
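The two-stage sequence can be sketched as a simple composition; the function and parameter names here are illustrative, not the project's actual API:

```python
def enhance(input_pdf, ocr_step, pdfua_step):
    """Run the two enhancement stages in order and return the result
    path. `ocr_step` and `pdfua_step` are placeholders for the OCR and
    PDF/UA enhancers; each takes a path and returns its output path."""
    with_text = ocr_step(input_pdf)    # stage 1: invisible OCR text layer
    compliant = pdfua_step(with_text)  # stage 2: PDF/UA-1 metadata + tag cleanup
    return compliant

# Example with stub stages that just rename the file:
result = enhance(
    "input.pdf",
    ocr_step=lambda p: p.replace(".pdf", ".ocr.pdf"),
    pdfua_step=lambda p: p.replace(".pdf", ".ua.pdf"),
)
print(result)  # → input.ocr.ua.pdf
```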
Core Dependencies:
- PyMuPDF - PDF manipulation
- pikepdf - PDF structure modification
- ocrmypdf - OCR with proper text positioning
- Pillow/numpy - Image processing
- pytesseract - Tesseract OCR interface
- typer + rich - CLI and terminal output
- tqdm - Progress bars
- pyyaml, python-decouple - Configuration
- python-magic - File type detection
Total: 12 packages (vs 33+ in Phase 2B)
"ModuleNotFoundError: No module named 'src'"
```shell
# Ensure you're in the project root
cd /path/to/ePubs_Main

# Activate virtual environment
source venv/bin/activate

# Reinstall dependencies
pip install -r requirements.txt
```

"TesseractNotFoundError"

```shell
# Verify installation
which tesseract  # Should print the Tesseract path

# Install if missing:
# macOS: brew install tesseract
# Linux: sudo apt-get install tesseract-ocr
# Windows: add C:\Program Files\Tesseract-OCR to PATH
```

"Permission denied" errors

```shell
# Check file permissions
ls -l input.pdf

# Make readable
chmod 644 input.pdf

# Ensure the output directory is writable
chmod 755 data/output/
```

Slow processing

```shell
# Check system resources
top  # or htop

# Solutions:
# - Reduce workers: --workers 1
# - Skip OCR if not needed: --skip-ocr
# - Process smaller batches
# - Check disk I/O (SSD vs HDD)
```

PDFs look identical after processing
This is normal! The enhancements are structural:
- OCR text layer (invisible to sighted users)
- PDF/UA metadata (in document properties)
- Orphan tag removal (in PDF structure)
- Marked content operators (for Acrobat tagging)
Verify enhancements:

```shell
# Check PDF properties
pdfinfo output.pdf | grep -i "tagged\|pdfua"

# Or in Python:
python3 -c "import pikepdf; pdf = pikepdf.open('output.pdf'); \
print(f'Tagged: {pdf.Root.get(\"/MarkInfo\", {}).get(\"/Marked\", False)}'); \
print(f'Language: {pdf.Root.get(\"/Lang\", \"Not set\")}')"
```
- Worker count:
  - 2-4 workers for most systems
  - 1 worker for memory-constrained environments
  - Don't exceed the CPU core count
- Skip OCR when possible:
  - Modern PDFs usually already have a text layer
  - Use `--skip-ocr` to save 70-80% of processing time
- Batch size:
  - Process 50-100 PDFs at a time
  - Monitor memory usage
  - Use batch_processing_summary.json to track progress
- OCR quality:
  - Use `--dpi 400` for degraded scans
  - Use `--ocr-lang` for non-English documents
  - Clean scans give better OCR results
- Metadata:
  - Always provide `--title` for better accessibility
  - Include `--author` when known
  - Set the correct `--language` code
- Verification:
  - Spot-check processed PDFs
  - Open them in Acrobat to verify structure
  - Check batch_processing_summary.json for failures
Organization:

```
project/
├── 01_original/   # Source PDFs
├── 02_processed/  # After enhancement
├── 03_tagged/     # After Acrobat tagging
└── 04_final/      # Ready for distribution
```
Naming conventions:
- Use descriptive names
- Include dates or versions
- Avoid spaces (use underscores)
Backup:
- Always keep originals
- Version processed files
- Save batch_processing_summary.json reports
1. Open in Adobe Acrobat Pro
2. Use the Tags panel to add proper structure:
   - Headings (H1, H2, etc.)
   - Paragraphs
   - Lists
   - Tables
   - Figures with alt text
3. Run the Accessibility Checker
4. Fix any issues
5. Export the compliant PDF
The orphan tag stripping and PDF/UA metadata from this tool ensure clean tagging in Acrobat!
This project underwent major simplification:
Phase 2B (Complex):
- 98 Python files, 33+ dependencies
- FAISS vector database, ML training, hierarchical caching
- Performance monitoring dashboard
- Mathematical notation detection
- Result: Too complex to maintain reliably
Current (Simplified):
- 2 core components (processor + scraper)
- 12 dependencies, ~1,100 lines of new code
- Focus on essential functionality
- Result: Simple, reliable, maintainable
See /archive/phase2b/ for archived advanced features.
MIT License
Philosophy: Simple, reliable tools that prepare PDFs for manual accessibility tagging in Adobe Acrobat Pro. We handle the tedious preprocessing (OCR, metadata, tag cleanup); you handle the proper tagging.