Simple, reliable PDF preprocessing for accessibility compliance
A streamlined toolkit that transforms academic and historical PDFs into accessibility-ready documents. It prepares PDFs for manual tagging in Adobe Acrobat Pro.
This project provides two essential components for PDF accessibility workflows:
- Processor - prepares PDFs for accessibility tagging:
  - OCR text layer generation (invisible text for screen readers)
  - Orphan tag stripping (critical for Acrobat compatibility)
  - PDF/UA-1 metadata compliance
  - Batch processing with folder preservation
- Scraper - downloads academic papers from digital repositories:
  - Purdue ETD collection support
  - Automated bulk downloading
  - Metadata preservation
System Requirements:
- Python 3.9+
- Tesseract OCR
Install:

```shell
# macOS
brew install python tesseract

# Linux (Ubuntu/Debian)
sudo apt-get install python3 python3-pip tesseract-ocr

# Windows
choco install python tesseract
```

Setup:

```shell
# Clone repository
git clone https://github.itap.purdue.edu/schipp0/ePubs_Main.git
cd ePubs_Main

# Create virtual environment
python3 -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Verify
python3 -m src.processor.main_simple version
```

Single file:
```shell
python3 -m src.processor.main_simple process \
  input.pdf output.pdf \
  --title "Document Title" \
  --author "Author Name" \
  --language en-US
```

Batch processing:
```shell
python3 -m src.processor.main_simple batch-process \
  data/input/ data/output/ \
  --workers 2
```

Common options:

- `--skip-ocr` - Skip OCR if text already exists
- `--force-ocr` - Force OCR even if text exists
- `--verbose` - Show detailed logging
- `--workers N` - Use N parallel workers for batch runs
- `--dpi 400` - Higher DPI for degraded scans
- `--ocr-lang fra` - Non-English OCR (French example)
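The `--skip-ocr` / `--force-ocr` decision defaults to a simple heuristic: OCR runs only when pages carry too little extractable text (about 100 characters per page, tunable with `--text-threshold`). A sketch of that check, with illustrative names; the real tool extracts the per-page counts with PyMuPDF:

```python
def needs_ocr(chars_per_page, threshold=100):
    """Text-layer heuristic (sketch): OCR is needed when the average
    number of extracted characters per page falls below the threshold.
    The function name is illustrative, not the project's actual API."""
    if not chars_per_page:
        return True  # no pages or no extractable text at all
    avg = sum(chars_per_page) / len(chars_per_page)
    return avg < threshold

print(needs_ocr([0, 0, 12]))     # → True (scanned pages, almost no text)
print(needs_ocr([1800, 2100]))   # → False (born-digital text layer)
```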
Single collection:

```shell
python3 -m src.scraper.purdue_collection_downloader download \
  --url "https://docs.lib.purdue.edu/roadschool/" \
  --output ~/Downloads
```

Multiple collections from file:

```shell
python3 -m src.scraper.purdue_collection_downloader download \
  --url-file src/scraper/urls.txt
```

Common options:

- `--url` - Single collection URL (can be used multiple times)
- `--url-file` - Text file with URLs (one per line)
- `--output` - Output directory (defaults to the config.ini setting)
- `--username` - Purdue username for authentication
- `--verbose` - Show detailed logging
- `--dry-run` - Preview what would be downloaded
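The `--url-file` format is plain text with one URL per line. A sketch of how such a file might be parsed; the function name is illustrative, and skipping blank lines and `#` comments is an assumption, not documented behavior of the downloader:

```python
def load_urls(lines):
    """Parse urls.txt-style input: one collection URL per line.
    Blank lines are skipped; treating '#' lines as comments is an
    assumption made for this sketch."""
    urls = []
    for line in lines:
        line = line.strip()
        if line and not line.startswith("#"):
            urls.append(line)
    return urls

print(load_urls([
    "https://docs.lib.purdue.edu/roadschool/",
    "",
    "# next collection",
    "https://docs.lib.purdue.edu/icec/",
]))
# → ['https://docs.lib.purdue.edu/roadschool/', 'https://docs.lib.purdue.edu/icec/']
```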
Process existing PDFs:

```shell
# 1. Organize PDFs
mkdir -p data/input
cp ~/Documents/papers/*.pdf data/input/

# 2. Batch process
python3 -m src.processor.main_simple batch-process \
  data/input/ data/output/ \
  --workers 2 --verbose

# 3. Check results
cat data/output/batch_processing_summary.json

# 4. Open in Acrobat Pro for manual tagging
open -a "Adobe Acrobat Pro" data/output/paper1.pdf
```

Download, then process:

```shell
# 1. Download papers
python3 -m src.scraper.purdue_collection_downloader download \
  --url-file src/scraper/urls.txt \
  --output data/input/

# 2. Process
python3 -m src.processor.main_simple batch-process \
  data/input/ data/output/ \
  --workers 4

# 3. Tag in Acrobat
```

Large batches with logging:

```shell
# Process in batches with logging
python3 -m src.processor.main_simple batch-process \
  data/input/batch1/ data/output/batch1/ \
  --workers 4 --verbose 2>&1 | tee logs/batch1.log

# Check failures
jq '.failed_files' data/output/batch1/batch_processing_summary.json
```

Old scanned PDFs with no text layer:
```shell
python3 -m src.processor.main_simple batch-process \
  scanned_docs/ enhanced_docs/ \
  --force-ocr --dpi 400 --workers 2
```

PDFs with text needing PDF/UA compliance only:

```shell
python3 -m src.processor.main_simple batch-process \
  modern_pdfs/ compliant_pdfs/ \
  --skip-ocr --workers 4
```

Automatic detection (100 chars/page threshold):

```shell
python3 -m src.processor.main_simple batch-process \
  mixed_pdfs/ output_pdfs/ \
  --text-threshold 100 --workers 2
```

French papers example:

```shell
# Install French language data first:
# macOS: brew install tesseract-lang
# Linux: sudo apt-get install tesseract-ocr-fra
python3 -m src.processor.main_simple batch-process \
  french_papers/ output/ \
  --ocr-lang fra --language fr-FR
```

Limited RAM with large PDFs:

```shell
python3 -m src.processor.main_simple batch-process \
  large_pdfs/ output/ \
  --workers 1
```

- ✅ OCR text layer (invisible - for screen readers only)
- ✅ Orphan tag stripping (removes malformed marked content)
- ✅ PDF/UA-1 metadata
- ✅ Document properties (title, language, author)
- ✅ Clean structure for Acrobat manual tagging
- ✅ Marked content "hooks" for Acrobat to link text to tags
- ❌ No auto-generated accessibility tags (do this in Acrobat)
After OCR with ocrmypdf's hocr renderer:
- Preserves: Marked content operators (BDC/EMC) - these are "hooks" for Acrobat
- Removes: Incomplete StructTreeRoot that interferes with manual tagging
- Removes: StructParents (orphan page references)
- Result: Clean PDF structure where Acrobat can link invisible text to tags
For non-OCR PDFs with orphan tags:
- Removes: All malformed marked content operators
- Removes: StructParents without valid structure tree
- Result: Clean slate for Acrobat's tagging system
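The cleanup described above amounts to two deletions: the incomplete structure tree on the document catalog, and the orphan back-references on each page. A minimal sketch over plain dictionaries standing in for PDF objects (the real tool operates on pikepdf objects, and leaves BDC/EMC marked-content operators in the page streams untouched):

```python
def strip_orphan_structure(catalog, pages):
    """Remove the incomplete structure tree and orphan per-page
    references. `catalog` and `pages` are plain dicts standing in
    for pikepdf objects in this sketch."""
    catalog.pop("/StructTreeRoot", None)  # incomplete tree interferes with Acrobat
    for page in pages:
        page.pop("/StructParents", None)  # orphan back-reference into the removed tree

catalog = {"/Type": "/Catalog", "/StructTreeRoot": {}}
pages = [{"/Type": "/Page", "/StructParents": 0}]
strip_orphan_structure(catalog, pages)
print(catalog)  # → {'/Type': '/Catalog'}
print(pages)    # → [{'/Type': '/Page'}]
```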
```
ePubs_Main/
├── README.md                          # This file
├── requirements.txt                   # Consolidated dependencies (12 packages)
├── pyproject.toml                     # Package configuration
│
├── src/
│   ├── processor/                     # PDF enhancement toolkit (6 files)
│   │   ├── main_simple.py             # CLI (234 lines)
│   │   ├── simple_ocr_enhancer.py     # OCR processing
│   │   ├── simple_pdfua_enhancer.py   # PDF/UA compliance
│   │   ├── simple_enhancement_service.py  # Orchestration
│   │   └── simple_batch_processor.py  # Batch processing
│   │
│   └── scraper/                       # Document acquisition
│       ├── purdue_downloader.py
│       ├── optimized_downloader.py
│       └── config.ini
│
├── data/
│   ├── input/                         # Source PDFs
│   └── output/                        # Enhanced PDFs
│
├── logs/                              # Processing logs
├── tests/                             # Integration tests
│
└── archive/phase2b/                   # Archived Phase 2B code (~9,000 lines)
    └── README.md
```
Stage 2 & 3 Refactoring (2025-10-29):
- Reduced CLI from 1,092 → 234 lines (78% reduction)
- Reduced dependencies from 33+ → 12 packages (67% reduction)
- Archived ~9,000 lines of Phase 2B complexity
- Focus on two core components: processor + scraper
Pipeline:
Input PDF → OCR Enhancement → PDF/UA-1 Enhancement → Output PDF
Simple, sequential processing. No ML, no caching, no complex orchestration.
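The two-stage sequence can be sketched as a simple composition; the function and parameter names here are illustrative, not the project's actual API:

```python
def enhance(input_pdf, ocr_step, pdfua_step):
    """Run the two enhancement stages in order and return the result
    path. `ocr_step` and `pdfua_step` are placeholders for the OCR and
    PDF/UA enhancers; each takes a path and returns its output path."""
    with_text = ocr_step(input_pdf)    # stage 1: invisible OCR text layer
    compliant = pdfua_step(with_text)  # stage 2: PDF/UA-1 metadata + tag cleanup
    return compliant

# Example with stub stages that just rename the file:
result = enhance(
    "input.pdf",
    ocr_step=lambda p: p.replace(".pdf", ".ocr.pdf"),
    pdfua_step=lambda p: p.replace(".pdf", ".ua.pdf"),
)
print(result)  # → input.ocr.ua.pdf
```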
Core Dependencies:
- PyMuPDF - PDF manipulation
- pikepdf - PDF structure modification
- ocrmypdf - OCR with proper text positioning
- Pillow/numpy - Image processing
- pytesseract - Tesseract OCR interface
- typer + rich - CLI and terminal output
- tqdm - Progress bars
- pyyaml, python-decouple - Configuration
- python-magic - File type detection
Total: 12 packages (vs 33+ in Phase 2B)
"ModuleNotFoundError: No module named 'src'"
```shell
# Ensure you're in the project root
cd /path/to/ePubs_Main

# Activate virtual environment
source venv/bin/activate

# Reinstall dependencies
pip install -r requirements.txt
```

"TesseractNotFoundError"

```shell
# Verify installation
which tesseract  # Should print the Tesseract path

# Install if missing:
# macOS: brew install tesseract
# Linux: sudo apt-get install tesseract-ocr
# Windows: add C:\Program Files\Tesseract-OCR to PATH
```

"Permission denied" errors

```shell
# Check file permissions
ls -l input.pdf

# Make readable
chmod 644 input.pdf

# Ensure the output directory is writable
chmod 755 data/output/
```

Slow processing

```shell
# Check system resources
top  # or htop

# Solutions:
# - Reduce workers: --workers 1
# - Skip OCR if not needed: --skip-ocr
# - Process smaller batches
# - Check disk I/O (SSD vs HDD)
```

PDFs look identical after processing
This is normal! The enhancements are structural:
- OCR text layer (invisible to sighted users)
- PDF/UA metadata (in document properties)
- Orphan tag removal (in PDF structure)
- Marked content operators (for Acrobat tagging)
Verify enhancements:

```shell
# Check PDF properties
pdfinfo output.pdf | grep -i "tagged\|pdfua"

# Or in Python:
python3 -c "import pikepdf; pdf = pikepdf.open('output.pdf'); \
print(f'Tagged: {pdf.Root.get(\"/MarkInfo\", {}).get(\"/Marked\", False)}'); \
print(f'Language: {pdf.Root.get(\"/Lang\", \"Not set\")}')"
```
- Worker count:
  - 2-4 workers for most systems
  - 1 worker for memory-constrained environments
  - Don't exceed the CPU core count
- Skip OCR when possible:
  - Modern PDFs usually already have a text layer
  - Use `--skip-ocr` to save 70-80% of processing time
- Batch size:
  - Process 50-100 PDFs at a time
  - Monitor memory usage
  - Use batch_processing_summary.json to track progress
- OCR quality:
  - Use `--dpi 400` for degraded scans
  - Use `--ocr-lang` for non-English documents
  - Clean scans give better OCR results
- Metadata:
  - Always provide `--title` for better accessibility
  - Include `--author` when known
  - Set the correct `--language` code
- Verification:
  - Spot-check processed PDFs
  - Open them in Acrobat to verify structure
  - Check batch_processing_summary.json for failures
Organization:

```
project/
├── 01_original/   # Source PDFs
├── 02_processed/  # After enhancement
├── 03_tagged/     # After Acrobat tagging
└── 04_final/      # Ready for distribution
```
Naming conventions:
- Use descriptive names
- Include dates or versions
- Avoid spaces (use underscores)
Backup:
- Always keep originals
- Version processed files
- Save batch_processing_summary.json reports
1. Open in Adobe Acrobat Pro
2. Use the Tags panel to add proper structure:
   - Headings (H1, H2, etc.)
   - Paragraphs
   - Lists
   - Tables
   - Figures with alt text
3. Run the Accessibility Checker
4. Fix any issues
5. Export the compliant PDF
The orphan tag stripping and PDF/UA metadata from this tool ensure clean tagging in Acrobat!
This project underwent major simplification:
Phase 2B (Complex):
- 98 Python files, 33+ dependencies
- FAISS vector database, ML training, hierarchical caching
- Performance monitoring dashboard
- Mathematical notation detection
- Result: Too complex to maintain reliably
Current (Simplified):
- 2 core components (processor + scraper)
- 12 dependencies, ~1,100 lines of new code
- Focus on essential functionality
- Result: Simple, reliable, maintainable
See /archive/phase2b/ for archived advanced features.
MIT License
Philosophy: Simple, reliable tools that prepare PDFs for manual accessibility tagging in Adobe Acrobat Pro. We handle the tedious preprocessing (OCR, metadata, tag cleanup); you handle the proper tagging.