@TenthEdict

Summary

This PR adds a local batch processor module that complements the AWS-based PDF accessibility solution by providing offline batch processing capabilities.

Features

  • OCR Enhancement: Adds invisible searchable text layers using Tesseract (via ocrmypdf)
  • PDF/UA-1 Preparation: Adds compliance metadata and markers for accessibility
  • Batch Processing: Process entire directory trees with folder structure preservation
  • Parallel Processing: Multi-threaded processing with ThreadPoolExecutor
  • Progress Tracking: Visual progress bar (tqdm) and JSON summary reports
  • CLI: User-friendly command-line interface built with typer/rich
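
As a rough illustration of how the batch, folder-preservation, and parallel features could fit together, here is a minimal sketch using only the standard library. The names `plan_outputs`, `copy_pdf`, and `run_batch` are hypothetical and not taken from this PR; the worker simply copies the file, standing in for wherever the module actually runs OCR and adds metadata.

```python
import shutil
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Callable, Dict, List, Tuple


def plan_outputs(input_root: Path, output_root: Path) -> List[Tuple[Path, Path]]:
    """Pair every PDF under input_root with a mirrored path under output_root."""
    pairs = []
    for src in sorted(input_root.rglob("*.pdf")):
        # relative_to keeps the sub-folder layout of the input tree
        pairs.append((src, output_root / src.relative_to(input_root)))
    return pairs


def copy_pdf(src: Path, dst: Path) -> None:
    """Placeholder worker (assumption): a real worker would run OCR here."""
    shutil.copyfile(src, dst)


def run_batch(
    input_root,
    output_root,
    worker: Callable[[Path, Path], None] = copy_pdf,
    workers: int = 4,
) -> Dict[str, str]:
    """Process all PDFs in parallel and return a status string per input file."""
    pairs = plan_outputs(Path(input_root), Path(output_root))
    # Create output folders up front so workers never race on mkdir
    for _, dst in pairs:
        dst.parent.mkdir(parents=True, exist_ok=True)

    results: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(worker, src, dst): src for src, dst in pairs}
        for future in as_completed(futures):
            src = futures[future]
            try:
                future.result()
                results[str(src)] = "ok"
            except Exception as exc:  # one bad file should not stop the batch
                results[str(src)] = f"failed: {exc}"
    return results
```

Collecting a status per file, rather than letting the first exception propagate, keeps a single corrupt PDF from aborting a long run; whether the module in this PR behaves the same way is not stated here.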

Use Cases

  • Pre-processing PDFs before uploading to AWS S3
  • Development and testing workflows
  • Offline processing when AWS infrastructure is not available
  • High-volume batch jobs with parallel workers

Quick Start

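# Install dependencies (run from the repository root so that
# `python -m local_batch_processor.cli` can resolve the package)
pip install -r local_batch_processor/requirements.txt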

# Batch process a directory
python -m local_batch_processor.cli batch input_folder/ output_folder/ --workers 4

Test plan

  • Install dependencies from local_batch_processor/requirements.txt
  • Test single file processing with python -m local_batch_processor.cli process
  • Test batch processing with python -m local_batch_processor.cli batch
  • Verify JSON summary report generation
  • Test parallel processing with --workers flag
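
To make the "Verify JSON summary report generation" step concrete, the following sketch shows one way such a report might be written and then checked. The function name `write_summary`, the field names, and the input shape (a dict mapping each file path to `"ok"` or a `"failed: ..."` message) are all assumptions for illustration; the actual report schema is not described in this PR.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Dict


def write_summary(results: Dict[str, str], report_path) -> dict:
    """Write a JSON summary of per-file statuses and return it as a dict."""
    # Anything other than "ok" is treated as a failure message (assumed convention)
    failures = {path: status for path, status in results.items() if status != "ok"}
    summary = {
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "total": len(results),
        "succeeded": len(results) - len(failures),
        "failed": len(failures),
        "failures": failures,
    }
    Path(report_path).write_text(json.dumps(summary, indent=2), encoding="utf-8")
    return summary
```

A verification step could then load the file with `json.loads` and assert that `succeeded + failed == total`.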

🤖 Generated with Claude Code

This module provides local/offline batch processing capabilities complementing
the AWS-based PDF accessibility solution:

Features:
- OCR enhancement with Tesseract (via ocrmypdf) for invisible text layers
- PDF/UA-1 compliance preparation with metadata and markers
- Batch processing with recursive directory walking
- Folder structure preservation in output
- Parallel processing support (ThreadPoolExecutor)
- Progress tracking (tqdm) and JSON summary reports
- CLI built with typer/rich

Use cases:
- Pre-processing PDFs before uploading to AWS S3
- Development and testing workflows
- Offline processing when AWS is not available
- High-volume batch jobs with parallel workers

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>