A comprehensive document processing pipeline, built as a Jupyter Notebook, that transforms scanned documents, PDFs, and images into fully searchable, metadata-rich, and forensically verified outputs. It is designed for document digitization, forensic analysis, and accessibility enhancement.
- Input Formats: PDF, TIFF, JPEG, PNG, ZIP archives
- Output Format: Searchable PDF with embedded OCR text layer
- Automatic format normalization to ensure consistent pipeline processing
- High-Precision OCR: Uses Tesseract with 300 DPI conversion for accuracy
- Deep Metadata Extraction: Extracts PDF metadata, XMP data, EXIF tags, and file signatures
- Cryptographic Hashing: SHA-256 hash calculation for document integrity verification
- Forensic Support: Hex signature extraction and comprehensive metadata logging
- Interactive GUI for reviewing and correcting OCR-extracted text
- Page-by-page verification interface
- Quality control before final document finalization
- Embeds invisible OCR text layer into PDFs for full-text search capability
- Maintains document appearance while adding searchable content
- Preserves original file integrity with validation checksums
- Process multiple documents in a single session
- Support for ZIP archive extraction and processing
- Organized export with metadata alongside processed documents
- Automated workspace cleanup for batch resets
1. Environment Setup → Install dependencies (Tesseract, OCR tools, Python libraries)
2. File Upload → Upload documents (PDF, TIFF, images, or ZIP files)
3. Archive Extraction → Automatically extract ZIP files and validate contents
4. Format Normalization → Convert all formats (TIFF, images) to standardized PDF
5. Metadata & Hash Extraction → Extract deep metadata and calculate document hash
6. High-Precision OCR → Convert pages to 300 DPI images and extract text
7. Human Verification → Review and correct OCR text via interactive interface
8. Embed Hidden Text Layer → Inject verified OCR text as invisible searchable layer
9. Final Export → Package processed PDFs with metadata and provide download
- Python 3.x (Google Colab recommended)
- Tesseract OCR: System-level dependency
- Supporting Libraries: `pytesseract`, `pdf2image`, `PyMuPDF`, `pikepdf`, `pdfplumber`, `Pillow`, `exifread`
- Open the Notebook: Click the "Open in Colab" badge or open the repository's `Document_Processing_Pipeline.ipynb`
- Run Environment Setup: Execute the first code cell to install all system and Python dependencies
- Follow Sequential Cells: Run cells in order from top to bottom (they are numbered 1–9)
```bash
# Clone the repository
git clone https://github.com/carlymariec/document_processing_pipeline.git
cd document_processing_pipeline
# Install dependencies
pip install pytesseract pdf2image PyMuPDF pikepdf pdfplumber Pillow exifread ipywidgets
# Install Tesseract (macOS)
brew install tesseract
# Install Tesseract (Ubuntu/Debian)
sudo apt-get install tesseract-ocr libtesseract-dev libleptonica-dev
# Launch Jupyter Notebook
jupyter notebook Document_Processing_Pipeline.ipynb
```
- Click the Upload Documents button
- Select multiple files: PDFs, TIFFs, images (JPG/PNG), or ZIP archives
- Files are hashed and logged immediately for integrity tracking
- ZIP files are automatically extracted
- Valid documents (PDF, TIFF, JPG, PNG) are identified and processed
- Invalid files are skipped with error logging
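The extraction and validation step can be sketched with the standard library alone. This is a minimal illustration, not the notebook's actual code: the helper name `extract_valid_documents`, the logger name, and the flat output layout are assumptions.

```python
import logging
import shutil
import zipfile
from pathlib import Path

# Extensions accepted by the pipeline, per the supported input formats.
VALID_EXTENSIONS = {".pdf", ".tif", ".tiff", ".jpg", ".jpeg", ".png"}

logger = logging.getLogger("pipeline")  # hypothetical logger name


def extract_valid_documents(zip_path, dest_dir):
    """Extract supported documents from a ZIP and return their paths.

    Hypothetical helper: the name and return shape are assumptions.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            name = Path(info.filename)
            if name.suffix.lower() not in VALID_EXTENSIONS:
                logger.error("Skipping unsupported file: %s", info.filename)
                continue
            # Keep only the base name so a crafted entry such as
            # "../../x.pdf" cannot be written outside dest.
            target = dest / name.name
            with archive.open(info) as src, open(target, "wb") as out:
                shutil.copyfileobj(src, out)
            extracted.append(target)
    return extracted
```

Flattening entries to their base names keeps the sketch short, but two archive members with the same file name would overwrite each other; a fuller version would need to rename on collision.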
- Images and TIFF files are converted to PDF format
- PDFs remain unchanged
- Enables unified processing downstream
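One way the image-to-PDF conversion could look, using Pillow's PDF writer. This is a sketch under assumptions: the helper name `image_to_pdf` is hypothetical, and the notebook's `normalize_to_pdf(registry)` may be implemented differently (for example with PyMuPDF). The `*_normalized.pdf` naming follows the output files listed in this README.

```python
from pathlib import Path

from PIL import Image, ImageSequence


def image_to_pdf(image_path, dpi=300):
    """Convert a single- or multi-page image to `<stem>_normalized.pdf`.

    Hypothetical helper, not the notebook's actual function.
    """
    src = Path(image_path)
    out = src.with_name(f"{src.stem}_normalized.pdf")
    with Image.open(src) as img:
        # Multi-page TIFFs expose one frame per page. Converting to RGB
        # copies each frame and drops alpha channels the PDF writer rejects.
        frames = [frame.convert("RGB") for frame in ImageSequence.Iterator(img)]
    frames[0].save(
        out,
        "PDF",
        resolution=float(dpi),
        save_all=True,
        append_images=frames[1:],
    )
    return out
```

Files that are already PDFs would bypass this helper entirely, matching the "PDFs remain unchanged" behavior.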
- Extracts PDF metadata, XMP data, and EXIF information
- Calculates SHA-256 hash for each document
- Logs all metadata to `*_metadata.txt` files
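The hashing and logging described here can be approximated with `hashlib`. The sketch below is illustrative only: the `chunk_size` parameter, the `hex_signature` and `write_metadata_log` helpers, and the log layout are assumptions, and it omits the PDF, XMP, and EXIF extraction the notebook performs.

```python
import hashlib
from pathlib import Path


def calculate_hash(file_path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in chunks so large scans are never fully loaded into memory.

    `chunk_size` is an assumption added for this sketch.
    """
    digest = hashlib.new(algorithm)
    with open(file_path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hex_signature(file_path, length=8):
    """Return the first `length` bytes as uppercase hex (the file's magic number)."""
    with open(file_path, "rb") as fh:
        return fh.read(length).hex(" ").upper()


def write_metadata_log(file_path):
    """Write `<stem>_metadata.txt` next to the file (hypothetical log layout)."""
    src = Path(file_path)
    log_path = src.with_name(f"{src.stem}_metadata.txt")
    lines = [
        f"File: {src.name}",
        f"Size (bytes): {src.stat().st_size}",
        f"SHA-256: {calculate_hash(src)}",
        f"Hex signature: {hex_signature(src)}",
    ]
    log_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return log_path
```

Recomputing the hash later and comparing it with the logged value is what allows a changed file to be detected.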
- Converts PDF pages to 300 DPI high-resolution images
- Uses Tesseract to extract text with high precision
- Progress shown page-by-page
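A rough sketch of the OCR loop, assuming `pdf2image.convert_from_path` for rasterization and `pytesseract.image_to_string` for recognition. The function name `ocr_pdf` and the injectable `rasterize` and `recognize` parameters are assumptions made so the loop can be exercised without Tesseract or Poppler installed; the notebook's `perform_ocr(registry)` may be structured differently.

```python
def ocr_pdf(pdf_path, dpi=300, lang="eng", rasterize=None, recognize=None):
    """Return one OCR text string per page of the PDF.

    Hypothetical helper. `rasterize` and `recognize` default to the real
    libraries but can be swapped out, for example in tests.
    """
    if rasterize is None:
        # pdf2image shells out to Poppler, which must be installed separately.
        from pdf2image import convert_from_path
        rasterize = convert_from_path
    if recognize is None:
        # pytesseract wraps the system-level tesseract binary.
        import pytesseract
        recognize = pytesseract.image_to_string

    pages = rasterize(pdf_path, dpi=dpi)
    texts = []
    for number, image in enumerate(pages, start=1):
        texts.append(recognize(image, lang=lang))
        print(f"OCR: page {number}/{len(pages)} done")
    return texts
```

Rasterizing a whole document at 300 DPI holds every page image in memory at once, so very long PDFs may need to be processed in page ranges.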
- Review extracted OCR text page-by-page
- Edit text directly in the interface if needed
- Mark pages as approved or corrected
- Injects verified OCR text as an invisible layer in the PDF
- Text is hidden but fully searchable and selectable
- Document appearance remains unchanged
- Click Download Processed Files to get a ZIP archive
- Archive contains:
  - Searchable PDFs (`*_searchable.pdf`)
  - Metadata logs with hashes (`*_metadata.txt`)
  - Complete audit trail
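Packaging the export might look like the sketch below, which collects the two output patterns into `Forensic_Processed_Docs.zip`. The helper name `build_export_zip` and the flat working directory are assumptions; the notebook's `export_results(registry)` reads from its registry instead.

```python
import zipfile
from pathlib import Path

# Output patterns taken from this README's list of exported files.
EXPORT_PATTERNS = ("*_searchable.pdf", "*_metadata.txt")


def build_export_zip(work_dir, archive_name="Forensic_Processed_Docs.zip"):
    """Zip searchable PDFs and metadata logs found in work_dir.

    Hypothetical helper, not the notebook's actual function.
    """
    work = Path(work_dir)
    archive_path = work / archive_name
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for pattern in EXPORT_PATTERNS:
            for path in sorted(work.glob(pattern)):
                # Store files flat, without the working-directory prefix.
                archive.write(path, arcname=path.name)
    return archive_path
```

Intermediate `*_normalized.pdf` files are deliberately left out of the archive in this sketch.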
- Run the workspace erasure function to prepare for the next batch
- Removes all temporary and processed files
- Resets the registry for fresh processing
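A workspace reset along these lines could be as small as the following. This is a sketch that assumes the registry is a plain dictionary and all working files live under one directory; the function name `reset_workspace` and both parameters are hypothetical, since the notebook's `erase_workspace()` takes no arguments.

```python
import shutil
from pathlib import Path


def reset_workspace(work_dir, registry):
    """Delete everything under work_dir and empty the registry in place."""
    work = Path(work_dir)
    if work.exists():
        shutil.rmtree(work)
    # Recreate the empty directory so the next batch can start immediately.
    work.mkdir(parents=True)
    registry.clear()
```

Download the export archive before running a reset like this, because the deletion is not reversible.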
| File Type | Description |
|---|---|
| `*_searchable.pdf` | Final processed PDF with embedded OCR text layer |
| `*_metadata.txt` | Comprehensive metadata log including hash values |
| `*_normalized.pdf` | Intermediate PDF after format normalization |
| `Forensic_Processed_Docs.zip` | Complete batch export ready for download |
- `calculate_hash(file_path, algorithm='sha256')`: Computes a cryptographic hash for document integrity verification
- `extract_metadata(file_path)`: Extracts deep metadata, EXIF, hex signatures, and file headers
- `normalize_to_pdf(registry)`: Converts all image formats to standardized PDFs
- `perform_ocr(registry)`: Runs high-precision Tesseract OCR at 300 DPI resolution
- `embed_hidden_text(registry)`: Injects an invisible, searchable OCR text layer into PDFs
- `export_results(registry)`: Packages final outputs and metadata for batch download
- `erase_workspace()`: Clears all files and resets for new batch processing
- Document Digitization: Convert paper documents to searchable digital format
- Forensic Analysis: Extract and verify metadata for legal investigations
- Accessibility Enhancement: Make scanned documents searchable for all users
- Records Management: Batch process and organize archived documents
- Research: Extract text from historical documents while preserving originals
- Resolution: 300 DPI (high precision)
- Engine: Tesseract OCR with English language model
- Output: Character-level text extraction
- Default: SHA-256
- Purpose: Verify document integrity and detect tampering
- Processed through `pikepdf` for structural validation and optimization
- Maintains compatibility with standard PDF readers
| Library | Purpose |
|---|---|
| `pytesseract` | Tesseract OCR interface |
| `pdf2image` | PDF to image conversion |
| `PyMuPDF` (`fitz`) | PDF creation and manipulation |
| `pikepdf` | PDF validation and optimization |
| `pdfplumber` | PDF text extraction (future enhancement) |
| `Pillow` | Image processing |
| `exifread` | EXIF metadata extraction |
| `ipywidgets` | Interactive UI components |
- OCR Accuracy: Depends on document quality; handwritten text may not extract well
- Processing Time: Grows with page count and batch size, since every page is rasterized at 300 DPI and then passed through OCR
- Google Colab Quotas: Free tier has storage and runtime limits
- File Size: Individual files should be under 100 MB for optimal performance
- Language Support: Currently configured for English; other languages require Tesseract model installation
| Issue | Solution |
|---|---|
| Tesseract not found | Ensure system-level installation completed in first cell |
| OCR quality poor | Check PDF quality; verify 300 DPI conversion ran successfully |
| Memory errors on large batches | Process fewer files per batch; clear workspace between runs |
| Metadata extraction fails | File may be corrupted; try re-uploading |
| Searchable PDF not working | Verify hidden text embedding completed; check PDF with different reader |
- Multi-language OCR support
- Machine learning-based document classification
- Advanced text segmentation and layout analysis
- Batch API endpoint for programmatic access
- Direct integration with cloud storage services
- Performance optimization for very large batches
This project is open source. Please check the repository for the specific license.
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Submit a pull request with detailed descriptions
For bugs, feature requests, or questions:
- Open an issue on the GitHub repository
- Include error logs and sample files when applicable
If you use this pipeline in your research or work, please cite:
Document Processing Pipeline by Carly Marie C.
https://github.com/carlymariec/document_processing_pipeline
Last Updated: June 2026
Python Version: 3.x
Primary Environment: Google Colab
Notebook Version: Latest (Document_Processing_Pipeline.ipynb)