tfbf/scripture_ocr

PaddleOCR-VL PDF Text & Image Extraction System

A system for extracting text and images from PDF documents using PaddlePaddle's PaddleOCR-VL model. It supports 109 languages and handles complex document elements, including text, tables, formulas, and charts.

Features

  • 🔍 Advanced OCR: Uses PaddleOCR-VL (0.9B parameter model) for high-accuracy text extraction
  • 📄 Complete PDF Processing: Converts PDF pages to images and extracts embedded images
  • 🌍 Multilingual Support: Supports 109 languages including Chinese, English, Arabic, Hindi, etc.
  • 📊 Complex Elements: Handles text, tables, formulas, and charts
  • 🚀 GPU Acceleration: Supports GPU processing and vLLM server acceleration
  • 📋 Multiple Output Formats: JSON, Markdown, CSV, XML, HTML, and plain text
  • 🔄 Batch Processing: Process multiple PDF files efficiently
  • 📈 Progress Tracking: Real-time progress monitoring with detailed statistics

Installation

Prerequisites

  • Python 3.8+
  • CUDA-compatible GPU (recommended)
  • poppler-utils (for PDF to image conversion)

Install System Dependencies

# Ubuntu/Debian
sudo apt-get update
sudo apt-get install poppler-utils

# macOS (with Homebrew)
brew install poppler

# Windows
# Download poppler binaries from: https://github.com/oschwartz10612/poppler-windows/releases/

Install Python Dependencies

# Clone or download this repository
cd paddleOCR

# Install dependencies
pip install -r requirements.txt

# For Linux with CUDA support (recommended)
python -m pip install paddlepaddle-gpu==3.2.0 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
python -m pip install -U "paddleocr[doc-parser]"

# Additional dependency for some systems (this wheel is built for Linux x86_64 only)
python -m pip install https://paddle-whl.bj.bcebos.com/nightly/cu126/safetensors/safetensors-0.6.2.dev0-cp38-abi3-linux_x86_64.whl

Quick Start

Process a Single PDF

python main.py input_pdfs/document.pdf

Process Multiple PDFs

python main.py input_pdfs/ -o results/

Advanced Usage

# Use CPU instead of GPU
python main.py document.pdf --use-cpu

# Skip embedded image extraction
python main.py document.pdf --no-embedded

# Use a vLLM server for acceleration
python main.py document.pdf --use-vllm --vllm-url http://localhost:8080/v1

# Adjust image quality
python main.py document.pdf --dpi 600

# Enable debug logging
python main.py document.pdf --log-level DEBUG

Command Line Options

| Option | Description | Default |
|---|---|---|
| input_path | Path to PDF file or directory | Required |
| -o, --output | Output directory | ./output |
| --dpi | DPI for PDF to image conversion | 300 |
| --no-embedded | Skip extraction of embedded images | False |
| --use-cpu | Use CPU instead of GPU | False |
| --use-vllm | Use vLLM server for acceleration | False |
| --vllm-url | vLLM server URL | http://127.0.0.1:8080/v1 |
| --log-level | Logging level (DEBUG/INFO/WARNING/ERROR) | INFO |
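
The options above map naturally onto an argparse parser. The following is a minimal sketch of how such a parser could be declared, not the actual code in main.py; the function name build_parser and the help strings are assumptions made for illustration, while the flag names and defaults are taken from the table.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the options table; main.py may differ."""
    parser = argparse.ArgumentParser(
        description="Extract text and images from PDF documents"
    )
    parser.add_argument("input_path", help="Path to PDF file or directory")
    parser.add_argument("-o", "--output", default="./output", help="Output directory")
    parser.add_argument("--dpi", type=int, default=300,
                        help="DPI for PDF to image conversion")
    # store_true flags default to False, matching the table
    parser.add_argument("--no-embedded", action="store_true",
                        help="Skip extraction of embedded images")
    parser.add_argument("--use-cpu", action="store_true",
                        help="Use CPU instead of GPU")
    parser.add_argument("--use-vllm", action="store_true",
                        help="Use vLLM server for acceleration")
    parser.add_argument("--vllm-url", default="http://127.0.0.1:8080/v1",
                        help="vLLM server URL")
    parser.add_argument("--log-level", default="INFO",
                        choices=["DEBUG", "INFO", "WARNING", "ERROR"],
                        help="Logging level")
    return parser


# Parse an explicit argument list so the example runs without a real command line
args = build_parser().parse_args(["document.pdf", "--dpi", "600", "--no-embedded"])
print(args.input_path, args.dpi, args.no_embedded, args.output)
```

Note that argparse converts dashes in flag names to underscores, so --no-embedded is read back as args.no_embedded.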

Output Structure

For each processed PDF, the system creates:

output/
├── document_name/
│   ├── document_name_results.json     # Complete results in JSON
│   ├── document_name_results.md       # Human-readable Markdown report
│   ├── document_name_extracted_text.txt  # Plain text content
│   ├── document_name.csv             # Tabular data export
│   ├── document_name.xml             # XML structured export
│   ├── document_name.html            # HTML report
│   └── processed_images/             # Reference images
│       ├── document_name_page1.png
│       ├── document_name_page2.png
│       └── embedded_img1.jpg
└── batch_summary.json               # Summary for batch processing
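
Downstream scripts often need to locate these files without hard-coding paths. The sketch below derives the per-document paths from the layout shown above; the helper name expected_outputs and the dictionary keys are assumptions for illustration and are not part of this repository.

```python
from pathlib import Path
from typing import Dict


def expected_outputs(pdf_path: str, output_dir: str = "./output") -> Dict[str, Path]:
    """Hypothetical helper: build the per-document output paths from the tree above."""
    name = Path(pdf_path).stem            # "input_pdfs/report.pdf" -> "report"
    doc_dir = Path(output_dir) / name     # one subdirectory per document
    return {
        "json": doc_dir / f"{name}_results.json",
        "markdown": doc_dir / f"{name}_results.md",
        "text": doc_dir / f"{name}_extracted_text.txt",
        "csv": doc_dir / f"{name}.csv",
        "xml": doc_dir / f"{name}.xml",
        "html": doc_dir / f"{name}.html",
        "images": doc_dir / "processed_images",
    }


paths = expected_outputs("input_pdfs/report.pdf")
for kind, path in paths.items():
    print(f"{kind}: {path}")
```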

Python API Usage

Basic Usage

from src.pdf_processor import PDFProcessor
from src.paddleocr_processor import PaddleOCRProcessor
from main import PDFOCRPipeline

# Initialize the pipeline
pipeline = PDFOCRPipeline(use_gpu=True, dpi=300)

# Process a single PDF
result = pipeline.process_single_pdf(
    pdf_path="document.pdf",
    output_dir="output/",
    extract_embedded=True
)

print(f"Processing completed! Success rate: {result['statistics']['success_rate']:.1f}%")

Advanced Configuration

# Use vLLM server acceleration
pipeline = PDFOCRPipeline(
    use_gpu=True,
    use_vllm=True,
    vl_rec_server_url="http://localhost:8080/v1",
    dpi=600
)

# Process multiple PDFs
pdf_files = ["doc1.pdf", "doc2.pdf", "doc3.pdf"]
results = pipeline.process_multiple_pdfs(pdf_files, "output/")

# Generate statistics
total_pages = sum(r['processing_summary']['total_pages'] for r in results)
print(f"Processed {total_pages} pages across {len(pdf_files)} documents")
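
The file list above is hard-coded. When the PDFs live in one directory, the list can be built programmatically; the following is a small sketch, where collect_pdfs is a hypothetical helper and not a function provided by this repository.

```python
from pathlib import Path
from typing import List


def collect_pdfs(directory: str) -> List[str]:
    """Hypothetical helper: return the PDF files directly inside a directory, sorted."""
    root = Path(directory)
    return sorted(
        str(p) for p in root.iterdir()
        # Match the extension case-insensitively so "scan.PDF" is included
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


# Possible use with the pipeline shown above (left as a comment because it needs the model):
# results = pipeline.process_multiple_pdfs(collect_pdfs("input_pdfs/"), "output/")
```

Sorting keeps the processing order stable between runs, which makes batch results easier to compare.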

Custom Output Formatting

from src.output_formatter import OutputFormatter

# Save in all available formats
OutputFormatter.save_all_formats(result, "output/", "my_document")

# Save specific format
OutputFormatter.to_csv(result, "output/document.csv")
OutputFormatter.to_html(result, "output/document.html")

Performance Optimization

GPU Acceleration

Confirm that the GPU is visible to both the NVIDIA driver and PaddlePaddle:

nvidia-smi  # Check GPU availability
python -c "import paddle; print(paddle.device.cuda.device_count())"
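
The same check can be done from Python to choose between GPU and CPU at startup. This is a sketch under the assumption that falling back to CPU is acceptable when no device is found; detect_gpu_count is a hypothetical helper, and only paddle.device.cuda.device_count() comes from the command above.

```python
def detect_gpu_count() -> int:
    """Hypothetical helper: number of CUDA devices PaddlePaddle can see, 0 if unavailable."""
    try:
        import paddle  # imported lazily so the check also works when Paddle is missing
    except ImportError:
        return 0
    return int(paddle.device.cuda.device_count())


use_gpu = detect_gpu_count() > 0
print("GPU mode" if use_gpu else "CPU mode")

# The flag could then be passed to the pipeline, for example:
# pipeline = PDFOCRPipeline(use_gpu=use_gpu, dpi=300)
```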

vLLM Server Acceleration

For maximum performance, use the vLLM server:

# Start the vLLM server (requires Docker)
docker run \
    --rm \
    --gpus all \
    --network host \
    ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddlex-genai-vllm-server

# Use with pipeline
python main.py document.pdf --use-vllm

Memory Optimization

For large PDFs or limited memory:

  • Reduce DPI: --dpi 150 (faster, lower quality)
  • Process in smaller batches
  • Use CPU mode when GPU memory is the limiting factor: --use-cpu (slower)
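
To see why reducing DPI helps, note that pixel count grows with the square of the DPI. The sketch below estimates the size of one rendered page, assuming a US Letter page (8.5 x 11 inches) held as an uncompressed 8-bit RGB raster; actual memory use in the pipeline depends on how pages are rendered and batched, so treat the numbers as rough guidance only.

```python
from typing import Tuple


def page_pixels(width_in: float, height_in: float, dpi: int) -> Tuple[int, int]:
    """Pixel dimensions of a page rendered at the given DPI."""
    return round(width_in * dpi), round(height_in * dpi)


def rgb_megabytes(width_in: float, height_in: float, dpi: int) -> float:
    """Approximate size of an uncompressed 8-bit RGB raster (3 bytes per pixel), in MB."""
    width_px, height_px = page_pixels(width_in, height_in, dpi)
    return width_px * height_px * 3 / 1e6


for dpi in (150, 300, 600):
    w, h = page_pixels(8.5, 11, dpi)
    print(f"{dpi} DPI: {w} x {h} px, about {rgb_megabytes(8.5, 11, dpi):.1f} MB per page")
```

Under these assumptions, halving the DPI from 300 to 150 cuts the per-page raster to a quarter of its size, while going to 600 quadruples it.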

Supported Languages

PaddleOCR-VL supports 109 languages including:

  • Latin scripts: English, French, German, Spanish, Italian, Portuguese, etc.
  • Chinese: Simplified and Traditional Chinese
  • East Asian: Japanese, Korean
  • Cyrillic: Russian, Bulgarian, Serbian, etc.
  • Arabic scripts: Arabic, Persian, Urdu
  • Indic scripts: Hindi, Bengali, Tamil, Telugu, etc.
  • Southeast Asian: Thai, Vietnamese, Myanmar, etc.

Troubleshooting

Common Issues

  1. CUDA out of memory: fall back to CPU processing.

    python main.py document.pdf --use-cpu

  2. Poppler not found: install the system package (see Install System Dependencies for macOS and Windows).

    # Ubuntu/Debian
    sudo apt-get install poppler-utils

  3. PaddleOCR installation issues: remove the existing packages, including the GPU build, and reinstall.

    pip uninstall paddleocr paddlepaddle paddlepaddle-gpu
    pip install paddlepaddle-gpu==3.2.0 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
    pip install -U "paddleocr[doc-parser]"

  4. Low accuracy results

    • Increase DPI: --dpi 600
    • Ensure good image quality
    • Check if the language is supported
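
For the "Poppler not found" case, it can help to confirm whether the Poppler command-line tools are on PATH before reinstalling anything. The sketch below assumes the PDF-to-image step relies on Poppler's pdftoppm executable (which poppler-utils provides); poppler_available is a hypothetical helper, not part of this repository.

```python
import shutil


def poppler_available() -> bool:
    """Hypothetical check: True if Poppler's pdftoppm executable is found on PATH."""
    return shutil.which("pdftoppm") is not None


if poppler_available():
    print("Poppler found:", shutil.which("pdftoppm"))
else:
    print("Poppler not found on PATH; install poppler-utils (Linux) or poppler (Homebrew)")
```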

Performance Issues

  • Slow processing: Enable GPU acceleration or use the vLLM server (--use-vllm)
  • Memory issues: Lower --dpi, process fewer PDFs per run, or switch to CPU mode (--use-cpu)
  • Quality issues: Increase --dpi or check image preprocessing

Benchmarks

Performance on sample documents (RTX 4090, 300 DPI):

| Document Type | Pages | Processing Time | Accuracy |
|---|---|---|---|
| Simple Text | 10 | 45s | 99.2% |
| Mixed Content | 5 | 38s | 97.8% |
| Table Heavy | 8 | 67s | 96.5% |
| Formula Rich | 6 | 52s | 98.1% |
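
Because the sample documents differ in page count, total time is easier to compare once normalized per page. The short sketch below does that arithmetic using only the figures from the table; the function name seconds_per_page is an illustrative choice.

```python
from typing import Dict, Tuple

# (pages, processing time in seconds), copied from the benchmark table
BENCHMARKS: Dict[str, Tuple[int, int]] = {
    "Simple Text": (10, 45),
    "Mixed Content": (5, 38),
    "Table Heavy": (8, 67),
    "Formula Rich": (6, 52),
}


def seconds_per_page(benchmarks: Dict[str, Tuple[int, int]]) -> Dict[str, float]:
    """Divide each document's processing time by its page count."""
    return {name: seconds / pages for name, (pages, seconds) in benchmarks.items()}


for name, rate in seconds_per_page(BENCHMARKS).items():
    print(f"{name}: {rate:.1f} s/page")
```

On these figures, simple text pages are the cheapest at 4.5 seconds per page, while mixed, table-heavy, and formula-rich pages take roughly 7.6 to 8.7 seconds per page.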

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new functionality
  4. Submit a pull request

License

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.

Acknowledgments

Citation

If you use PaddleOCR-VL in your research, please cite:

@misc{cui2025paddleocrvlboostingmultilingualdocument,
      title={PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model}, 
      author={Cheng Cui and Ting Sun and Suyin Liang and Tingquan Gao and Zelun Zhang and Jiaxuan Liu and Xueqing Wang and Changda Zhou and Hongen Liu and Manhui Lin and Yue Zhang and Yubo Zhang and Handong Zheng and Jing Zhang and Jun Zhang and Yi Liu and Dianhai Yu and Yanjun Ma},
      year={2025},
      eprint={2510.14528},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.14528}, 
}

For more information and updates, visit: https://github.com/PaddlePaddle/PaddleOCR

About

An OCR system optimized for reading two column Scripture images in Indic languages
