[Task] Resolve PDF Text Extraction Failure on Scanned/Image-Only Documents #566

@knoxiboy

Title: [BUG] Resolve PDF Text Extraction Failure on Scanned/Image-Only Documents

Is your feature request related to a problem? Please describe.

When users upload scanned PDFs (e.g., scanned invoices or printed documents saved as PDF images), PyMuPDF extracts 0 characters of text because the pages contain only images and no text layer. As a result, the ingestion task creates empty text chunks, and the document cannot be searched or queried.

Describe the solution you'd like

Implement automated OCR fallback:

  1. Integrate an OCR library (e.g., pytesseract or easyocr) into the parser service.
  2. In backend/app/rag/chunker.py, check whether the total extracted text character count is below a threshold (e.g., 50 characters).
  3. If it is below the threshold, convert the PDF pages to images and run OCR to extract text before chunking.
  4. Ensure appropriate system dependencies (like tesseract-ocr) are documented or added to the Dockerfile.
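
The steps above could be sketched roughly as follows. This is only an illustration, assuming pytesseract as the OCR library and PyMuPDF for rendering. The names `needs_ocr`, `extract_native`, `extract_ocr`, `extract_text`, and `MIN_TEXT_CHARS` are hypothetical and do not come from the existing chunker.py. Counting only non-whitespace characters is also an assumption, made so that pages containing nothing but blank lines still trigger the fallback.

```python
from typing import Callable

MIN_TEXT_CHARS = 50  # threshold suggested in this issue; tune against real documents


def needs_ocr(text: str, min_chars: int = MIN_TEXT_CHARS) -> bool:
    """Return True when the native text layer is too short to be usable."""
    # Count only non-whitespace characters (an assumption of this sketch).
    return sum(1 for ch in text if not ch.isspace()) < min_chars


def extract_native(pdf_path: str) -> str:
    """Read the embedded text layer with PyMuPDF."""
    import fitz  # PyMuPDF, imported lazily so the pure helpers need no extras

    with fitz.open(pdf_path) as doc:
        return "\n".join(page.get_text() for page in doc)


def extract_ocr(pdf_path: str, dpi: int = 300) -> str:
    """Render each page to an image and run Tesseract on it."""
    import io

    import fitz
    import pytesseract  # needs the tesseract-ocr system package installed
    from PIL import Image

    pages = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            pix = page.get_pixmap(dpi=dpi)  # rasterize the page
            image = Image.open(io.BytesIO(pix.tobytes("png")))
            pages.append(pytesseract.image_to_string(image))
    return "\n".join(pages)


def extract_text(
    pdf_path: str,
    native: Callable[[str], str] = extract_native,
    ocr: Callable[[str], str] = extract_ocr,
    min_chars: int = MIN_TEXT_CHARS,
) -> str:
    """Try the text layer first and fall back to OCR when it is nearly empty."""
    text = native(pdf_path)
    if needs_ocr(text, min_chars):
        ocr_text = ocr(pdf_path)
        # Keep the OCR result only if it recovered more text than the text layer.
        if len(ocr_text.strip()) > len(text.strip()):
            return ocr_text
    return text
```

Passing the extractors in as callables keeps the threshold logic testable without Tesseract installed. The chunker would call something like `extract_text(path)` before splitting the result into chunks.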

Describe alternatives you've considered

Rejecting scanned files outright. However, users expect every PDF document to work seamlessly in a document analysis platform.

Additional Context

  • GSSoC '26: Yes, I am participating in GirlScript Summer of Code and would like to build this.
  • Level: Critical
  • Affected Files: backend/app/rag/chunker.py, Dockerfile, backend/requirements.txt
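
Because the fix touches the Dockerfile as well as the Python requirements, it may help to fail loudly when the OCR binary is missing from the image. A minimal sketch, assuming the Tesseract executable is named `tesseract` and is expected on PATH; the helper name `ocr_available` is hypothetical:

```python
import shutil


def ocr_available(binary: str = "tesseract") -> bool:
    """Return True if the OCR executable can be found on PATH."""
    return shutil.which(binary) is not None
```

The parser could call this once at startup and log a warning, or skip the OCR fallback, when it returns False.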

Metadata

Assignees: No one assigned
Labels: gssoc (GirlScript Summer of Code 2026 issue/PR)
