Title: [BUG] Resolve PDF Text Extraction Failure on Scanned/Image-Only Documents
Is your feature request related to a problem? Please describe.
When users upload scanned PDFs (e.g., scanned invoices or printed documents saved as PDF images), PyMuPDF extracts 0 characters of text. As a result, the ingestion task creates empty text chunks, and the document cannot be searched or queried.
Describe the solution you'd like
Implement an automated OCR fallback:
- Integrate an OCR library (e.g., pytesseract or easyocr) into the parser service.
- In backend/app/rag/chunker.py, check if the total extracted text character count is below a threshold (e.g. 50 characters).
- If below threshold, convert the PDF pages to images and run OCR to extract text before chunking.
- Ensure appropriate system dependencies (like tesseract-ocr) are documented or added to the Dockerfile.
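The steps above could be sketched roughly as follows. This is a minimal sketch, not the project's actual code: it assumes pytesseract (with Pillow) is the chosen OCR library and that the tesseract-ocr binary is installed. The names `needs_ocr`, `extract_pdf_text`, and `MIN_TEXT_CHARS` are hypothetical and do not exist in chunker.py today.

```python
import io
from typing import Iterable, List

# Threshold suggested in this issue; hypothetical constant, tune as needed.
MIN_TEXT_CHARS = 50


def needs_ocr(page_texts: Iterable[str], min_chars: int = MIN_TEXT_CHARS) -> bool:
    """Return True when the extracted text layer is too short to be usable."""
    total = sum(len(text.strip()) for text in page_texts)
    return total < min_chars


def extract_pdf_text(pdf_path: str, dpi: int = 300) -> str:
    """Extract text with PyMuPDF, falling back to Tesseract OCR if needed."""
    # Imported lazily so needs_ocr() stays usable without the PDF/OCR packages.
    import fitz  # PyMuPDF

    with fitz.open(pdf_path) as doc:
        page_texts: List[str] = [page.get_text() for page in doc]
        if not needs_ocr(page_texts):
            return "\n".join(page_texts)

        # Assumption: pytesseract and Pillow are added to requirements.txt,
        # and the tesseract-ocr system package is available in the image.
        import pytesseract
        from PIL import Image

        ocr_texts: List[str] = []
        for page in doc:
            # Render the page to a PNG in memory, then run OCR on it.
            pix = page.get_pixmap(dpi=dpi)
            image = Image.open(io.BytesIO(pix.tobytes("png")))
            ocr_texts.append(pytesseract.image_to_string(image))
        return "\n".join(ocr_texts)
```

Keeping the threshold check in its own function would make it easy to unit test without a real PDF. Comparing the total across all pages, as the issue proposes, means a mixed document with a few scanned pages would not trigger OCR; a per-page check is a possible refinement.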
Describe alternatives you've considered
Rejecting scanned files at upload time. This was ruled out because users expect all PDF documents to work seamlessly in a document analysis platform.
Additional Context
- GSSoC '26: Yes, I am participating in GirlScript Summer of Code and would like to build this.
- Level: Critical
- Affected Files: backend/app/rag/chunker.py, Dockerfile, backend/requirements.txt