OCR_PDF_TXT_extractor

A simple, user-friendly Python desktop app that extracts text from PDF files, whether they contain selectable text or scanned images. It first tries built-in PDF parsing and falls back to OCR (Optical Character Recognition) for image-based PDFs.

Features
- Easy-to-use graphical interface (Tkinter)
- Extracts text from standard, selectable PDFs
- Automatically uses OCR for scanned/image-based PDFs
- Saves extracted text to .txt files
- Progress bar and file status updates
- Works on Windows (requires Tesseract and Poppler)

Requirements
- Python 3.x
- PyPDF2
- pdf2image
- pytesseract
- Pillow
- Poppler (for Windows)
- Tesseract OCR (set path in code)

Installation
1. Install Python dependencies:
    pip install PyPDF2 pdf2image pytesseract Pillow
2. Install Tesseract OCR: Download and install from here. Update the path in the script if Tesseract is not in your PATH:
    pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
3. Install Poppler for Windows: Download from here and update the script's poppler_path accordingly.
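
The Tesseract path step above can be made less brittle by checking PATH first and only then falling back to the hard-coded location. The following is a minimal sketch, not the script's actual code: `find_tesseract` and `configure_tesseract` are hypothetical helper names, and the default path is simply the one quoted in the step above.

```python
import os
import shutil

# Default Windows install location quoted in the installation step above.
DEFAULT_TESSERACT = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def find_tesseract(configured=DEFAULT_TESSERACT, which=shutil.which):
    """Return a Tesseract executable path, or None if none is found.

    Prefers a `tesseract` found on PATH and falls back to `configured`.
    `which` is injectable only so the lookup is easy to test.
    """
    on_path = which("tesseract")
    if on_path:
        return on_path
    if configured and os.path.isfile(configured):
        return configured
    return None


def configure_tesseract(configured=DEFAULT_TESSERACT):
    """Point pytesseract at the resolved executable (hypothetical helper)."""
    import pytesseract  # imported lazily; needs `pip install pytesseract`

    cmd = find_tesseract(configured)
    if cmd is None:
        raise FileNotFoundError(
            "Tesseract not found on PATH or at " + repr(configured)
        )
    pytesseract.pytesseract.tesseract_cmd = cmd
    return cmd
```

Calling `configure_tesseract()` once at startup would replace the manual path edit when Tesseract is already on PATH; the Poppler location still has to be passed separately as `poppler_path`.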

Usage
1. Run the script:
    python OCR_PDF_TXT_extractor.py
2. Click "Browse PDF" to select a PDF file.
3. The app tries to extract text directly. If the PDF is image-based, it automatically uses OCR.
4. Review and edit the extracted text as needed.
5. Click "Save As" to save the output as a .txt file.
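
The "extract directly, then fall back to OCR" behaviour described above could look roughly like the sketch below. It is an illustration under assumptions rather than the app's own implementation: the function names and the `min_chars` threshold are hypothetical, and `PdfReader` assumes PyPDF2 2.0 or newer.

```python
def needs_ocr(text, min_chars=20):
    """Treat a PDF as image-based when direct extraction yields almost no text.

    The 20-character threshold is an arbitrary assumption for this sketch.
    """
    return len(text.strip()) < min_chars


def extract_text(pdf_path, poppler_path=None, min_chars=20):
    """Extract text directly; fall back to OCR when little or nothing is found."""
    from PyPDF2 import PdfReader  # assumes PyPDF2 >= 2.0

    reader = PdfReader(pdf_path)
    # extract_text() can return None or "" for pages without a text layer.
    text = "\n".join((page.extract_text() or "") for page in reader.pages)

    if needs_ocr(text, min_chars):
        from pdf2image import convert_from_path
        import pytesseract

        # Render each page to an image (needs Poppler), then OCR each image.
        images = convert_from_path(pdf_path, poppler_path=poppler_path)
        text = "\n".join(pytesseract.image_to_string(img) for img in images)
    return text
```

Keeping the decision in a small `needs_ocr` function makes the fallback rule easy to adjust, for example for PDFs that mix a few text pages with many scanned ones.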

Notes
- For large PDFs or image-heavy files, OCR can take noticeably longer than direct text extraction.
- This app is intended for Windows; minor edits are needed for macOS/Linux (adjust the Tesseract and Poppler paths).
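
For the note about large PDFs, one way to keep memory use low and drive a progress bar is to OCR one page at a time instead of rendering the whole file at once. This is a sketch under assumptions, not the app's code: `ocr_pages` and `make_page_ocr` are hypothetical names, and `on_progress` stands in for whatever updates the Tkinter progress bar.

```python
def ocr_pages(page_count, ocr_page, on_progress=None):
    """OCR pages 1..page_count one by one, reporting progress after each.

    `ocr_page(number)` returns the text of one page;
    `on_progress(done, total)` is an optional callback.
    """
    texts = []
    for number in range(1, page_count + 1):
        texts.append(ocr_page(number))
        if on_progress is not None:
            on_progress(number, page_count)
    return "\n".join(texts)


def make_page_ocr(pdf_path, poppler_path=None):
    """Build an `ocr_page` function backed by pdf2image and pytesseract."""
    from pdf2image import convert_from_path
    import pytesseract

    def ocr_page(number):
        # Render only the requested page, so one image is held at a time.
        images = convert_from_path(
            pdf_path,
            first_page=number,
            last_page=number,
            poppler_path=poppler_path,
        )
        return "\n".join(pytesseract.image_to_string(img) for img in images)

    return ocr_page
```

The page count can come from `len(PdfReader(pdf_path).pages)` in PyPDF2, and the result is then `ocr_pages(count, make_page_ocr(pdf_path), on_progress=update_bar)`, where `update_bar` is a placeholder for the GUI callback.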

License
MIT License