
gzmerel/OCR_PDF_TXT_extractor


OCR_PDF_TXT_extractor

A simple, user-friendly Python desktop app that extracts text from PDF files, whether they contain selectable text or scanned images. It tries built-in PDF parsing first and uses OCR (Optical Character Recognition) as a fallback.

Features

- Easy-to-use graphical interface (Tkinter)
- Extracts text from standard, selectable PDFs
- Automatically uses OCR for scanned/image-based PDFs
- Saves extracted text to .txt files
- Progress bar and file status updates
- Works on Windows (requires Tesseract and Poppler)

Requirements

- Python 3.x
- PyPDF2
- pdf2image
- pytesseract
- Pillow
- Poppler (for Windows)
- Tesseract OCR (set path in code)

Installation

1. Install the Python dependencies:

    pip install PyPDF2 pdf2image pytesseract Pillow

2. Install Tesseract OCR: download and install it, then update the path in the script if Tesseract is not in your PATH:

    pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

3. Install Poppler for Windows: download it and update the script's poppler_path accordingly.
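The Tesseract path setup described above could be made less brittle by checking PATH before falling back to a hard-coded location. The following is a minimal sketch, not the script's actual code: `resolve_tesseract_cmd` is a hypothetical helper, and the fallback is simply the default Windows install location quoted in this README.

```python
import os
import shutil

# Default Windows install location quoted in the installation instructions.
DEFAULT_TESSERACT = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def resolve_tesseract_cmd(fallback=DEFAULT_TESSERACT):
    """Return a usable Tesseract executable path, or None if none is found.

    Hypothetical helper: prefers an executable on PATH, then the fallback.
    """
    found = shutil.which("tesseract")
    if found:
        return found
    if os.path.isfile(fallback):
        return fallback
    return None


if __name__ == "__main__":
    cmd = resolve_tesseract_cmd()
    if cmd is None:
        print("Tesseract not found; install it or edit the fallback path.")
    else:
        # In the real script, this value would be assigned to
        # pytesseract.pytesseract.tesseract_cmd before any OCR call.
        print("Using Tesseract at:", cmd)
```

Returning None instead of raising lets a GUI show a friendly message rather than crash at startup.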

Usage

1. Run the script:

    python OCR_PDF_TXT_extractor.py

2. Click "Browse PDF" to select a PDF file.
3. The app tries to extract text directly. If the PDF is image-based, it automatically uses OCR.
4. Review and edit the extracted text as needed.
5. Click "Save As" to save the output as a .txt file.
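The direct-extraction-then-OCR behaviour described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the app's actual code: the function names are hypothetical, "image-based" is assumed to mean "direct extraction returned only whitespace", and `PdfReader` assumes PyPDF2 2.0 or later.

```python
def direct_text(pdf_path):
    """Read the embedded text layer (assumes PyPDF2 >= 2.0 for PdfReader)."""
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    # extract_text() may return an empty value for pages without a text layer.
    return "\n".join((page.extract_text() or "") for page in reader.pages)


def ocr_text(pdf_path, poppler_path=None):
    """Render each page to an image with Poppler, then OCR it with Tesseract."""
    from pdf2image import convert_from_path
    import pytesseract

    images = convert_from_path(pdf_path, poppler_path=poppler_path)
    return "\n".join(pytesseract.image_to_string(img) for img in images)


def extract_text(pdf_path, direct=direct_text, ocr=ocr_text):
    """Return (text, method), where method is "direct" or "ocr".

    Assumption: a PDF counts as image-based when direct extraction
    yields nothing but whitespace.
    """
    text = direct(pdf_path)
    if text.strip():
        return text, "direct"
    return ocr(pdf_path), "ocr"
```

Passing the two extractors as parameters keeps the fallback decision testable without Tesseract or Poppler installed, for example `extract_text("scan.pdf", direct=lambda p: "", ocr=lambda p: "recognized")`.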

Notes

For large or image-heavy PDFs, OCR may take noticeably longer than direct text extraction.
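One way to keep long OCR runs manageable is to render and recognize a few pages at a time, reporting progress after each batch. The sketch below is an assumption about how this could be done, not a description of the app: `page_batches` and `ocr_in_batches` are hypothetical names, and the page count is expected from the caller (for instance from `len(PdfReader(path).pages)`).

```python
def page_batches(total_pages, batch_size=10):
    """Yield inclusive, 1-based (first_page, last_page) ranges."""
    if total_pages < 0 or batch_size < 1:
        raise ValueError("total_pages must be >= 0 and batch_size >= 1")
    for first in range(1, total_pages + 1, batch_size):
        yield first, min(first + batch_size - 1, total_pages)


def ocr_in_batches(pdf_path, total_pages, on_progress=print,
                   batch_size=10, poppler_path=None):
    """OCR a PDF batch by batch so only a few page images are in memory."""
    from pdf2image import convert_from_path
    import pytesseract

    chunks = []
    for first, last in page_batches(total_pages, batch_size):
        # pdf2image uses 1-based, inclusive first_page/last_page arguments.
        images = convert_from_path(
            pdf_path, first_page=first, last_page=last,
            poppler_path=poppler_path,
        )
        chunks.extend(pytesseract.image_to_string(img) for img in images)
        # Fraction of pages completed, e.g. to drive a progress bar.
        on_progress(last / total_pages)
    return "\n".join(chunks)
```

For a 25-page file with the default batch size, `page_batches` yields the ranges (1, 10), (11, 20), and (21, 25).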

This app is intended for Windows; minor edits are needed for macOS/Linux (adjust the Tesseract and Poppler paths).
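The path adjustment for other platforms might look like the sketch below. It is illustrative only: `ocr_tool_paths` is a hypothetical helper, the Poppler directory is a placeholder that depends on where you unpacked it, and on macOS/Linux both tools are assumed to be installed on PATH by a package manager.

```python
import platform

# The Tesseract path is the one quoted in this README. The Poppler
# directory is a hypothetical placeholder; adjust it to your install.
WINDOWS_TESSERACT = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
WINDOWS_POPPLER = r"C:\poppler\Library\bin"


def ocr_tool_paths(system=None):
    """Return tool locations for the given OS name (defaults to this machine)."""
    system = system or platform.system()
    if system == "Windows":
        return {
            "tesseract_cmd": WINDOWS_TESSERACT,
            "poppler_path": WINDOWS_POPPLER,
        }
    # macOS/Linux: "tesseract" is resolved via PATH, and passing
    # poppler_path=None makes pdf2image look for Poppler on PATH too.
    return {"tesseract_cmd": "tesseract", "poppler_path": None}


if __name__ == "__main__":
    print(ocr_tool_paths())
```

The returned values would then be used for `pytesseract.pytesseract.tesseract_cmd` and for the `poppler_path` argument of `convert_from_path`.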

License

MIT License

