A powerful Python tool for extracting highlighted text from PDF documents while preserving formatting information such as headers, bold text, and italics.
- Extracts text from highlighted areas in PDF documents
- Preserves text formatting (headers, bold, italic)
- Outputs formatted text in Markdown or HTML
- Detects and preserves hierarchical structure of documents
- Command-line interface for easy integration into workflows
- Intelligent paragraph and formatting detection
- Content research and collection
- Academic paper review and note-taking
- Legal document analysis and extraction
- Knowledge management systems
- Content migration to CMSs
- Python 3.7+
- Dependencies:
  - PyMuPDF (fitz) - For PDF text extraction and annotation handling
  - pypdfium2 - For PDF rendering
  - OpenCV (cv2) - For image processing and highlight detection
  - NumPy - For array operations
  - Pillow (PIL) - For image handling
# Clone the repository
git clone https://github.com/No0Bitah/PDF-Highlight-Extractor.git
cd PDF-Highlight-Extractor
# Install dependencies
python main.py --install

Alternatively, you can install dependencies manually:

pip install PyMuPDF pypdfium2 numpy opencv-python pillow

python main.py --input sample.pdf --format markdown

This will process sample.pdf and save the extracted highlighted text to sample.txt in Markdown format.
# Install dependencies
python main.py --install
# Process a PDF file with default settings (markdown output)
python main.py --input document.pdf
# Process a PDF file and specify output file
python main.py --input document.pdf --output extracted_highlights.md
# Generate HTML output
python main.py --input document.pdf --format html --output extracted_highlights.html

You can also use the PDFHighlightExtractor class directly in your Python code:
from pdf_extractor import PDFHighlightExtractor
# Initialize the extractor
extractor = PDFHighlightExtractor("document.pdf")
# Run the full pipeline
formatted_text = extractor.extract_and_format(output_path="output.md", output_format="markdown")
# Or run individual steps
extractor.detect_highlights()
extractor.extract_text_from_highlights()
formatted_text = extractor.format_output(output_format="markdown")

- Currently, the tool only detects yellow highlights (RGB: 255, 255, 0). Other highlight colors are not supported yet.
- The highlight detection works best on clean, well-scanned PDFs. Poor quality scans may affect detection accuracy.
- Header detection is based on font size heuristics and may not be perfect for all PDF documents.
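To illustrate why a font-size heuristic can misclassify headers, here is a minimal sketch of one such heuristic. It is not the tool's actual implementation: the function name `classify_spans` and the ratio thresholds are assumptions made for illustration. The sketch treats the most common font size as body text and labels noticeably larger spans as headers.

```python
from collections import Counter


def classify_spans(spans, h1_ratio=1.6, h2_ratio=1.3):
    """Label each (text, font_size) span as 'h1', 'h2', or 'body'.

    Hypothetical heuristic: the most frequent font size is assumed to be
    the body size, and spans larger than it by the given ratios are
    treated as headers. The thresholds are illustrative, not tuned.
    """
    if not spans:
        return []
    # Round sizes so that 11.0 and 11.04 count as the same body size.
    sizes = Counter(round(size, 1) for _, size in spans)
    body_size = sizes.most_common(1)[0][0]

    labelled = []
    for text, size in spans:
        ratio = size / body_size
        if ratio >= h1_ratio:
            level = "h1"
        elif ratio >= h2_ratio:
            level = "h2"
        else:
            level = "body"
        labelled.append((level, text))
    return labelled


spans = [
    ("Introduction", 20.0),
    ("First paragraph.", 11.0),
    ("Details", 15.0),
    ("More text.", 11.0),
    ("Even more text.", 11.0),
]
labels = classify_spans(spans)
```

A document whose headers use the same size as body text (for example, bold-only headers) would be labelled entirely as `body` by this sketch, which is the kind of case the limitation above refers to.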
- Support for multiple highlight colors with color-coding in output
- Improved header and structure detection
- Option to extract annotations and comments
- Support for PDF forms and fillable fields
- Better handling of complex layouts (multi-column, mixed orientations)
- CMS integration capabilities for direct publishing to content management systems
- Web interface/API for remote processing
- OCR integration for scanned documents
- Batch processing for multiple PDFs
The tool uses a combination of image processing techniques (with OpenCV) and PDF parsing (with PyMuPDF) to:
- Detect highlighted areas by color analysis
- Extract text from those areas using PDF parsing libraries
- Preserve formatting information from the original document
- Reconstruct the logical structure of the highlighted content
- Output in the desired format (Markdown or HTML)
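The step that links image-based detection to PDF text extraction can be sketched as follows. This is an assumption about how the two stages might be joined, not the tool's confirmed code: highlight boxes found on a rendered page image are scaled from pixels to PDF points (1/72 inch), and a word is kept when enough of its box falls inside a highlight. The helper names, the 144 DPI default, and the 0.5 overlap threshold are all hypothetical. The word tuples follow the `(x0, y0, x1, y1, text)` ordering that PyMuPDF's `page.get_text("words")` uses for its first five fields.

```python
def pixel_box_to_points(box, dpi):
    """Convert an (x0, y0, x1, y1) box from image pixels to PDF points."""
    scale = 72.0 / dpi  # PDF points are 1/72 inch
    return tuple(v * scale for v in box)


def overlap_fraction(word_box, region):
    """Return the fraction of word_box's area that lies inside region."""
    wx0, wy0, wx1, wy1 = word_box
    rx0, ry0, rx1, ry1 = region
    width = min(wx1, rx1) - max(wx0, rx0)
    height = min(wy1, ry1) - max(wy0, ry0)
    if width <= 0 or height <= 0:
        return 0.0
    area = (wx1 - wx0) * (wy1 - wy0)
    return (width * height) / area if area > 0 else 0.0


def words_in_regions(words, pixel_regions, dpi=144, min_overlap=0.5):
    """Keep words whose boxes are mostly covered by a highlight region.

    Assumes the page was rendered without rotation or cropping, so image
    and PDF coordinates share a top-left origin.
    """
    regions = [pixel_box_to_points(r, dpi) for r in pixel_regions]
    return [
        text
        for (x0, y0, x1, y1, text) in words
        if any(overlap_fraction((x0, y0, x1, y1), r) >= min_overlap for r in regions)
    ]


# Word boxes in PDF points; one highlight box in pixels at 144 DPI.
words = [
    (72, 100, 110, 112, "highlighted"),
    (115, 100, 150, 112, "text"),
    (72, 120, 110, 132, "plain"),
]
pixel_regions = [(140, 196, 304, 228)]
selected = words_in_regions(words, pixel_regions)
```

Using an overlap fraction rather than strict containment tolerates highlight boxes that are slightly smaller than the text they cover.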
- No highlights detected: Try adjusting the `tolerance` parameter for color detection
- Missing formatting: Some PDFs don't store formatting as expected; manual adjustments may be needed
- Performance issues with large PDFs: Process page ranges instead of the entire document
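To show why raising the tolerance can help when no highlights are found, here is a small sketch of a per-channel color match against pure yellow. How the tool applies its `tolerance` parameter internally is not documented here, so the function names and the per-channel comparison below are assumptions for illustration only.

```python
YELLOW = (255, 255, 0)  # highlight color stated in the limitations


def is_highlight(pixel, target=YELLOW, tolerance=40):
    """True if every RGB channel of pixel is within tolerance of target.

    Hypothetical matching rule; the real tool may compare colors
    differently (for example, in HSV space).
    """
    return all(abs(p - t) <= tolerance for p, t in zip(pixel, target))


def highlight_ratio(pixels, tolerance=40):
    """Fraction of pixels that match the highlight color."""
    if not pixels:
        return 0.0
    hits = sum(1 for p in pixels if is_highlight(p, tolerance=tolerance))
    return hits / len(pixels)


# A pure yellow, a washed-out yellow as seen in scans, white, and black.
samples = [(255, 255, 0), (250, 240, 60), (255, 255, 255), (0, 0, 0)]
strict = highlight_ratio(samples, tolerance=40)
loose = highlight_ratio(samples, tolerance=70)
```

With the stricter setting only the pure yellow pixel matches; the looser setting also accepts the washed-out yellow. Raising the tolerance too far risks matching cream or off-white page backgrounds.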
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.
- 🔗 No0Bitah
- 📧 Contact me