PDF Highlight Extractor

A Python tool that extracts highlighted text from PDF documents while preserving formatting information such as headers, bold text, and italics.

Features

Extracts text from highlighted areas in PDF documents
Preserves text formatting (headers, bold, italic)
Outputs formatted text in Markdown or HTML
Detects and preserves hierarchical structure of documents
Command-line interface for easy integration into workflows
Intelligent paragraph and formatting detection

Use Cases

Content research and collection
Academic paper review and note-taking
Legal document analysis and extraction
Knowledge management systems
Content migration to CMSs

Requirements

Python 3.7+
Dependencies:
- PyMuPDF (fitz) - For PDF text extraction and annotation handling
- pypdfium2 - For PDF rendering
- OpenCV (cv2) - For image processing and highlight detection
- NumPy - For array operations
- Pillow (PIL) - For image handling

Installation

# Clone the repository
git clone https://github.com/No0Bitah/PDF-Highlight-Extractor.git
cd PDF-Highlight-Extractor

# Install dependencies
python main.py --install

Alternatively, you can install dependencies manually:

pip install PyMuPDF pypdfium2 numpy opencv-python pillow

Usage

Basic Usage

python main.py --input sample.pdf --format markdown

This processes sample.pdf and saves the extracted highlighted text, formatted as Markdown, to sample.txt.

Command Line Arguments

# Install dependencies
python main.py --install

# Process a PDF file with default settings (markdown output)
python main.py --input document.pdf

# Process a PDF file and specify output file
python main.py --input document.pdf --output extracted_highlights.md

# Generate HTML output
python main.py --input document.pdf --format html --output extracted_highlights.html

Using as a Library

You can also use the PDFHighlightExtractor class directly in your Python code:

from pdf_extractor import PDFHighlightExtractor

# Initialize the extractor
extractor = PDFHighlightExtractor("document.pdf")

# Run the full pipeline
formatted_text = extractor.extract_and_format(output_path="output.md", output_format="markdown")

# Or run individual steps
extractor.detect_highlights()
extractor.extract_text_from_highlights()
formatted_text = extractor.format_output(output_format="markdown")

Limitations

Currently, the tool only detects yellow highlights (RGB: 255, 255, 0). Other highlight colors are not supported yet.
The highlight detection works best on clean, well-scanned PDFs. Poor quality scans may affect detection accuracy.
Header detection is based on font size heuristics and may not be perfect for all PDF documents.
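
To illustrate why a font-size heuristic can misfire, here is a minimal sketch of one such heuristic. It is not the tool's actual implementation: the function names `guess_heading_levels` and `to_markdown`, the `ratio` threshold, and the `(text, font_size)` input shape are all assumptions made for this example.

```python
from collections import Counter


def guess_heading_levels(spans, ratio=1.15):
    """Map each (text, font_size) span to a heading level (0 = body text).

    Assumption: the font size covering the most characters is body text,
    and any size at least `ratio` times larger is a heading. Larger sizes
    get lower level numbers (1 is the top level).
    """
    if not spans:
        return []

    # Weight each size by how many characters use it, so a few large
    # headings cannot outvote long runs of body text.
    size_weights = Counter()
    for text, size in spans:
        size_weights[round(size, 1)] += len(text)
    body_size = size_weights.most_common(1)[0][0]

    # Distinct heading sizes, largest first, become levels 1, 2, 3, ...
    heading_sizes = sorted(
        {round(size, 1) for _, size in spans if size >= body_size * ratio},
        reverse=True,
    )
    level_of = {size: i + 1 for i, size in enumerate(heading_sizes)}
    return [(text, level_of.get(round(size, 1), 0)) for text, size in spans]


def to_markdown(spans):
    """Render spans as Markdown lines, prefixing headings with '#' marks."""
    lines = []
    for text, level in guess_heading_levels(spans):
        prefix = "#" * level + " " if level else ""
        lines.append(prefix + text)
    return "\n".join(lines)


sample = [
    ("Introduction", 18.0),
    ("This is body text that is fairly long.", 11.0),
    ("Background", 14.0),
    ("More body text follows here.", 11.0),
]
print(to_markdown(sample))
```

A heuristic like this treats any large text as a heading, so pull quotes, captions, or documents that use bold body-size headings would be misclassified. That is the kind of case the limitation above refers to.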

Future Improvements

Support for multiple highlight colors with color-coding in output
Improved header and structure detection
Option to extract annotations and comments
Support for PDF forms and fillable fields
Better handling of complex layouts (multi-column, mixed orientations)
CMS integration capabilities for direct publishing to content management systems
Web interface/API for remote processing
OCR integration for scanned documents
Batch processing for multiple PDFs

How It Works

The tool uses a combination of image processing techniques (with OpenCV) and PDF parsing (with PyMuPDF) to:

Detect highlighted areas by color analysis
Extract text from those areas using PDF parsing libraries
Preserve formatting information from the original document
Reconstruct the logical structure of the highlighted content
Output in the desired format (Markdown or HTML)
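
The first step above, detecting highlights by color analysis, can be sketched without any dependencies. This is an illustration only, not the tool's code: the tool uses OpenCV on rendered page images, while the sketch below works on a plain list of RGB tuples. The names `is_highlight` and `highlight_bbox` and the default tolerance of 40 are assumptions.

```python
def is_highlight(pixel, target=(255, 255, 0), tolerance=40):
    """True if every RGB channel is within `tolerance` of the target color."""
    return all(abs(c - t) <= tolerance for c, t in zip(pixel, target))


def highlight_bbox(image, target=(255, 255, 0), tolerance=40):
    """Return (x0, y0, x1, y1) enclosing all matching pixels, or None.

    `image` is a list of rows, each row a list of (r, g, b) tuples.
    The box is half-open: x1 and y1 are one past the last matching pixel.
    """
    xs, ys = [], []
    for y, row in enumerate(image):
        for x, pixel in enumerate(row):
            if is_highlight(pixel, target, tolerance):
                xs.append(x)
                ys.append(y)
    if not xs:
        return None
    return (min(xs), min(ys), max(xs) + 1, max(ys) + 1)


WHITE = (255, 255, 255)
YELLOW = (255, 255, 0)
FADED = (250, 240, 60)  # a washed-out yellow, as a poor scan might produce

image = [
    [WHITE, WHITE, WHITE, WHITE, WHITE],
    [WHITE, YELLOW, YELLOW, FADED, WHITE],
    [WHITE, YELLOW, YELLOW, FADED, WHITE],
    [WHITE, WHITE, WHITE, WHITE, WHITE],
]

print(highlight_bbox(image, tolerance=40))  # (1, 1, 3, 3): faded pixels missed
print(highlight_bbox(image, tolerance=70))  # (1, 1, 4, 3): faded pixels included
```

The sketch returns a single box around every matching pixel. A real detector would separate distinct highlighted regions and scale pixel coordinates back to PDF points (72 per inch) using the render resolution before extracting text.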

Troubleshooting

No highlights detected: Try adjusting the tolerance parameter for color detection
Missing formatting: Some PDFs don't store formatting as expected; manual adjustments may be needed
Performance issues with large PDFs: Process page ranges instead of the entire document
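
For the large-PDF case, the page ranges themselves are easy to compute. The helper below is a hypothetical sketch (`page_ranges` and `chunk_size` are names chosen for this example). This README does not document a page-range option for main.py, so splitting the PDF into those ranges before processing is left to you.

```python
def page_ranges(total_pages, chunk_size):
    """Yield (start, end) page ranges, 0-based, with `end` exclusive."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, total_pages, chunk_size):
        yield (start, min(start + chunk_size, total_pages))


# A 25-page document in chunks of 10 pages
print(list(page_ranges(25, 10)))  # [(0, 10), (10, 20), (20, 25)]
```

One possible approach is to write each range to a temporary PDF (for example with PyMuPDF's `insert_pdf`, which accepts `from_page` and `to_page`) and run the extractor on each piece in turn.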

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contributors

🔗 No0Bitah
📧 Contact me
