A Python program that parses PDF files and extracts their content into structured JSON, preserving the document's hierarchical organization.
- Multi-library approach: Uses PyMuPDF, pdfplumber, and Camelot for robust content extraction
- Content type detection: Automatically identifies and extracts:
  - Paragraphs and text content
  - Tables with accurate data preservation
  - Charts and images
- Hierarchical structure: Maintains page-level organization and section/sub-section relationships
- Section identification: Automatically detects document structure based on formatting
- Modular design: Clean, well-documented code with extensible architecture
- Python 3.8 or higher
- pip (Python package installer)
Before installing Python packages, you may need to install system dependencies:
# No additional system dependencies required

# Install system dependencies using Homebrew
brew install poppler

# Install system dependencies
sudo apt-get update
sudo apt-get install python3-tk ghostscript poppler-utils

- Clone or download this repository
  cd assignment
- Create a virtual environment (recommended)
  python -m venv pdf_parser_env
  # Activate the virtual environment
  # On Windows:
  pdf_parser_env\Scripts\activate
  # On macOS/Linux:
  source pdf_parser_env/bin/activate
- Install required packages
  pip install -r requirements.txt
If you encounter issues with specific packages:
- Camelot installation issues:
  pip install camelot-py[cv] --no-deps
  pip install opencv-python pandas numpy
- PyMuPDF installation issues:
  pip install --upgrade PyMuPDF
- Java dependency for tabula-py:
  - Ensure Java is installed on your system
  - On Windows: Download from Oracle or use choco install openjdk
  - On macOS: brew install openjdk
  - On Ubuntu: sudo apt-get install default-jdk
python pdf_parser.py <path_to_pdf_file>

# Specify output file
python pdf_parser.py input.pdf -o output.json
# Enable verbose logging
python pdf_parser.py input.pdf -v
# Full example
python pdf_parser.py fund_factsheet.pdf -o extracted_fund_data.json -v

- pdf_file (required): Path to the PDF file to parse
- -o, --output: Output JSON file path (default: extracted_content.json)
- -v, --verbose: Enable verbose logging for debugging
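The command-line interface described above could be wired up with argparse roughly as follows. This is a minimal sketch, not the actual contents of pdf_parser.py: the function name build_arg_parser is hypothetical, and only the argument names and the default output path are taken from the list above.

```python
import argparse


def build_arg_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented CLI (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description="Parse a PDF file and extract its content to JSON."
    )
    # Required positional argument: the PDF to parse.
    parser.add_argument("pdf_file", help="Path to the PDF file to parse")
    # Optional output path, defaulting to the documented file name.
    parser.add_argument(
        "-o", "--output",
        default="extracted_content.json",
        help="Output JSON file path",
    )
    # Boolean flag: present means verbose logging is enabled.
    parser.add_argument(
        "-v", "--verbose",
        action="store_true",
        help="Enable verbose logging for debugging",
    )
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_arg_parser().parse_args(["input.pdf", "-o", "output.json", "-v"])
print(args.pdf_file, args.output, args.verbose)
```

Calling parse_args() with no arguments would read sys.argv instead, which is what a real entry point would normally do.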
from pdf_parser import PDFParser

# Parse PDF and extract content
with PDFParser("path/to/your/file.pdf") as parser:
    extracted_data = parser.parse_pdf()
    parser.save_json(extracted_data, "output.json")

The program generates a JSON file with the following structure:
{
  "document_info": {
    "filename": "example.pdf",
    "total_pages": 5
  },
  "pages": [
    {
      "page_number": 1,
      "content": [
        {
          "type": "paragraph",
          "section": "Introduction",
          "sub_section": "Background",
          "text": "This is an example paragraph extracted from the PDF..."
        },
        {
          "type": "table",
          "section": "Financial Data",
          "sub_section": null,
          "description": "Table with 3 rows and 3 columns",
          "table_data": [
            ["Year", "Revenue", "Profit"],
            ["2022", "$10M", "$2M"],
            ["2023", "$12M", "$3M"]
          ]
        },
        {
          "type": "chart",
          "section": "Performance Overview",
          "sub_section": null,
          "description": "Chart/Image found on page 1",
          "table_data": null
        }
      ]
    }
  ]
}

- Paragraph: Text content with section hierarchy
  - type: "paragraph"
  - text: Extracted text content
  - section: Main section name
  - sub_section: Sub-section name (if applicable)
- Table: Tabular data extracted from the PDF
  - type: "table"
  - description: Brief description of the table
  - table_data: 2D array representing table rows and columns
  - section/sub_section: Inferred from nearby content
- Chart: Images and charts detected in the PDF
  - type: "chart"
  - description: Description of the detected chart/image
  - table_data: Chart data (if extractable, otherwise null)
  - section/sub_section: Inferred from nearby content
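Once the JSON file exists, downstream code can walk the structure by page and content type. The sketch below is illustrative only: the helper names iter_content and tables_by_section are hypothetical, and it assumes the file uses exactly the field names shown in the sample (pages, content, type, section, table_data).

```python
import json
from collections import defaultdict


def iter_content(document, content_type=None):
    """Yield (page_number, item) pairs, optionally filtered by item type."""
    for page in document.get("pages", []):
        for item in page.get("content", []):
            if content_type is None or item.get("type") == content_type:
                yield page.get("page_number"), item


def tables_by_section(document):
    """Group each table's rows by section name (None if no section is set)."""
    grouped = defaultdict(list)
    for _, item in iter_content(document, "table"):
        grouped[item.get("section")].append(item.get("table_data") or [])
    return dict(grouped)


# Inline sample using the documented field names, so the sketch runs on its own.
sample = json.loads("""
{
  "document_info": {"filename": "example.pdf", "total_pages": 1},
  "pages": [{
    "page_number": 1,
    "content": [
      {"type": "paragraph", "section": "Introduction",
       "sub_section": "Background", "text": "Example paragraph."},
      {"type": "table", "section": "Financial Data", "sub_section": null,
       "description": "Table with 2 rows and 2 columns",
       "table_data": [["Year", "Revenue"], ["2022", "$10M"]]}
    ]
  }]
}
""")

for page_number, item in iter_content(sample):
    print(page_number, item["type"], item["section"])
```

To process a real run, replace the inline sample with json.load() on the file written by the parser, for example output.json.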
Process the provided fund factsheet:
python pdf_parser.py fund_factsheet.pdf -o fund_analysis.json -v

This will:
- Parse the fund factsheet PDF
- Extract all text, tables, and charts
- Organize content hierarchically by sections
- Save the structured data to fund_analysis.json
- Display verbose progress information
The program combines several libraries, using each for the kind of content it handles best:
- PyMuPDF (fitz): Fast text extraction and document structure analysis
- pdfplumber: Detailed table detection and extraction
- Camelot: Advanced table extraction with computer vision
- tabula-py: Alternative table extraction method
- PDFParser Class: Main parser with context manager support
- Content Extractors: Specialized methods for different content types
- Section Detection: Automatic identification of document hierarchy
- Content Merging: Intelligent combination of different content types
- Chart data extraction is limited to detection; actual data extraction from charts requires additional OCR/ML techniques
- Complex nested tables may need manual verification
- Performance depends on PDF complexity and size
- Some PDFs with unusual formatting may require custom handling
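One way to combine several table extractors, as in the multi-library approach described above, is to try them in order and keep the first non-empty result. The sketch below is an assumption about how such a fallback might look, not a description of the parser's actual merging logic; first_non_empty, pdfplumber_tables, and camelot_tables are hypothetical helper names. The library calls used (pdfplumber.open, Page.extract_tables, camelot.read_pdf) are imported lazily, so the fallback helper itself works without them installed.

```python
import logging

logger = logging.getLogger(__name__)


def pdfplumber_tables(path, page_number):
    """Extract tables from one page (1-based) with pdfplumber."""
    import pdfplumber  # imported lazily; requires the pdfplumber package

    with pdfplumber.open(path) as pdf:
        return pdf.pages[page_number - 1].extract_tables()


def camelot_tables(path, page_number):
    """Extract tables from one page (1-based) with Camelot's lattice mode."""
    import camelot  # imported lazily; requires the camelot-py package

    tables = camelot.read_pdf(path, pages=str(page_number), flavor="lattice")
    # Each Camelot table exposes a pandas DataFrame; convert it to row lists.
    return [table.df.values.tolist() for table in tables]


def first_non_empty(extractors, path, page_number):
    """Return (name, tables) from the first extractor that finds any table.

    `extractors` is a list of (name, callable) pairs tried in order.
    Returns (None, []) if every extractor fails or finds nothing.
    """
    for name, extract in extractors:
        try:
            tables = extract(path, page_number)
        except Exception as exc:  # one failing backend should not abort parsing
            logger.warning("%s failed on page %d: %s", name, page_number, exc)
            continue
        if tables:
            return name, tables
    return None, []
```

A call might look like first_non_empty([("pdfplumber", pdfplumber_tables), ("camelot", camelot_tables)], "input.pdf", 1). Putting the cheaper extractor first keeps the slower computer-vision pass for pages where it is needed.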
To extend the functionality:
- Add new content extractors in the PDFParser class
- Enhance section detection algorithms
- Improve chart data extraction capabilities
- Add support for additional PDF formats
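As one possible starting point for the section-detection item, headings can be inferred from font size relative to the body text. This is a hypothetical heuristic, not necessarily what PDFParser does; classify_spans and its ratio thresholds are assumptions chosen for illustration. The input mimics the text and size fields of the spans that PyMuPDF returns from page.get_text("dict").

```python
from collections import Counter


def classify_spans(spans, heading_ratio=1.3, subheading_ratio=1.1):
    """Tag each body-text span with the section/sub-section it falls under.

    The most common (rounded) font size is treated as body text. Spans at
    least `heading_ratio` times larger start a new section; spans at least
    `subheading_ratio` times larger start a new sub-section.
    """
    if not spans:
        return []
    body_size = Counter(round(s["size"]) for s in spans).most_common(1)[0][0]
    section, sub_section, result = None, None, []
    for span in spans:
        text = span["text"].strip()
        if not text:
            continue  # skip whitespace-only spans
        if span["size"] >= body_size * heading_ratio:
            section, sub_section = text, None  # new section resets sub-section
        elif span["size"] >= body_size * subheading_ratio:
            sub_section = text
        else:
            result.append({"type": "paragraph", "section": section,
                           "sub_section": sub_section, "text": text})
    return result


# Made-up spans for illustration; real ones would come from the PDF.
sample_spans = [
    {"text": "Introduction", "size": 18.0},
    {"text": "Background", "size": 12.0},
    {"text": "First paragraph.", "size": 10.0},
    {"text": "Second paragraph.", "size": 10.0},
    {"text": "Financial Data", "size": 18.0},
    {"text": "Revenue grew.", "size": 10.0},
]

for item in classify_spans(sample_spans):
    print(item["section"], "/", item["sub_section"], "/", item["text"])
```

Font size alone misses headings that differ only by weight or colour, so a fuller version would likely also look at the span's font flags and position on the page.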
See requirements.txt for the complete list of dependencies. Key libraries:
- PyMuPDF (fitz): Core PDF processing
- pdfplumber: Table extraction
- camelot-py: Advanced table detection
- pandas: Data manipulation
- opencv-python: Image processing support
This project is provided as-is for educational and development purposes.
For issues or questions:
- Check that all dependencies are properly installed
- Verify that your PDF is not password-protected or corrupted
- Try with verbose logging (-v) to identify specific issues
- Ensure you have sufficient system resources for large PDF files
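The first check in this list can be partly automated. The following is a small sketch (the function names are hypothetical) that reports which Python modules and command-line tools are visible from the current environment. The module names assume the usual import names for the listed packages: fitz for PyMuPDF, camelot for camelot-py, and cv2 for opencv-python.

```python
import importlib.util
import shutil


def check_modules(names):
    """Map each top-level module name to True if it can be located."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


def check_tools(names):
    """Map each executable name to True if it is found on PATH."""
    return {name: shutil.which(name) is not None for name in names}


if __name__ == "__main__":
    modules = check_modules(["fitz", "pdfplumber", "camelot", "pandas", "cv2"])
    # "gs" is the Ghostscript executable name on Linux/macOS; on Windows the
    # binary is usually named differently, so treat a miss there with caution.
    tools = check_tools(["java", "gs"])
    for name, found in {**modules, **tools}.items():
        print(f"{name}: {'ok' if found else 'MISSING'}")
```

A module reported as missing usually means the virtual environment is not activated or pip install -r requirements.txt did not complete.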