PDF Parser and JSON Extractor

A Python program that parses PDF files and extracts their content into structured JSON while preserving the hierarchical organization of the document.

Features

Multi-library approach: Uses PyMuPDF, pdfplumber, Camelot, and tabula-py for robust content extraction
Content type detection: Automatically identifies and extracts:
- Paragraphs and text content
- Tables with accurate data preservation
- Charts and images
Hierarchical structure: Maintains page-level organization and section/sub-section relationships
Section identification: Automatically detects document structure based on formatting
Modular design: Clean, well-documented code with extensible architecture

Installation

Prerequisites

Python 3.8 or higher
pip (Python package installer)

System Dependencies

Before installing Python packages, you may need to install system dependencies:

Windows

# No additional system dependencies required

macOS

# Install system dependencies using Homebrew
brew install poppler

Ubuntu/Debian

# Install system dependencies
sudo apt-get update
sudo apt-get install python3-tk ghostscript poppler-utils

Python Dependencies

Clone or download this repository, then change into the project directory
```
cd assignment
```

Create a virtual environment (recommended)

python -m venv pdf_parser_env

# Activate the virtual environment
# On Windows:
pdf_parser_env\Scripts\activate

# On macOS/Linux:
source pdf_parser_env/bin/activate

Install required packages
```
pip install -r requirements.txt
```

Troubleshooting Installation

If you encounter issues with specific packages:

Camelot installation issues:

# Quote the extras so shells such as zsh do not treat the brackets as a glob pattern
pip install "camelot-py[cv]" --no-deps
pip install opencv-python pandas numpy

PyMuPDF installation issues:
```
pip install --upgrade PyMuPDF
```
Java dependency for tabula-py:
- Ensure Java is installed on your system
- On Windows: Download from Oracle or use choco install openjdk
- On macOS: brew install openjdk
- On Ubuntu: sudo apt-get install default-jdk

Usage

Basic Usage

python pdf_parser.py <path_to_pdf_file>

Advanced Usage

# Specify output file
python pdf_parser.py input.pdf -o output.json

# Enable verbose logging
python pdf_parser.py input.pdf -v

# Full example
python pdf_parser.py fund_factsheet.pdf -o extracted_fund_data.json -v

Command Line Arguments

pdf_file (required): Path to the PDF file to parse
-o, --output: Output JSON file path (default: extracted_content.json)
-v, --verbose: Enable verbose logging for debugging
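
The arguments listed above map naturally onto Python's standard argparse module. The following is a minimal sketch of how such an interface could be declared; the function name build_arg_parser is hypothetical, and the actual pdf_parser.py may be organized differently.

```python
import argparse


def build_arg_parser() -> argparse.ArgumentParser:
    """Declare the three documented arguments (hypothetical reconstruction)."""
    parser = argparse.ArgumentParser(
        prog="pdf_parser.py",
        description="Parse a PDF and write its content as structured JSON.",
    )
    # Positional argument: required by default.
    parser.add_argument("pdf_file", help="Path to the PDF file to parse")
    parser.add_argument(
        "-o", "--output",
        default="extracted_content.json",
        help="Output JSON file path",
    )
    # store_true makes -v a flag that defaults to False.
    parser.add_argument(
        "-v", "--verbose",
        action="store_true",
        help="Enable verbose logging for debugging",
    )
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_arg_parser().parse_args(["input.pdf", "-v"])
print(args.pdf_file, args.output, args.verbose)
```

Passing an explicit list to parse_args keeps the example independent of the real command line; a script entry point would call parse_args() with no arguments.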

Using as a Python Module

from pdf_parser import PDFParser

# Parse PDF and extract content
with PDFParser("path/to/your/file.pdf") as parser:
    extracted_data = parser.parse_pdf()
    parser.save_json(extracted_data, "output.json")

Output Format

The program generates a JSON file with the following structure:

{
  "document_info": {
    "filename": "example.pdf",
    "total_pages": 5
  },
  "pages": [
    {
      "page_number": 1,
      "content": [
        {
          "type": "paragraph",
          "section": "Introduction",
          "sub_section": "Background",
          "text": "This is an example paragraph extracted from the PDF..."
        },
        {
          "type": "table",
          "section": "Financial Data",
          "sub_section": null,
          "description": "Table with 3 rows and 3 columns",
          "table_data": [
            ["Year", "Revenue", "Profit"],
            ["2022", "$10M", "$2M"],
            ["2023", "$12M", "$3M"]
          ]
        },
        {
          "type": "chart",
          "section": "Performance Overview",
          "sub_section": null,
          "description": "Chart/Image found on page 1",
          "table_data": null
        }
      ]
    }
  ]
}
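
Because every content item carries a type field, the output is easy to post-process with the standard library alone. The sketch below counts items per type across all pages; the helper name count_content_types is illustrative and not part of the project.

```python
import json
from collections import Counter


def count_content_types(document: dict) -> Counter:
    """Count content items per "type" across all pages."""
    counts = Counter()
    for page in document.get("pages", []):
        for item in page.get("content", []):
            counts[item.get("type", "unknown")] += 1
    return counts


# A trimmed-down document in the format shown above.
sample = json.loads("""
{
  "document_info": {"filename": "example.pdf", "total_pages": 1},
  "pages": [
    {"page_number": 1, "content": [
      {"type": "paragraph", "section": "Introduction",
       "sub_section": "Background", "text": "Example paragraph"},
      {"type": "table", "section": "Financial Data", "sub_section": null,
       "description": "Table with 3 rows and 3 columns",
       "table_data": [["Year", "Revenue", "Profit"],
                      ["2022", "$10M", "$2M"],
                      ["2023", "$12M", "$3M"]]},
      {"type": "chart", "section": "Performance Overview", "sub_section": null,
       "description": "Chart/Image found on page 1", "table_data": null}
    ]}
  ]
}
""")

print(dict(count_content_types(sample)))
# → {'paragraph': 1, 'table': 1, 'chart': 1}
```

For a real run, replace the inline sample with json.load on the file written by the parser.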

Content Types

Paragraph: Text content with section hierarchy
- type: "paragraph"
- text: Extracted text content
- section: Main section name
- sub_section: Sub-section name (if applicable)
Table: Tabular data extracted from the PDF
- type: "table"
- description: Brief description of the table
- table_data: 2D array representing table rows and columns
- section/sub_section: Inferred from nearby content
Chart: Images and charts detected in the PDF
- type: "chart"
- description: Description of the detected chart/image
- table_data: Chart data (if extractable, otherwise null)
- section/sub_section: Inferred from nearby content
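
Since table_data is a plain 2D array, it can be reshaped without extra dependencies. The following sketch assumes the first row is a header, as in the example output above; the parser does not guarantee this, so treat it as an assumption. The function names are illustrative.

```python
import csv
import io


def table_to_records(table_data):
    """Turn a 2D table_data array into a list of dicts keyed by the first row."""
    if not table_data:
        return []  # chart items may carry table_data = null
    header, *rows = table_data
    return [dict(zip(header, row)) for row in rows]


def table_to_csv(table_data) -> str:
    """Serialise table_data as CSV text, one line per table row."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(table_data or [])
    return buffer.getvalue()


table = [
    ["Year", "Revenue", "Profit"],
    ["2022", "$10M", "$2M"],
    ["2023", "$12M", "$3M"],
]
print(table_to_records(table)[0])
# → {'Year': '2022', 'Revenue': '$10M', 'Profit': '$2M'}
```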

Example

Process the provided fund factsheet:

python pdf_parser.py fund_factsheet.pdf -o fund_analysis.json -v

This will:

Parse the fund factsheet PDF
Extract all text, tables, and charts
Organize content hierarchically by sections
Save the structured data to fund_analysis.json
Display verbose progress information

Architecture

The program combines several libraries, each suited to a different extraction task:

PyMuPDF (fitz): Fast text extraction and document structure analysis
pdfplumber: Detailed table detection and extraction
Camelot: Advanced table extraction with computer vision
tabula-py: Alternative table extraction method

Key Components

PDFParser Class: Main parser with context manager support
Content Extractors: Specialized methods for different content types
Section Detection: Automatic identification of document hierarchy
Content Merging: Intelligent combination of different content types
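
Section detection is described only as being based on formatting. A simple heuristic of that kind classifies text by font size and tags each paragraph with the most recent headings. The sketch below shows the idea; the Span type, the function name, and both size thresholds are hypothetical and would need tuning per document.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Span:
    """A run of text with its font size (hypothetical input type)."""
    text: str
    font_size: float


def assign_sections(
    spans: List[Span],
    section_size: float = 16.0,      # assumed threshold for section headings
    sub_section_size: float = 13.0,  # assumed threshold for sub-section headings
) -> List[Dict[str, Optional[str]]]:
    """Tag each body-text span with the latest section and sub-section."""
    section: Optional[str] = None
    sub_section: Optional[str] = None
    content: List[Dict[str, Optional[str]]] = []
    for span in spans:
        if span.font_size >= section_size:
            # A new section heading resets the current sub-section.
            section, sub_section = span.text, None
        elif span.font_size >= sub_section_size:
            sub_section = span.text
        else:
            content.append({
                "type": "paragraph",
                "section": section,
                "sub_section": sub_section,
                "text": span.text,
            })
    return content


spans = [
    Span("Introduction", 18.0),
    Span("Background", 14.0),
    Span("Body text.", 10.0),
    Span("Financial Data", 18.0),
    Span("More text.", 10.0),
]
for item in assign_sections(spans):
    print(item["section"], item["sub_section"], item["text"])
```

The resulting dictionaries use the same keys as the paragraph items in the output format.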

Limitations

Chart handling is limited to detection; extracting the underlying data from charts requires additional OCR/ML techniques
Complex nested tables may need manual verification
Performance depends on PDF complexity and size
Some PDFs with unusual formatting may require custom handling

Contributing

To extend the functionality:

Add new content extractors in the PDFParser class
Enhance section detection algorithms
Improve chart data extraction capabilities
Add support for additional PDF formats

Dependencies

See requirements.txt for the complete list of dependencies. Key libraries:

PyMuPDF (fitz): Core PDF processing
pdfplumber: Table extraction
camelot-py: Advanced table detection
tabula-py: Alternative table extraction (requires Java)
pandas: Data manipulation
opencv-python: Image processing support

License

This project is provided as-is for educational and development purposes.

Support

For issues or questions:

Check that all dependencies are properly installed
Verify that your PDF is not password-protected or corrupted
Try with verbose logging (-v) to identify specific issues
Ensure you have sufficient system resources for large PDF files
