hriday2847/allitus_code

PDF Parser and JSON Extractor

A Python program that parses PDF files and extracts their content into structured JSON, preserving the document's page, section, and sub-section hierarchy.

Features

  • Multi-library approach: Uses PyMuPDF, pdfplumber, Camelot, and tabula-py for robust content extraction
  • Content type detection: Automatically identifies and extracts:
    • Paragraphs and text content
    • Tables with accurate data preservation
    • Charts and images
  • Hierarchical structure: Maintains page-level organization and section/sub-section relationships
  • Section identification: Automatically detects document structure based on formatting
  • Modular design: Clean, well-documented code with extensible architecture

Installation

Prerequisites

  • Python 3.8 or higher
  • pip (Python package installer)

System Dependencies

Before installing Python packages, you may need to install system dependencies:

Windows

# No additional system dependencies required

macOS

# Install system dependencies using Homebrew
brew install poppler

Ubuntu/Debian

# Install system dependencies
sudo apt-get update
sudo apt-get install python3-tk ghostscript poppler-utils

Python Dependencies

  1. Clone or download this repository

    cd assignment
  2. Create a virtual environment (recommended)

    python -m venv pdf_parser_env
    
    # Activate the virtual environment
    # On Windows:
    pdf_parser_env\Scripts\activate
    
    # On macOS/Linux:
    source pdf_parser_env/bin/activate
  3. Install required packages

    pip install -r requirements.txt

Troubleshooting Installation

If you encounter issues with specific packages:

  • Camelot installation issues:

    # Quote the extras so shells such as zsh do not expand the brackets
    pip install "camelot-py[cv]" --no-deps
    pip install opencv-python pandas numpy
  • PyMuPDF installation issues:

    pip install --upgrade PyMuPDF
  • Java dependency for tabula-py:

    • Ensure Java is installed on your system
    • On Windows: Download from Oracle or use choco install openjdk
    • On macOS: brew install openjdk
    • On Ubuntu: sudo apt-get install default-jdk

Usage

Basic Usage

python pdf_parser.py <path_to_pdf_file>

Advanced Usage

# Specify output file
python pdf_parser.py input.pdf -o output.json

# Enable verbose logging
python pdf_parser.py input.pdf -v

# Full example
python pdf_parser.py fund_factsheet.pdf -o extracted_fund_data.json -v

Command Line Arguments

  • pdf_file (required): Path to the PDF file to parse
  • -o, --output: Output JSON file path (default: extracted_content.json)
  • -v, --verbose: Enable verbose logging for debugging
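
The arguments listed above map naturally onto Python's standard argparse module. The following is a minimal sketch of how such a command line could be declared; build_arg_parser is a hypothetical name, and the actual pdf_parser.py may define its arguments differently.

```python
import argparse


def build_arg_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented arguments (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="pdf_parser.py",
        description="Parse a PDF file and extract its content to JSON.",
    )
    # Required positional argument: the PDF to parse.
    parser.add_argument("pdf_file", help="Path to the PDF file to parse")
    # Optional output path, using the default documented above.
    parser.add_argument(
        "-o", "--output",
        default="extracted_content.json",
        help="Output JSON file path",
    )
    # Boolean flag: present means verbose logging is on.
    parser.add_argument(
        "-v", "--verbose",
        action="store_true",
        help="Enable verbose logging for debugging",
    )
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_arg_parser().parse_args(["input.pdf", "-o", "output.json", "-v"])
defaults = build_arg_parser().parse_args(["input.pdf"])
```

Here args.output is "output.json" and args.verbose is True, while defaults falls back to extracted_content.json with verbose logging off.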

Using as a Python Module

from pdf_parser import PDFParser

# Parse PDF and extract content
with PDFParser("path/to/your/file.pdf") as parser:
    extracted_data = parser.parse_pdf()
    parser.save_json(extracted_data, "output.json")

Output Format

The program generates a JSON file with the following structure:

{
  "document_info": {
    "filename": "example.pdf",
    "total_pages": 5
  },
  "pages": [
    {
      "page_number": 1,
      "content": [
        {
          "type": "paragraph",
          "section": "Introduction",
          "sub_section": "Background",
          "text": "This is an example paragraph extracted from the PDF..."
        },
        {
          "type": "table",
          "section": "Financial Data",
          "sub_section": null,
          "description": "Table with 3 rows and 3 columns",
          "table_data": [
            ["Year", "Revenue", "Profit"],
            ["2022", "$10M", "$2M"],
            ["2023", "$12M", "$3M"]
          ]
        },
        {
          "type": "chart",
          "section": "Performance Overview",
          "sub_section": null,
          "description": "Chart/Image found on page 1",
          "table_data": null
        }
      ]
    }
  ]
}
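
Because the output is plain JSON, it can be consumed with the standard library alone. The sketch below walks the pages and content lists shown above and counts items by type; count_content_types is a hypothetical helper and not part of the parser, and the sample is a trimmed copy of the example document.

```python
import json
from collections import Counter


def count_content_types(document: dict) -> Counter:
    """Count content items per "type" across all pages."""
    counts = Counter()
    for page in document.get("pages", []):
        for item in page.get("content", []):
            counts[item.get("type", "unknown")] += 1
    return counts


# A trimmed-down document in the format shown above.
sample = json.loads("""
{
  "document_info": {"filename": "example.pdf", "total_pages": 1},
  "pages": [
    {"page_number": 1, "content": [
      {"type": "paragraph", "section": "Introduction",
       "sub_section": "Background", "text": "..."},
      {"type": "table", "section": "Financial Data", "sub_section": null,
       "description": "Table with 3 rows and 3 columns",
       "table_data": [["Year", "Revenue", "Profit"],
                      ["2022", "$10M", "$2M"],
                      ["2023", "$12M", "$3M"]]},
      {"type": "chart", "section": "Performance Overview", "sub_section": null,
       "description": "Chart/Image found on page 1", "table_data": null}
    ]}
  ]
}
""")

counts = count_content_types(sample)
```

For a real run, replace the inline sample with json.load on the file written by the parser (output.json in the earlier examples).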

Content Types

  1. Paragraph: Text content with section hierarchy

    • type: "paragraph"
    • text: Extracted text content
    • section: Main section name
    • sub_section: Sub-section name (if applicable)
  2. Table: Tabular data extracted from the PDF

    • type: "table"
    • description: Brief description of the table
    • table_data: 2D array representing table rows and columns
    • section/sub_section: Inferred from nearby content
  3. Chart: Images and charts detected in the PDF

    • type: "chart"
    • description: Description of the detected chart/image
    • table_data: Chart data (if extractable, otherwise null)
    • section/sub_section: Inferred from nearby content
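
Since table_data is a 2D array, downstream code often wants rows keyed by column name. The following sketch converts it to a list of dictionaries. It assumes the first row is a header row, which matches the example output but is not guaranteed for every extracted table; table_to_records is a hypothetical helper.

```python
from typing import Dict, List, Optional


def table_to_records(table_data: Optional[List[List[str]]]) -> List[Dict[str, str]]:
    """Turn a 2D table (header row first) into a list of row dicts.

    Returns an empty list when table_data is None (as for charts)
    or contains no data rows.
    """
    if not table_data or len(table_data) < 2:
        return []
    header, *rows = table_data
    # Assumption: the first row holds the column names.
    return [dict(zip(header, row)) for row in rows]


records = table_to_records([
    ["Year", "Revenue", "Profit"],
    ["2022", "$10M", "$2M"],
    ["2023", "$12M", "$3M"],
])
```

Note that zip stops at the shorter sequence, so a row with fewer cells than the header simply yields fewer keys.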

Example

Process the provided fund factsheet:

python pdf_parser.py fund_factsheet.pdf -o fund_analysis.json -v

This will:

  1. Parse the fund factsheet PDF
  2. Extract all text, tables, and charts
  3. Organize content hierarchically by sections
  4. Save the structured data to fund_analysis.json
  5. Display verbose progress information

Architecture

The program combines several libraries, each covering a different part of the extraction:

  • PyMuPDF (fitz): Fast text extraction and document structure analysis
  • pdfplumber: Detailed table detection and extraction
  • Camelot: Advanced table extraction with computer vision
  • tabula-py: Alternative table extraction method

Key Components

  1. PDFParser Class: Main parser with context manager support
  2. Content Extractors: Specialized methods for different content types
  3. Section Detection: Automatic identification of document hierarchy
  4. Content Merging: Intelligent combination of different content types
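
This README does not spell out how section detection works beyond being based on formatting. One common heuristic, shown here purely as an illustration, is to treat text noticeably larger than the typical body font as a heading. The input shape, the assign_sections name, and both ratio thresholds are assumptions for this sketch and are not taken from the PDFParser class.

```python
from statistics import median
from typing import List, Optional, Tuple

# (text, font size in points); a hypothetical, simplified span shape.
Span = Tuple[str, float]


def assign_sections(
    spans: List[Span],
    heading_ratio: float = 1.2,
    sub_ratio: float = 1.05,
) -> List[dict]:
    """Label each body span with the most recent heading and sub-heading."""
    if not spans:
        return []
    # Assumption: the median font size approximates the body text size.
    body_size = median(size for _, size in spans)
    section: Optional[str] = None
    sub_section: Optional[str] = None
    labelled = []
    for text, size in spans:
        if size >= body_size * heading_ratio:
            # A new section resets the current sub-section.
            section, sub_section = text, None
        elif size >= body_size * sub_ratio:
            sub_section = text
        else:
            labelled.append({
                "type": "paragraph",
                "section": section,
                "sub_section": sub_section,
                "text": text,
            })
    return labelled


example_spans = [
    ("Introduction", 18.0),
    ("Background", 13.0),
    ("This is body text.", 11.0),
    ("More body text.", 11.0),
    ("Financial Data", 18.0),
    ("Revenue grew.", 11.0),
    ("Profit grew.", 11.0),
]
labelled = assign_sections(example_spans)
```

A production detector would likely also consider bold weight, position on the page, and numbering patterns; font size alone misclassifies documents whose headings share the body size.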

Limitations

  • Charts are detected but their underlying data is not extracted; recovering data from charts would require additional OCR/ML techniques
  • Complex nested tables may need manual verification
  • Performance depends on PDF complexity and size
  • Some PDFs with unusual formatting may require custom handling

Contributing

To extend the functionality:

  1. Add new content extractors in the PDFParser class
  2. Enhance section detection algorithms
  3. Improve chart data extraction capabilities
  4. Add support for additional PDF formats

Dependencies

See requirements.txt for the complete list of dependencies. Key libraries:

  • PyMuPDF (fitz): Core PDF processing
  • pdfplumber: Table extraction
  • camelot-py: Advanced table detection
  • tabula-py: Alternative table extraction (requires Java)
  • pandas: Data manipulation
  • opencv-python: Image processing support

License

This project is provided as-is for educational and development purposes.

Support

For issues or questions:

  1. Check that all dependencies are properly installed
  2. Verify that your PDF is not password-protected or corrupted
  3. Try with verbose logging (-v) to identify specific issues
  4. Ensure you have sufficient system resources for large PDF files
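
The second check above can be partly automated before running the parser. The sketch below is a rough, standard-library-only heuristic and not a feature of this project: it looks for the %PDF- header, the %%EOF trailer marker, and an /Encrypt entry. preflight_pdf is a hypothetical name, and a file can pass all three checks yet still be damaged.

```python
from pathlib import Path
from typing import List


def preflight_pdf(path: str) -> List[str]:
    """Return a list of likely problems; an empty list means none were spotted."""
    problems = []
    data = Path(path).read_bytes()
    # PDF files start with a "%PDF-" header (searched near the start).
    if b"%PDF-" not in data[:1024]:
        problems.append("missing %PDF- header: file may not be a PDF")
    # A complete file normally ends with an "%%EOF" marker.
    if b"%%EOF" not in data[-1024:]:
        problems.append("missing %%EOF marker: file may be truncated or corrupted")
    # Encrypted PDFs reference an /Encrypt dictionary in the trailer.
    if b"/Encrypt" in data:
        problems.append("/Encrypt entry found: file may be password-protected")
    return problems
```

Calling preflight_pdf("input.pdf") and printing any returned messages gives a quick hint before a longer run with -v. The function reads the whole file into memory, which is acceptable for a quick check but worth revisiting for very large PDFs.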
