PDF Data Processing Architecture

Overview

OpenContracts uses a modern, pluggable document processing pipeline that has evolved from the original PAWLs/Grobid approach to support multiple advanced parsing backends. The system now leverages state-of-the-art machine learning models while maintaining backward compatibility with the PAWLs data format.

Current Architecture

Parser Pipeline System

OpenContracts implements a modular pipeline architecture with four main parser options:

Docling Parser (Primary) - IBM's advanced ML-based parser running as a REST microservice
- Source: opencontractserver/pipeline/parsers/docling_parser_rest.py
- Superior layout understanding and table extraction
- Intelligent OCR with automatic detection
- Hierarchical document structure extraction
- Group relationship detection for contract clauses
LlamaParse Parser - Cloud-based parser using LlamaIndex API
- Source: opencontractserver/pipeline/parsers/llamaparse_parser.py
- High-quality layout extraction
- Automatic OCR support
- Good for complex document structures
Text Parser - Simple parser for plain text and markdown files
- Source: opencontractserver/pipeline/parsers/oc_text_parser.py
- Direct text extraction
- Minimal processing overhead
- Preserves original formatting

Data Layers

OpenContracts maintains a multi-layered data architecture that provides a consistent interface regardless of the parsing backend used:

1. PAWLs Layer (JSON)

The PAWLs (PDF Annotation With Labels) layer remains the core data format, storing:

Individual tokens (words) with precise bounding box coordinates
Page dimensions and layout information
Token-level positional data enabling pixel-perfect annotation overlay
Hierarchical structure information (headers, paragraphs, lists)

{
  "pawls_file_content": [
    {
      "page": {"width": 612, "height": 792, "index": 0},
      "tokens": [
        {
          "text": "Contract",
          "bbox": {"x": 100, "y": 50, "width": 80, "height": 15}
        }
      ]
    }
  ]
}

2. Text Layer

A pure text extraction built from the PAWLs layer that:

Preserves reading order
Maintains paragraph and section boundaries
Enables full-text search and NLP processing
Provides character-level position mapping back to PAWLs tokens

3. Annotation Layer

Structural and semantic annotations including:

Document structure (headers, sections, paragraphs)
Detected entities and labels
User-created annotations
Relationships between document elements

4. Relationship Layer (New)

Advanced parsers like Docling can detect relationships between document elements:

Parent-child hierarchies (section → subsection)
Cross-references between clauses
Grouped elements (related paragraphs, list items)
Table cell relationships

Processing Pipeline

graph LR
    A[PDF Upload] --> B{Parser Selection}
    B --> C[Docling REST API]
    B --> D[LlamaParse API]
    B --> E[Text Parser]
    B --> P[LlamaParse API]

    C --> F[PAWLs Generation]
    D --> F
    E --> F
    P --> F

    F --> G[Text Extraction]
    F --> H[Annotation Creation]
    F --> I[Relationship Mapping]

    G --> J[Searchable Text Layer]
    H --> K[Visual Annotations]
    I --> L[Document Graph]

Evolution from Original PAWLs

The original OpenContracts implementation used:

Grobid for layout analysis
Tesseract for OCR
Re-OCR of every document for consistency

The current system has evolved to:

Use modern ML models for better accuracy
Support multiple parsing backends
Preserve embedded text when appropriate
Only apply OCR when needed (configurable)
Extract richer structural information

Key Improvements

Better Accuracy: ML-based parsers provide superior layout understanding
Flexibility: Choose the right parser for your document types
Performance: Microservice architecture enables better scaling
Rich Structure: Extract hierarchies and relationships, not just text
Selective OCR: Only OCR when needed, preserving original text quality

Maintaining Compatibility

Despite the architectural evolution, OpenContracts maintains full compatibility:

PAWLs format remains the standard interface
All parsers output to the same data structure
Existing annotations and tools continue to work
Text-to-position mapping preserved

Configuration

Parsers are configured in Django settings. See the base settings file for current defaults.

Available Parser Classes:

opencontractserver.pipeline.parsers.docling_parser_rest.DoclingParser - Primary ML-based parser
opencontractserver.pipeline.parsers.oc_text_parser.TxtParser - Plain text parser
opencontractserver.pipeline.parsers.llamaparse_parser.LlamaParseParser - LlamaIndex cloud parser

Example Configuration:

PREFERRED_PARSERS = {
    "application/pdf": "opencontractserver.pipeline.parsers.docling_parser_rest.DoclingParser",
    "text/plain": "opencontractserver.pipeline.parsers.oc_text_parser.TxtParser",
}

# Parser-specific settings
DOCLING_PARSER_SERVICE_URL = "http://docling-parser:8000/parse/"
DOCLING_PARSER_TIMEOUT = 300

Limitations and Trade-offs

Current Limitations

OCR Quality: While improved, OCR can still make errors (O vs 0, I vs 1)
Processing Time: ML-based parsers are slower than simple text extraction
Resource Usage: Advanced parsers require more memory and CPU
Format Support: Currently limited to PDF and text formats

Design Trade-offs

Accuracy vs Speed: ML parsers are more accurate but slower
Flexibility vs Complexity: Multiple parsers add configuration complexity
Consistency vs Fidelity: Standardizing to PAWLs format may lose some format-specific features

Future Directions

Additional Format Support: DOCX, XLSX, HTML parsing
Streaming Processing: Handle very large documents efficiently
Custom Parser Plugins: Easy integration of domain-specific parsers
Enhanced Relationships: More sophisticated document graph analysis
Hybrid Processing: Combine multiple parsers for optimal results

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

PDF Data Processing Architecture

Overview

Current Architecture

Parser Pipeline System

Data Layers

1. PAWLs Layer (JSON)

2. Text Layer

3. Annotation Layer

4. Relationship Layer (New)

Processing Pipeline

Evolution from Original PAWLs

Key Improvements

Maintaining Compatibility

Configuration

Limitations and Trade-offs

Current Limitations

Design Trade-offs

Future Directions

See Also

Uh oh!

FilesExpand file tree

PDF-data-layer.md

Latest commit

History

PDF-data-layer.md

File metadata and controls

PDF Data Processing Architecture

Overview

Current Architecture

Parser Pipeline System

Data Layers

1. PAWLs Layer (JSON)

2. Text Layer

3. Annotation Layer

4. Relationship Layer (New)

Processing Pipeline

Evolution from Original PAWLs

Key Improvements

Maintaining Compatibility

Configuration

Limitations and Trade-offs

Current Limitations

Design Trade-offs

Future Directions

See Also