OpenContracts uses a modern, pluggable document processing pipeline that has evolved from the original PAWLs/Grobid approach to support multiple advanced parsing backends. The system now leverages state-of-the-art machine learning models while maintaining backward compatibility with the PAWLs data format.
OpenContracts implements a modular pipeline architecture with three main parser options:
- Docling Parser (Primary) - IBM's advanced ML-based parser running as a REST microservice
  - Source: opencontractserver/pipeline/parsers/docling_parser_rest.py
  - Superior layout understanding and table extraction
  - Intelligent OCR with automatic detection
  - Hierarchical document structure extraction
  - Group relationship detection for contract clauses
- LlamaParse Parser - Cloud-based parser using LlamaIndex API
  - Source: opencontractserver/pipeline/parsers/llamaparse_parser.py
  - High-quality layout extraction
  - Automatic OCR support
  - Good for complex document structures
- Text Parser - Simple parser for plain text and markdown files
  - Source: opencontractserver/pipeline/parsers/oc_text_parser.py
  - Direct text extraction
  - Minimal processing overhead
  - Preserves original formatting

OpenContracts maintains a multi-layered data architecture that provides a consistent interface regardless of the parsing backend used:
The PAWLs (PDF Annotation With Labels) layer remains the core data format, storing:
- Individual tokens (words) with precise bounding box coordinates
- Page dimensions and layout information
- Token-level positional data enabling pixel-perfect annotation overlay
- Hierarchical structure information (headers, paragraphs, lists)

{
  "pawls_file_content": [
    {
      "page": {"width": 612, "height": 792, "index": 0},
      "tokens": [
        {
          "text": "Contract",
          "bbox": {"x": 100, "y": 50, "width": 80, "height": 15}
        }
      ]
    }
  ]
}

The text layer is a pure text extraction built from the PAWLs layer that:
- Preserves reading order
- Maintains paragraph and section boundaries
- Enables full-text search and NLP processing
- Provides character-level position mapping back to PAWLs tokens
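
The character-to-token mapping described above can be illustrated with a minimal sketch. It assumes the page/token shape shown in the PAWLs example and joins tokens with a single space; the function names `build_text_layer` and `token_at` are hypothetical, and the real pipeline may handle paragraph and section boundaries differently.

```python
from typing import Any, Dict, List, Optional, Tuple

# (start_char, end_char, page_index, token_index) for one token
Span = Tuple[int, int, int, int]


def build_text_layer(pages: List[Dict[str, Any]]) -> Tuple[str, List[Span]]:
    """Join PAWLs tokens into one string and record where each token landed."""
    parts: List[str] = []
    spans: List[Span] = []
    cursor = 0
    for page in pages:
        page_index = page["page"]["index"]
        for token_index, token in enumerate(page["tokens"]):
            if parts:
                # Assumption: tokens are separated by exactly one space.
                parts.append(" ")
                cursor += 1
            start = cursor
            parts.append(token["text"])
            cursor += len(token["text"])
            spans.append((start, cursor, page_index, token_index))
    return "".join(parts), spans


def token_at(spans: List[Span], char_offset: int) -> Optional[Tuple[int, int]]:
    """Return (page_index, token_index) for a character, or None on a separator."""
    for start, end, page_index, token_index in spans:
        if start <= char_offset < end:
            return page_index, token_index
    return None


pages = [
    {
        "page": {"width": 612, "height": 792, "index": 0},
        "tokens": [
            {"text": "Contract", "bbox": {"x": 100, "y": 50, "width": 80, "height": 15}},
            {"text": "Agreement", "bbox": {"x": 185, "y": 50, "width": 95, "height": 15}},
        ],
    }
]
text, spans = build_text_layer(pages)
```

Here `text` is `"Contract Agreement"`, and `token_at(spans, 10)` points back to the second token on page 0, whose `bbox` can then be used to draw an overlay.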
A further layer holds structural and semantic annotations, including:
- Document structure (headers, sections, paragraphs)
- Detected entities and labels
- User-created annotations
- Relationships between document elements
Advanced parsers like Docling can detect relationships between document elements:
- Parent-child hierarchies (section → subsection)
- Cross-references between clauses
- Grouped elements (related paragraphs, list items)
- Table cell relationships
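
One way to picture these relationships is as labeled edges between element ids. The sketch below is illustrative only: the `Relationship` class, the label names (`parent_of`, `references`), and the ids are assumptions made for the example, not the OpenContracts data model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Relationship:
    label: str              # hypothetical labels: "parent_of", "references"
    source_ids: List[str]
    target_ids: List[str]


def children_of(element_id: str, relationships: List[Relationship]) -> List[str]:
    """Direct children of an element, following only parent_of edges."""
    found: List[str] = []
    for rel in relationships:
        if rel.label == "parent_of" and element_id in rel.source_ids:
            found.extend(rel.target_ids)
    return found


def descendants_of(element_id: str, relationships: List[Relationship]) -> List[str]:
    """Depth-first walk of the hierarchy below an element."""
    result: List[str] = []
    seen = set()
    stack = list(reversed(children_of(element_id, relationships)))
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guard against cycles in malformed data
        seen.add(current)
        result.append(current)
        stack.extend(reversed(children_of(current, relationships)))
    return result


relationships = [
    Relationship("parent_of", ["sec-1"], ["sec-1.1", "sec-1.2"]),
    Relationship("parent_of", ["sec-1.1"], ["para-a"]),
    Relationship("references", ["sec-1.2"], ["sec-1.1"]),  # cross-reference, not hierarchy
]
```

With this data, `descendants_of("sec-1", relationships)` walks section → subsection → paragraph and skips the cross-reference edge.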

graph LR
    A[PDF Upload] --> B{Parser Selection}
    B --> C[Docling REST API]
    B --> D[LlamaParse API]
    B --> E[Text Parser]
    C --> F[PAWLs Generation]
    D --> F
    E --> F
    F --> G[Text Extraction]
    F --> H[Annotation Creation]
    F --> I[Relationship Mapping]
    G --> J[Searchable Text Layer]
    H --> K[Visual Annotations]
    I --> L[Document Graph]

The original OpenContracts implementation used:
- Grobid for layout analysis
- Tesseract for OCR
- Re-OCR of every document for consistency
The current system has evolved to:
- Use modern ML models for better accuracy
- Support multiple parsing backends
- Preserve embedded text when appropriate
- Only apply OCR when needed (configurable)
- Extract richer structural information

The new pipeline offers several benefits:
- Better Accuracy: ML-based parsers provide superior layout understanding
- Flexibility: Choose the right parser for your document types
- Performance: Microservice architecture enables better scaling
- Rich Structure: Extract hierarchies and relationships, not just text
- Selective OCR: Only OCR when needed, preserving original text quality
Despite the architectural evolution, OpenContracts maintains full compatibility:
- PAWLs format remains the standard interface
- All parsers output to the same data structure
- Existing annotations and tools continue to work
- Text-to-position mapping is preserved
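
Because every parser is expected to emit the same PAWLs-shaped structure, a small shape check can be run against any backend's output. The following is a sketch under that assumption; `validate_pawls_pages` is a hypothetical helper, and the required keys are taken only from the PAWLs example shown earlier.

```python
from typing import Any, Dict, List

PAGE_KEYS = ("width", "height", "index")
BBOX_KEYS = ("x", "y", "width", "height")


def validate_pawls_pages(pages: List[Dict[str, Any]]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the shape is OK."""
    problems: List[str] = []
    for page_number, page in enumerate(pages):
        page_info = page.get("page", {})
        for key in PAGE_KEYS:
            if key not in page_info:
                problems.append(f"page {page_number}: missing page.{key}")
        for token_number, token in enumerate(page.get("tokens", [])):
            where = f"page {page_number}, token {token_number}"
            if not isinstance(token.get("text"), str):
                problems.append(f"{where}: missing text")
            bbox = token.get("bbox", {})
            for key in BBOX_KEYS:
                if key not in bbox:
                    problems.append(f"{where}: missing bbox.{key}")
    return problems


good_pages = [
    {
        "page": {"width": 612, "height": 792, "index": 0},
        "tokens": [
            {"text": "Contract", "bbox": {"x": 100, "y": 50, "width": 80, "height": 15}}
        ],
    }
]
bad_pages = [
    {
        "page": {"width": 612, "index": 0},  # height is missing
        "tokens": [{"text": "x", "bbox": {"x": 1, "y": 2}}],  # bbox is incomplete
    }
]
```

A check like this could run in tests for each parser so that downstream annotation tools only ever see one structure.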
Parsers are configured in Django settings. See the base settings file for current defaults.
Available Parser Classes:
- opencontractserver.pipeline.parsers.docling_parser_rest.DoclingParser - Primary ML-based parser
- opencontractserver.pipeline.parsers.oc_text_parser.TxtParser - Plain text parser
- opencontractserver.pipeline.parsers.llamaparse_parser.LlamaParseParser - LlamaIndex cloud parser
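
Since parsers are named by dotted class path, a setting such as `PREFERRED_PARSERS` can be resolved to a class at runtime. The sketch below shows one plausible way to do that with the standard library; `load_class` and `parser_class_for` are hypothetical helpers, not functions from the OpenContracts codebase, and the demo mapping uses standard-library classes as stand-ins so that it runs anywhere.

```python
import json
from importlib import import_module
from typing import Dict


def load_class(dotted_path: str) -> type:
    """Import 'package.module.ClassName' and return the class object."""
    module_path, _, class_name = dotted_path.rpartition(".")
    if not module_path:
        raise ValueError(f"{dotted_path!r} is not a dotted class path")
    module = import_module(module_path)
    return getattr(module, class_name)


def parser_class_for(mime_type: str, preferred_parsers: Dict[str, str]) -> type:
    """Look up the configured parser class for a MIME type."""
    try:
        dotted_path = preferred_parsers[mime_type]
    except KeyError:
        raise LookupError(f"No parser configured for {mime_type!r}") from None
    return load_class(dotted_path)


# Stand-in mapping: stdlib classes take the place of real parser classes.
demo_mapping = {
    "application/json": "json.JSONDecoder",
    "text/plain": "io.StringIO",
}
decoder_cls = parser_class_for("application/json", demo_mapping)
```

In a Django project, `django.utils.module_loading.import_string` performs the same dotted-path import and could replace `load_class`.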
Example Configuration:

PREFERRED_PARSERS = {
    "application/pdf": "opencontractserver.pipeline.parsers.docling_parser_rest.DoclingParser",
    "text/plain": "opencontractserver.pipeline.parsers.oc_text_parser.TxtParser",
}

# Parser-specific settings
DOCLING_PARSER_SERVICE_URL = "http://docling-parser:8000/parse/"
DOCLING_PARSER_TIMEOUT = 300

Known limitations:
- OCR Quality: While improved, OCR can still make errors (O vs 0, I vs 1)
- Processing Time: ML-based parsers are slower than simple text extraction
- Resource Usage: Advanced parsers require more memory and CPU
- Format Support: Currently limited to PDF and text formats

Design trade-offs:
- Accuracy vs Speed: ML parsers are more accurate but slower
- Flexibility vs Complexity: Multiple parsers add configuration complexity
- Consistency vs Fidelity: Standardizing to PAWLs format may lose some format-specific features

Possible future enhancements:
- Additional Format Support: DOCX, XLSX, HTML parsing
- Streaming Processing: Handle very large documents efficiently
- Custom Parser Plugins: Easy integration of domain-specific parsers
- Enhanced Relationships: More sophisticated document graph analysis
- Hybrid Processing: Combine multiple parsers for optimal results