High-Performance Asynchronous Document Processing Engine
Transform unstructured PDFs and images into structured, queryable data with GPU-accelerated OCR
Quick Start • Documentation • Architecture • API Reference • Contributing
A production-grade asynchronous OCR processing engine built for enterprise-scale document intelligence. The system combines FastAPI's high-performance web framework with Celery's distributed task processing, powered by PaddleOCR's PP-StructureV3 pipeline for state-of-the-art accuracy.
The service takes a raw document page and returns OCR detections (text, bounding boxes, and confidence scores) that you can visualize or use to extract structured information.
Left: original page. Right: OCR detections overlaid as text regions.
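As a rough illustration of the "extract structured information" step, the sketch below groups detections by page and joins their text in an approximate reading order. The field names (`text`, `box`, `confidence`, `page_number`) follow the response format documented in the API reference; the helper name `page_text` and the `min_confidence` default of 0.5 are assumptions for this example, not part of the service.

```python
from collections import defaultdict


def page_text(detections, min_confidence=0.5):
    """Group detections by page and list their text in rough reading order.

    Hypothetical helper: the 0.5 threshold is an arbitrary starting point.
    """
    pages = defaultdict(list)
    for det in detections:
        if det["confidence"] >= min_confidence:
            pages[det["page_number"]].append(det)

    result = {}
    for page, dets in sorted(pages.items()):
        # Sort by the top edge, then the left edge, of each polygon.
        dets.sort(
            key=lambda d: (
                min(p[1] for p in d["box"]),
                min(p[0] for p in d["box"]),
            )
        )
        result[page] = [d["text"] for d in dets]
    return result
```

This naive top-to-bottom, left-to-right sort will interleave columns on multi-column layouts, so treat it as a starting point only.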
Core Capabilities:
- Asynchronous Processing: Non-blocking API with real-time status tracking
- GPU Acceleration: Optimized for NVIDIA CUDA environments
- Multi-format Support: PDFs and image formats (PNG, JPG, TIFF)
- Enterprise Architecture: Scalable microservices with Redis message brokering
- Production Ready: Containerized deployment with comprehensive error handling
Technical Stack:
- Backend: FastAPI with Pydantic validation
- Task Queue: Celery with Redis broker
- OCR Engine: PaddleOCR PP-StructureV3 pipeline
- Containerization: Docker Compose with GPU support
- AI Models: 13 specialized models for layout detection, text recognition, and table extraction
The system implements a microservices architecture optimized for high-throughput document processing:
```mermaid
graph TB
    subgraph "Client Layer"
        A[Client Application]
        B[Python Requests]
        C[cURL/HTTP]
    end
    subgraph "API Gateway"
        D[FastAPI Server<br/>Port 8000]
    end
    subgraph "Message Broker"
        E[Redis<br/>Port 6379]
    end
    subgraph "Processing Layer"
        F[Celery Worker<br/>GPU-Enabled]
        G[OCR Service<br/>PP-StructureV3]
    end
    subgraph "AI Models"
        H[Layout Detection<br/>PP-DocLayout_plus-L]
        I[Text Detection<br/>PP-OCRv5_server_det]
        J[Text Recognition<br/>en_PP-OCRv4_mobile_rec]
        K[Table Recognition<br/>SLANeXt + RT-DETR-L]
    end
    A --> D
    B --> D
    C --> D
    D --> E
    E --> F
    F --> G
    G --> H
    G --> I
    G --> J
    G --> K
    style D fill:#e1f5fe
    style E fill:#fff3e0
    style F fill:#f3e5f5
    style G fill:#e8f5e8
```
Component Responsibilities:
| Component | Role | Technology | Scaling |
|---|---|---|---|
| FastAPI Server | Request handling, task orchestration | FastAPI + Uvicorn | Horizontal |
| Redis Broker | Message queuing, result storage | Redis 7 | Cluster-ready |
| Celery Worker | GPU-intensive OCR processing | Celery + PaddleOCR | Vertical (GPU) |
| OCR Pipeline | Document analysis and text extraction | PP-StructureV3 | Model-parallel |
- Docker + Docker Compose
- NVIDIA GPU with CUDA drivers and NVIDIA Container Toolkit
- Recommended: 4GB+ GPU VRAM, 16GB+ system RAM
```bash
git clone https://github.com/your-username/ocr-inference-gpu.git
cd ocr-inference-gpu
cp .env.example .env
```

`REDIS_URL` is already configured for Docker in `.env.example`:

```bash
REDIS_URL="redis://redis:6379/0"
```

Build and start the services:

```bash
docker-compose up --build
```

Services:
- API: http://localhost:8000
- OpenAPI docs: http://localhost:8000/docs
- Redis broker: localhost:6379

On first run, the Paddle models are downloaded. Watch the worker logs until the OCR service is loaded:

```bash
docker-compose logs -f worker
```

Submit a document:

```bash
curl -X POST "http://localhost:8000/api/v1/ocr/process" \
  -F "file=@dataset/1page.pdf"
```

Copy the returned `task_id`, then:

```bash
curl "http://localhost:8000/api/v1/ocr/results/<task_id>"
```

Repeat the results call until `status` becomes `SUCCESS`.
```mermaid
sequenceDiagram
    participant C as Client
    participant API as FastAPI
    participant R as Redis
    participant W as Worker
    participant OCR as OCR Engine
    C->>API: POST /api/v1/ocr/process
    API->>R: Queue task
    API->>C: 202 {task_id}
    loop Polling
        C->>API: GET /api/v1/ocr/results/{task_id}
        API->>R: Check status
        API->>C: Status response
    end
    R->>W: Dequeue task
    W->>OCR: Process document
    OCR->>W: Return detections
    W->>R: Store results
    C->>API: GET /api/v1/ocr/results/{task_id}
    API->>R: Fetch results
    API->>C: 200 {detections}
```
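The polling loop in this flow can be wrapped in a small helper that gives up after a deadline instead of looping forever. The following is a minimal sketch, not the project's client: `poll_until_done`, its parameters, and the default interval and timeout are assumptions, and only `SUCCESS` and `FAILURE` are treated as terminal because those are the states this README documents.

```python
import time
from typing import Any, Callable, Dict

# Terminal states documented in this README; other states keep the loop going.
TERMINAL_STATES = {"SUCCESS", "FAILURE"}


def poll_until_done(
    fetch_status: Callable[[], Dict[str, Any]],
    interval: float = 2.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict[str, Any]:
    """Call fetch_status() until the task is terminal or the timeout expires."""
    deadline = clock() + timeout
    while True:
        data = fetch_status()
        if data.get("status") in TERMINAL_STATES:
            return data
        if clock() >= deadline:
            raise TimeoutError(
                f"task still {data.get('status')!r} after {timeout} s"
            )
        sleep(interval)


# Example wiring (assumes the `requests` package and a known task_id):
# fetch = lambda: requests.get(
#     f"http://localhost:8000/api/v1/ocr/results/{task_id}", timeout=10
# ).json()
# result = poll_until_done(fetch)
```

Passing `fetch_status`, `sleep`, and `clock` as callables keeps the helper independent of any HTTP library and makes it easy to exercise without a running server.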
```http
POST /api/v1/ocr/process
Content-Type: multipart/form-data

file=<PDF or image>
```

Returns `202 Accepted`:

```json
{
  "task_id": "a0cbcc44-7857-45a9-b6d2-f0cf91b81cce",
  "status": "pending",
  "message": "OCR task queued successfully."
}
```

```http
GET /api/v1/ocr/results/{task_id}
```

Returns task state plus OCR output when done:

```json
{
  "task_id": "string",
  "status": "SUCCESS",
  "result": {
    "detections": [
      {
        "text": "Tesla, Inc.",
        "box": [[x1, y1], [x2, y2], [x3, y3], [x4, y4]],
        "confidence": 0.99,
        "page_number": 1
      }
    ]
  },
  "pending_tasks": 0
}
```

```python
import requests
import time

BASE_URL = "http://localhost:8000"

# 1) Submit document
with open("dataset/1page.pdf", "rb") as f:
    response = requests.post(
        f"{BASE_URL}/api/v1/ocr/process",
        files={"file": ("1page.pdf", f, "application/pdf")},
    )
response.raise_for_status()  # stop early if the upload was rejected
task_id = response.json()["task_id"]

# 2) Poll until complete
while True:
    result = requests.get(f"{BASE_URL}/api/v1/ocr/results/{task_id}")
    data = result.json()
    if data["status"] == "SUCCESS":
        detections = data["result"]["detections"]
        print(f"Extracted {len(detections)} text detections")
        print(detections[:3])  # first few detections
        break
    elif data["status"] == "FAILURE":
        # "result" may be missing or null on failure, so fall back to {}
        error = (data.get("result") or {}).get("error", "Unknown error")
        print(f"Processing failed: {error}")
        break
    time.sleep(2)
```

To render boxes in the correct place, use the same pixel space as OCR:
- `box` uses image pixel coordinates in `[x, y]` order
- Origin is top-left: `(0, 0)` is the top-left of the OCR input image
- `page_number` is 1-based for PDFs
- For PDFs in this service, each page is rendered at `fitz.Matrix(2, 2)` before OCR, so boxes align to the rendered page image (2x scale of PDF points)
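If you need the boxes in PDF points rather than rendered pixels (for example, to highlight text in the original PDF), dividing by the render zoom should undo the 2x scale. This is a sketch under the assumption that pages are unrotated and that the same zoom applies to both axes; `box_to_pdf_points` is a hypothetical helper, not part of this service.

```python
from typing import List, Sequence

Point = Sequence[float]


def box_to_pdf_points(box: Sequence[Point], zoom: float = 2.0) -> List[List[float]]:
    """Convert a detection polygon from rendered-image pixels to PDF points.

    The default zoom of 2.0 mirrors the fitz.Matrix(2, 2) render scale
    described above; pass a different value if the render scale changes.
    """
    if zoom <= 0:
        raise ValueError("zoom must be positive")
    return [[x / zoom, y / zoom] for x, y in box]
```

For instance, a box whose top-left corner is at pixel `(100, 40)` maps to `(50.0, 20.0)` in page points, still measured from the top-left of the page.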
If your UI displays a resized image/canvas, scale bbox coordinates:
```python
scale_x = displayed_width / original_image_width
scale_y = displayed_height / original_image_height
display_x = original_x * scale_x
display_y = original_y * scale_y
```
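Applied to a whole polygon, the scaling formulas look like the sketch below. `scale_box` and its `(width, height)` tuple arguments are illustrative names, not part of the service.

```python
def scale_box(box, original_size, displayed_size):
    """Scale a 4-point polygon from OCR image pixels to display pixels.

    box: [[x1, y1], [x2, y2], [x3, y3], [x4, y4]] in OCR image pixels
    original_size / displayed_size: (width, height) tuples
    """
    original_w, original_h = original_size
    displayed_w, displayed_h = displayed_size
    scale_x = displayed_w / original_w
    scale_y = displayed_h / original_h
    # Keep floats here; round only when drawing, to avoid early truncation.
    return [[x * scale_x, y * scale_y] for x, y in box]
```

Scaling x and y separately keeps the boxes aligned even if the display does not preserve the aspect ratio of the source image.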
Use polygons (not only axis-aligned rectangles) so rotated/skewed text still aligns:
```python
import matplotlib.pyplot as plt
from matplotlib.patches import Polygon
from PIL import Image


def visualize_detections(image_path, detections, page_number=1):
    img = Image.open(image_path).convert("RGB")
    fig, ax = plt.subplots(figsize=(12, 16))
    ax.imshow(img)
    page_detections = [d for d in detections if d["page_number"] == page_number]
    for detection in page_detections:
        points = detection["box"]  # [[x1,y1],[x2,y2],[x3,y3],[x4,y4]]
        # Draw exact OCR polygon
        poly = Polygon(points, closed=True, fill=False, edgecolor="red", linewidth=1.5)
        ax.add_patch(poly)
        # Label near first point
        x0, y0 = points[0]
        ax.text(
            x0,
            max(0, y0 - 3),
            detection["text"][:50],
            color="red",
            fontsize=7,
            backgroundcolor="white",
        )
    ax.set_title(f"Page {page_number} - {len(page_detections)} detections")
    ax.axis("off")
    plt.tight_layout()
    plt.show()
```

- Call OCR and get `detections`
- Render overlay on the exact same source image used for OCR
- Confirm words and polygons line up visually
- If they drift, verify:
  - image was resized after OCR (apply `scale_x`, `scale_y`)
  - wrong PDF page image scale was used
  - coordinates were rounded/truncated too early
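One quick automated check for the scale problems in this list is to look for polygons that fall outside the image you are drawing on. The sketch below is an assumption-laden sanity check rather than a full alignment test: `find_out_of_bounds` and its `tolerance` parameter are hypothetical, and it only catches boxes that are too large for the image (for example, 2x-rendered coordinates drawn on a 1x image), not boxes that are too small.

```python
def find_out_of_bounds(detections, image_width, image_height, tolerance=1.0):
    """Return detections with at least one polygon point outside the image.

    tolerance (in pixels) allows for small rounding overshoot at the edges.
    """
    suspicious = []
    for det in detections:
        for x, y in det["box"]:
            inside = (
                -tolerance <= x <= image_width + tolerance
                and -tolerance <= y <= image_height + tolerance
            )
            if not inside:
                suspicious.append(det)
                break  # one bad point is enough to flag this detection
    return suspicious
```

If many detections are flagged, the overlay image and the OCR input image are probably at different scales.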
The OCR pipeline is configured via src/configs/pipelines/PP-StructureV3.yaml:
```yaml
pipeline_name: PP-StructureV3
batch_size: 4
use_doc_preprocessor: True
use_table_recognition: True
SubModules:
  LayoutDetection:
    model_name: PP-DocLayout_plus-L
    threshold:
      0: 0.3  # Text regions
      1: 0.5  # Titles
      2: 0.4  # Lists
      # ... additional classes
```

| Model | Purpose | Size | Precision |
|---|---|---|---|
| `PP-DocLayout_plus-L` | Layout detection and segmentation | ~200MB | High |
| `PP-OCRv5_server_det` | Text line detection | ~180MB | High |
| `en_PP-OCRv4_mobile_rec` | English text recognition | ~25MB | Mobile-optimized |
| `SLANeXt_wired` | Table structure recognition | ~150MB | Enterprise |
| `RT-DETR-L_*_table_cell_det` | Table cell detection | ~300MB | High precision |
Memory Optimization

```yaml
# Reduce batch size for lower memory usage
batch_size: 2
# Disable unused modules
use_seal_recognition: False
use_formula_recognition: False
use_chart_recognition: False
```

Throughput Optimization

```yaml
# Increase batch size for higher throughput
batch_size: 8
# Worker concurrency
command: celery -A src.tasks.celery_app worker --concurrency=4
```

Health Checks

```bash
# API health
curl http://localhost:8000/health
# Worker status
celery -A src.tasks.celery_app inspect active
# Redis metrics
redis-cli info memory
```

Performance Metrics (RTX 3050 4GB Laptop)
- Throughput: ~20 pages/minute (~3 seconds per page)
- Latency: ~3 seconds per page for 300 DPI input
- Memory: 4GB GPU, ~4-6GB system RAM
Why FastAPI? High-performance async framework with automatic OpenAPI documentation
Why Celery? Proven distributed task queue with robust error handling and retry mechanisms
Why PaddleOCR? State-of-the-art accuracy with production-ready performance and Chinese text support
Why Redis? In-memory performance for task queuing with persistence options

