paperflight

Preflight checks for document pipelines: validate, render, and screen PDFs entirely in memory.

Python 3.10+ · License: MIT


What This Does

Three independent modules, all in-memory, distributed as a single pure-Python wheel:

  • paperflight.integrity -- check whether a PDF is parseable, encrypted, truncated, or empty.
  • paperflight.render -- convert a PDF to image bytes (webp, png, jpeg, or raw pixel buffer).
  • paperflight.blank -- classify an image as blank or containing content using a 4-signal pure-NumPy cascade.

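The three modules can be used independently or chained together. Nothing is written to disk, and no system binaries need to be installed separately: the compiled parts of the dependencies (pypdfium2, Pillow, NumPy) arrive through their own wheels.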


Tech Stack

| Layer | Technology |
| --- | --- |
| PDF parsing | pypdf |
| PDF rendering | pypdfium2 |
| Image I/O | Pillow |
| Numerics | NumPy |
| Build / packaging | hatchling |
| Tests | pytest |

Runtime dependencies: 4 (pypdf, pypdfium2, Pillow, NumPy). paperflight itself is a pure-Python wheel, and no system binaries need to be installed separately.


Install

Requires Python 3.10 or newer.

From the GitHub Release (recommended)

Install the wheel directly from the release URL -- no download step:

pip install https://github.com/sherozshaikh/paperflight/releases/download/v1.0.0/paperflight-1.0.0-py3-none-any.whl

Pin the same URL in requirements.txt (or pyproject.toml PEP 508 dependencies) for reproducible installs:

paperflight @ https://github.com/sherozshaikh/paperflight/releases/download/v1.0.0/paperflight-1.0.0-py3-none-any.whl

From a downloaded wheel

Download paperflight-1.0.0-py3-none-any.whl from the Releases page and install it locally:

pip install ./paperflight-1.0.0-py3-none-any.whl

From the git tag (no wheel needed)

pip builds the package from source using hatchling:

pip install git+https://github.com/sherozshaikh/paperflight.git@v1.0.0

From source (development)

git clone https://github.com/sherozshaikh/paperflight.git
cd paperflight
make setup
make install

After installing by any of the methods above, all three modules import the same way:

from paperflight.integrity import check_pdf
from paperflight.render import convert_pdf
from paperflight.blank import detect_blank
import paperflight; print(paperflight.version)

Quick Start

from paperflight.integrity import check_pdf
from paperflight.render import convert_pdf
from paperflight.blank import detect_blank

with open("invoice.pdf", "rb") as f:
    pdf_bytes = f.read()

integrity = check_pdf(pdf_bytes)
if integrity.is_valid:
    rendered = convert_pdf(pdf_bytes, dpi=200, format="webp")
    for page in rendered.pages:
        if not detect_blank(page.data).is_blank:
            process(page.data)  # process() is a placeholder for your own downstream step

A runnable, annotated version lives in example.py.


API Contract

1. paperflight.integrity.check_pdf

from paperflight.integrity import check_pdf, PDFIntegrityResult, ErrorCode

result = check_pdf(pdf_bytes)

Input

| Param | Type | Required | Default | Notes |
| --- | --- | --- | --- | --- |
| pdf_bytes | bytes / bytearray / memoryview | yes | -- | The raw PDF byte buffer. |

Output: PDFIntegrityResult (frozen dataclass, all fields immutable)

| Field | Type | Description |
| --- | --- | --- |
| is_valid | bool | True if the PDF can be passed to convert_pdf. |
| is_encrypted | bool | True if the PDF declares an encryption dictionary. |
| needs_password | bool | True if encrypted and an empty password did not unlock it. |
| page_count | int \| None | Page count, or None if unreadable. |
| error_code | ErrorCode | One of the 8 enum values below. |
| reason | str | Human-readable one-line explanation. |

ErrorCode values

| Value | When |
| --- | --- |
| OK | Parsed cleanly, has at least one page. |
| EMPTY | Input was 0 bytes. |
| NOT_PDF | Missing %PDF- magic header. |
| TRUNCATED | %%EOF marker not found in the last 1024 bytes. |
| MALFORMED | pypdf opened the file but the page tree is unreadable. |
| ENCRYPTED | Encrypted, unlocked with an empty password (still usable). |
| NEEDS_PASSWORD | Encrypted, empty password failed. |
| NO_PAGES | Parsed but contains zero pages. |
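The first three failure modes in the table are cheap byte-level checks. The sketch below shows roughly what such checks might look like; it is not paperflight's implementation. The function name `precheck` and the `OK_SO_FAR` label are hypothetical, and the header test assumes the `%PDF-` marker sits at offset 0.

```python
def precheck(buf) -> str:
    """Return a label named after the ErrorCode values above (illustrative only)."""
    data = bytes(buf)  # accepts bytes, bytearray, or memoryview
    if len(data) == 0:
        return "EMPTY"
    # Assumption: the magic header must be the very first bytes of the buffer.
    if not data.startswith(b"%PDF-"):
        return "NOT_PDF"
    # The table above says the %%EOF marker is searched for in the last 1024 bytes.
    if b"%%EOF" not in data[-1024:]:
        return "TRUNCATED"
    # A real checker would now parse the page tree (MALFORMED, NO_PAGES, ...).
    return "OK_SO_FAR"
```

Only after these checks pass would a parser such as pypdf need to be involved, which keeps rejection of obviously bad input cheap.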

Exceptions

| Exception | When |
| --- | --- |
| TypeError | pdf_bytes is not a bytes-like object. |

No other exceptions are raised. Every "bad PDF" outcome is returned as a result, not raised.


2. paperflight.render.convert_pdf

from paperflight.render import convert_pdf, ConversionResult, PageImage

result = convert_pdf(
    pdf_bytes_or_path,
    dpi=200,
    pages=None,
    format="webp",
    grayscale=False,
)

Input

| Param | Type | Default | Notes |
| --- | --- | --- | --- |
| source | bytes / bytearray / memoryview / str / Path | -- | The PDF, in memory or on disk. |
| dpi | int | 200 | Render resolution. 72 = thumbnail, 200 = OCR-quality, 300 = print-quality. |
| pages | list[int] \| None | None | Zero-indexed page numbers to render in the requested order. None renders every page. |
| format | "webp" \| "png" \| "jpeg" \| "raw" | "webp" | Output encoding. webp is lossless. raw returns an uncompressed pixel buffer. |
| grayscale | bool | False | If True, returns single-channel grayscale images (PIL mode L). |
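To get a feel for what the dpi values mean in pixels, the sketch below converts a page size in PDF points (1/72 inch) to pixel dimensions. The helper `pixel_size` is hypothetical and the rounding rule is an assumption; the renderer may round differently by a pixel.

```python
def pixel_size(width_pt: float, height_pt: float, dpi: int) -> tuple[int, int]:
    """Approximate rendered size in pixels for a page measured in PDF points."""
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    scale = dpi / 72  # PDF user-space units default to 1/72 inch
    # Assumption: dimensions are rounded to the nearest whole pixel.
    return round(width_pt * scale), round(height_pt * scale)


# US Letter is 612 x 792 points.
letter_at_200 = pixel_size(612, 792, 200)  # (1700, 2200)
```

A US Letter page at the default 200 DPI comes out at 1700x2200 pixels, which is the image size used in the Performance section.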

Output: ConversionResult (frozen dataclass)

| Field | Type | Description |
| --- | --- | --- |
| page_count | int | Total page count of the source document, even if pages was a subset. |
| pages | tuple[PageImage, ...] | Rendered pages in the requested order. |

PageImage (frozen dataclass)

| Field | Type | Description |
| --- | --- | --- |
| index | int | Original 0-indexed page number in the source document. |
| width | int | Pixel width of the rendered image. |
| height | int | Pixel height of the rendered image. |
| format | str | One of "webp", "png", "jpeg", "raw". |
| mode | str | "RGB" or "L". |
| data | bytes | Encoded image bytes for webp/png/jpeg, or a row-major raw pixel buffer for "raw". |
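For format="raw", the width, height, and mode fields are what you need to interpret data. The sketch below indexes a row-major buffer under the assumption of 8 bits per channel, interleaved channels, and no row padding; those details are not stated above, and the helper names are hypothetical.

```python
CHANNELS = {"RGB": 3, "L": 1}  # bytes per pixel, assuming 8 bits per channel


def expected_raw_length(width: int, height: int, mode: str) -> int:
    """Buffer size in bytes if rows are tightly packed (assumption)."""
    return width * height * CHANNELS[mode]


def raw_pixel(data: bytes, width: int, mode: str, x: int, y: int) -> tuple[int, ...]:
    """Return the channel values of the pixel at column x, row y."""
    n = CHANNELS[mode]
    offset = (y * width + x) * n  # row-major: skip y full rows, then x pixels
    return tuple(data[offset:offset + n])


# A 2x2 RGB image: red, green on the first row; blue, grey on the second.
demo = bytes([255, 0, 0, 0, 255, 0, 0, 0, 255, 9, 9, 9])
```

Checking len(data) against expected_raw_length(width, height, mode) is a quick way to confirm the packing assumption before indexing.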

Exceptions

| Exception | When |
| --- | --- |
| EncryptedPDFError | The PDF is password-protected. |
| InvalidPDFError | The byte buffer cannot be parsed as a PDF, or the path is unreadable. |
| PageOutOfRangeError | pages references an index outside the document's range. |
| TypeError | source is not a bytes-like or path, or pages contains a non-int. |
| ValueError | format is not one of the four supported values, or dpi is non-positive. |

The library-specific exceptions above (EncryptedPDFError, InvalidPDFError, PageOutOfRangeError) inherit from PDFRenderError, which can be caught as a single base.
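The rules for the pages argument (zero-indexed, caller's order preserved, out-of-range and non-int entries rejected) can be sketched as follows. This is an illustration, not the library's code: `resolve_pages` and the stand-in exception class are hypothetical, and rejecting negative indices is an assumption.

```python
class PageOutOfRangeSketch(IndexError):
    """Stand-in for the library's PageOutOfRangeError (hypothetical class)."""


def resolve_pages(pages, page_count: int) -> list[int]:
    """Return the page indices to render, in the order the caller asked for."""
    if pages is None:
        return list(range(page_count))  # None means every page
    resolved = []
    for p in pages:
        if not isinstance(p, int):
            raise TypeError(f"page index must be int, got {type(p).__name__}")
        # Assumption: negative indices are treated as out of range.
        if not 0 <= p < page_count:
            raise PageOutOfRangeSketch(f"page {p} not in 0..{page_count - 1}")
        resolved.append(p)
    return resolved
```

Validating the whole list before rendering anything means a bad index fails fast instead of after several pages have already been rendered.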

Threading model

Pages are rendered sequentially in the calling thread. The function never spawns processes or threads, so it is safe to call from any caller -- sync scripts, async event loops, multi-threaded services, or container workloads. Behaviour is fully deterministic regardless of the caller's process state.
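Because rendering runs in the calling thread, a long render occupies that thread until it finishes. In an asyncio service, one option is to hand the blocking call to a worker thread so the event loop stays responsive. The sketch below uses a stand-in function, `blocking_render`, in place of convert_pdf; it is hypothetical and only simulates the delay.

```python
import asyncio
import time


def blocking_render(pdf_bytes: bytes) -> int:
    """Stand-in for a blocking call such as convert_pdf (hypothetical)."""
    time.sleep(0.05)  # simulate render time
    return len(pdf_bytes)


async def render_off_loop(pdf_bytes: bytes) -> int:
    # asyncio.to_thread runs the function in a worker thread and awaits its result.
    return await asyncio.to_thread(blocking_render, pdf_bytes)


result = asyncio.run(render_off_loop(b"%PDF-1.7"))
```

Swapping blocking_render for the real call keeps the same shape: the coroutine awaits the result while other tasks on the loop continue to run.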


3. paperflight.blank.detect_blank

from paperflight.blank import detect_blank, DetectionResult, DetectionConfig

result = detect_blank(image_bytes)

Input

| Param | Type | Default | Notes |
| --- | --- | --- | --- |
| image_input | bytes / bytearray / memoryview | -- | Any image format Pillow can decode (PNG, WEBP, JPEG, BMP, TIFF, ...). |
| config | DetectionConfig | DEFAULT_CONFIG | Optional override of the cascade thresholds. |

Output: DetectionResult (frozen dataclass)

| Field | Type | Description |
| --- | --- | --- |
| is_blank | bool | Final verdict. |
| has_content | bool | Convenience property; not is_blank. |
| confidence | float | 0.00 to 1.00, rounded to 2 decimals. |
| signals | tuple[DetectionSignal, ...] | Ordered list of signals that fired (1, 2, 3, or 4 entries due to cascade short-circuiting). |
| reasoning | str | One-line plain-English explanation. |

DetectionConfig (frozen dataclass, all defaults calibrated for invoice PDFs at 200 DPI)

| Field | Default | Controls |
| --- | --- | --- |
| std_threshold | 5.0 | Std-dev below this -> blank by Signal 1. |
| content_pixel_ratio_threshold | 0.0001 | Non-background pixel ratio below this -> blank by Signal 2. |
| edge_density_threshold | 2.0 | Mean Sobel magnitude below this -> blank by Signal 3. |
| connected_components_threshold | 5 | Number of dark components below this -> blank by Signal 4. |
| background_tolerance | 10 | Pixels within +/- 10 of the dominant background colour count as background. |
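Since the config is a frozen dataclass, overriding a threshold means building a new instance rather than mutating the default. The sketch below uses a stand-in class, `ConfigSketch`, that mirrors the fields and defaults in the table; it is not the library's DetectionConfig, and whether the real class is meant to be copied with dataclasses.replace is an assumption.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class ConfigSketch:
    """Stand-in mirroring the documented fields and defaults (hypothetical class)."""
    std_threshold: float = 5.0
    content_pixel_ratio_threshold: float = 0.0001
    edge_density_threshold: float = 2.0
    connected_components_threshold: int = 5
    background_tolerance: int = 10


default = ConfigSketch()
# replace() copies every field and overrides only the ones named.
stricter = replace(default, std_threshold=3.0)

try:
    default.std_threshold = 1.0  # frozen dataclasses reject attribute assignment
except FrozenInstanceError:
    mutation_blocked = True
```

With the real class, the resulting object would be passed as the config argument of detect_blank.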

Exceptions

| Exception | When |
| --- | --- |
| TypeError | image_input is not bytes-like. |
| InvalidImageError | Pillow cannot decode the byte buffer. |

Cascade

The classifier short-circuits at each step. The four signals run in order:

  1. Standard deviation of grayscale pixels.
  2. Non-background pixel ratio against the dominant colour.
  3. Edge density via a vectorised integer Sobel operator.
  4. Connected components count via pure-NumPy iterative label propagation. Only fires when signals 1-3 disagree.

99% of clearly blank or clearly content pages are classified after Signal 1 alone.
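The first two signals are simple enough to sketch with the standard library. The code below is an illustration only: the library uses NumPy, and the function names, the flat list of 0-255 grayscale values, and the choice of the most frequent value as the "dominant colour" are all assumptions. The default thresholds are taken from the DetectionConfig table.

```python
from collections import Counter
from statistics import pstdev


def std_signal(pixels: list[int], std_threshold: float = 5.0) -> bool:
    """Signal 1 sketch: True (blank) when grayscale values barely vary."""
    return pstdev(pixels) < std_threshold


def ratio_signal(pixels: list[int], tolerance: int = 10,
                 ratio_threshold: float = 0.0001) -> bool:
    """Signal 2 sketch: True (blank) when almost no pixel differs from the background."""
    # Assumption: the dominant colour is the single most frequent value.
    background, _ = Counter(pixels).most_common(1)[0]
    content = sum(1 for p in pixels if abs(p - background) > tolerance)
    return content / len(pixels) < ratio_threshold


blank_page = [255] * 10_000                 # uniformly white
text_page = [255] * 9_000 + [0] * 1_000     # 10% black ink
```

On these two inputs both signals agree: the white page is flagged blank by each, and the page with 10% ink by neither. How the real cascade combines the signals and when it stops early is not specified above, so that part is left out.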


End-to-End Example

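The same pipeline as Quick Start, extended to handle invalid input and annotated. skip() and forward_to_next_stage() are placeholders for your own code:

from paperflight.integrity import check_pdf
from paperflight.render import convert_pdf
from paperflight.blank import detect_blank

# Read the whole PDF into memory; the pipeline works on bytes from here on.
with open("document.pdf", "rb") as f:
    pdf_bytes = f.read()

# Step 1: validate. Bad PDFs are reported in the result, not raised.
integrity = check_pdf(pdf_bytes)
if not integrity.is_valid:
    # error_code is an ErrorCode member such as TRUNCATED or NEEDS_PASSWORD.
    skip(reason=integrity.error_code.value)
else:
    # Step 2: render every page to lossless WebP at OCR-quality resolution.
    rendered = convert_pdf(pdf_bytes, dpi=200, format="webp")
    for page in rendered.pages:
        # Step 3: screen out blank pages before the next stage.
        verdict = detect_blank(page.data)
        if verdict.has_content:
            forward_to_next_stage(page.data)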

example.py in the repository root contains a runnable version that prints the result objects from each call.


Performance

Measured with pytest -m perf on macOS (Apple Silicon). Linux numbers will typically be similar or slightly faster.

| Operation | Workload | p95 latency budget |
| --- | --- | --- |
| check_pdf | 1-page PDF | < 10 ms |
| check_pdf | 10-page PDF | < 25 ms |
| convert_pdf | single page @ 200 DPI, webp | < 250 ms |
| detect_blank | 1700x2200 PNG, full pipeline | < 150 ms |
| detect_blank | 1700x2200 WebP, full pipeline | < 150 ms |

The blank detector cascade itself is roughly 1 ms when the image is already in memory; the bulk of the budget above is the image decode.

Sequential multi-page rendering at 200 DPI / WebP, measured on macOS (Apple Silicon):

| Pages | Total time | Per page |
| --- | --- | --- |
| 5 | ~1.0 s | ~200 ms |
| 10 | ~1.7 s | ~170 ms |
| 20 | ~3.4 s | ~170 ms |
| 100 | ~17 s | ~170 ms |
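The per-page column is the total time divided by the page count, and it settles at about 170 ms once the document has ten or more pages. The sketch below reproduces that arithmetic and uses it for a rough estimate; `estimate_seconds` is a hypothetical helper, and the figure only applies to hardware comparable to the machine measured above.

```python
# Total render times in seconds, keyed by page count (from the table above).
timings = {5: 1.0, 10: 1.7, 20: 3.4, 100: 17.0}

# Per-page latency in milliseconds, rounded to the nearest millisecond.
per_page_ms = {pages: round(total / pages * 1000) for pages, total in timings.items()}


def estimate_seconds(pages: int, ms_per_page: int = 170) -> float:
    """Rough sequential render time, assuming the steady-state per-page cost."""
    return pages * ms_per_page / 1000
```

For example, estimate_seconds(50) gives 8.5 seconds for a 50-page document rendered sequentially.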

Project Structure

paperflight/
├── src/
│   └── paperflight/
│       ├── __init__.py
│       ├── integrity/
│       │   ├── core.py            # check_pdf
│       │   └── models.py          # PDFIntegrityResult, ErrorCode
│       ├── render/
│       │   ├── core.py            # convert_pdf
│       │   └── models.py          # ConversionResult, PageImage, exceptions
│       └── blank/
│           ├── core.py            # detect_blank, cascade orchestration
│           ├── signals.py         # 4 cascade signals (pure NumPy)
│           ├── config.py          # DetectionConfig + DEFAULT_CONFIG
│           └── models.py          # DetectionResult, DetectionSignal, InvalidImageError
├── tests/
│   ├── conftest.py
│   ├── fixtures.py                # programmatic PDF generators (reportlab)
│   ├── fixtures/external/         # 13 vendored test PDFs (CC-BY-SA-4.0 + MIT)
│   ├── test_integrity.py          # 29 tests
│   ├── test_render.py             # 44 tests (smoke + edge)
│   ├── test_blank.py              # 35 tests
│   ├── test_perf_budgets.py       # 6 latency budget tests
│   └── test_integration.py        # 23 end-to-end tests on vendored fixtures
├── example.py                     # runnable annotated demo
├── Makefile                       # setup, test, format, build, clean targets
├── pyproject.toml                 # hatchling build, requires-python >= 3.10
├── LICENSE
└── README.md

Makefile Targets

make setup              Create venv with uv (run once)
make install            Install package + dev deps in editable mode
make install-runtime    Install runtime deps only

make test               Run the full test suite
make test-smoke         Run smoke tests only
make test-edge          Run edge case tests only
make test-perf          Run perf budget tests only
make test-integration   Run end-to-end tests on vendored fixtures
make test-fast          Run tests in parallel across CPU cores

make format             isort + black + ruff fix + ruff format
make build              Build the wheel and sdist
make verify-wheel       Build and install the wheel into a fresh venv
make clean              Remove caches and build artifacts
make help               Print this list

Tests

make test

137 tests cover smoke, edge cases, performance budgets, and end-to-end pipeline runs against vendored fixtures. The test suite typically runs in under 60 seconds.

The vendored test PDFs in tests/fixtures/external/ come from py-pdf/sample-files (CC-BY-SA-4.0) and ArturT/Test-PDF-Files (MIT). Licence texts and source mapping are in tests/fixtures/external/.


License

MIT License - see LICENSE file for details.


Author

Sheroz Shaikh - GitHub | LinkedIn
