paperflight

Preflight checks for document pipelines: validate, render, and screen PDFs entirely in memory.

What This Does

Three independent modules, all in-memory, distributed as a single pure-Python wheel:

paperflight.integrity -- check whether a PDF is parseable, encrypted, truncated, or empty.
paperflight.render -- convert a PDF to image bytes (webp, png, jpeg, or raw pixel buffer).
paperflight.blank -- classify an image as blank or containing content using a 4-signal pure-NumPy cascade.

The three modules can be used independently or chained together. Nothing is written to disk, and no system-level binaries need to be installed separately.

Tech Stack

Layer	Technology
PDF parsing	pypdf
PDF rendering	pypdfium2
Image I/O	Pillow
Numerics	NumPy
Build / packaging	hatchling
Tests	pytest

Runtime dependencies: 4. No system-level binaries to install. paperflight itself ships as a pure-Python wheel.

Install

Requires Python 3.10 or newer.

From the GitHub Release (recommended)

Install the wheel directly from the release URL -- no download step:

pip install https://github.com/sherozshaikh/paperflight/releases/download/v1.0.0/paperflight-1.0.0-py3-none-any.whl

Pin the same URL in requirements.txt (or pyproject.toml PEP 508 dependencies) for reproducible installs:

paperflight @ https://github.com/sherozshaikh/paperflight/releases/download/v1.0.0/paperflight-1.0.0-py3-none-any.whl

From a downloaded wheel

Download paperflight-1.0.0-py3-none-any.whl from the Releases page and install it locally:

pip install ./paperflight-1.0.0-py3-none-any.whl

From the git tag (no wheel needed)

pip builds the package from source, using hatchling as the build backend:

pip install git+https://github.com/sherozshaikh/paperflight.git@v1.0.0

From source (development)

git clone https://github.com/sherozshaikh/paperflight.git
cd paperflight
make setup
make install

After installing by any of the methods above, all three modules import the same way:

from paperflight.integrity import check_pdf
from paperflight.render import convert_pdf
from paperflight.blank import detect_blank
import paperflight; print(paperflight.version)

Quick Start

from paperflight.integrity import check_pdf
from paperflight.render import convert_pdf
from paperflight.blank import detect_blank

with open("invoice.pdf", "rb") as f:
    pdf_bytes = f.read()

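integrity = check_pdf(pdf_bytes)  # a bad PDF is reported in the result, not raised
if integrity.is_valid:
    rendered = convert_pdf(pdf_bytes, dpi=200, format="webp")
    for page in rendered.pages:
        if not detect_blank(page.data).is_blank:
            process(page.data)  # process() is a placeholder for your own downstream step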

A runnable, annotated version lives in example.py.

API Contract

1. `paperflight.integrity.check_pdf`

from paperflight.integrity import check_pdf, PDFIntegrityResult, ErrorCode

result = check_pdf(pdf_bytes)

Input

Param	Type	Required	Default	Notes
`pdf_bytes`	`bytes` / `bytearray` / `memoryview`	yes	--	The raw PDF byte buffer.

Output: PDFIntegrityResult (frozen dataclass, all fields immutable)

Field	Type	Description
`is_valid`	`bool`	`True` if the PDF can be passed to `convert_pdf`.
`is_encrypted`	`bool`	`True` if the PDF declares an encryption dictionary.
`needs_password`	`bool`	`True` if encrypted and an empty password did not unlock it.
`page_count`	`int | None`	Page count, or `None` if unreadable.
`error_code`	`ErrorCode`	One of the 8 enum values below.
`reason`	`str`	Human-readable one-line explanation.

ErrorCode values

Value	When
`OK`	Parsed cleanly, has at least one page.
`EMPTY`	Input was 0 bytes.
`NOT_PDF`	Missing `%PDF-` magic header.
`TRUNCATED`	`%%EOF` marker not found in the last 1024 bytes.
`MALFORMED`	pypdf opened the file but the page tree is unreadable.
`ENCRYPTED`	Encrypted, unlocked with an empty password (still usable).
`NEEDS_PASSWORD`	Encrypted, empty password failed.
`NO_PAGES`	Parsed but contains zero pages.
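
The `EMPTY`, `NOT_PDF`, and `TRUNCATED` rows describe byte-level rules that need no parser. The sketch below illustrates those three rules with the standard library only. `precheck` is a hypothetical helper written for this README, not paperflight's implementation, and it assumes the `%PDF-` header sits at offset 0.

```python
from typing import Optional


def precheck(data: bytes) -> Optional[str]:
    """Return the name of the first byte-level rule that fails, or None.

    Hypothetical helper: mirrors only the EMPTY, NOT_PDF and TRUNCATED rows.
    """
    if len(data) == 0:
        return "EMPTY"
    if not data.startswith(b"%PDF-"):  # assumption: header at offset 0
        return "NOT_PDF"
    if b"%%EOF" not in data[-1024:]:  # marker must appear in the last 1024 bytes
        return "TRUNCATED"
    # Passing these checks says nothing about the page tree or encryption;
    # those outcomes still need a real parser.
    return None


print(precheck(b""))                          # EMPTY
print(precheck(b"hello world"))               # NOT_PDF
print(precheck(b"%PDF-1.7\n1 0 obj"))         # TRUNCATED
print(precheck(b"%PDF-1.7\n...\n%%EOF\n"))    # None
```

A `None` here only means the buffer passed the three cheap checks; `MALFORMED`, `ENCRYPTED`, `NEEDS_PASSWORD`, and `NO_PAGES` can only be decided after parsing.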

Exceptions

Exception	When
`TypeError`	`pdf_bytes` is not a bytes-like object.

No other exceptions are raised. Every "bad PDF" outcome is returned as a result, not raised.

2. `paperflight.render.convert_pdf`

from paperflight.render import convert_pdf, ConversionResult, PageImage

result = convert_pdf(
    source,  # bytes-like PDF data, or a str / Path to a file on disk
    dpi=200,
    pages=None,
    format="webp",
    grayscale=False,
)

Input

Param	Type	Default	Notes
`source`	`bytes` / `bytearray` / `memoryview` / `str` / `Path`	--	The PDF, in memory or on disk.
`dpi`	`int`	`200`	Render resolution. 72 = thumbnail, 200 = OCR-quality, 300 = print-quality.
`pages`	`list[int] | None`	`None`	Zero-indexed page numbers to render in the requested order. `None` renders every page.
`format`	`"webp" | "png" | "jpeg" | "raw"`	`"webp"`	Output encoding. `webp` is lossless. `raw` returns an uncompressed pixel buffer.
`grayscale`	`bool`	`False`	If `True`, returns single-channel grayscale images (PIL mode `L`).
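
To choose a `dpi` value it helps to know the pixel size it produces. PDF page sizes are expressed in points (1 point = 1/72 inch), so the pixel dimensions scale by `dpi / 72`. The sketch below works through that arithmetic. `rendered_size` is a hypothetical helper, and the renderer's own rounding may differ by a pixel.

```python
POINTS_PER_INCH = 72  # PDF user space: 1 point = 1/72 inch


def rendered_size(width_pt: float, height_pt: float, dpi: int) -> tuple:
    """Estimate (width, height) in pixels for a page rendered at `dpi`.

    Hypothetical helper for illustration; not part of paperflight.
    """
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    width_px = round(width_pt * dpi / POINTS_PER_INCH)
    height_px = round(height_pt * dpi / POINTS_PER_INCH)
    return width_px, height_px


# US Letter is 612 x 792 points (8.5 x 11 inches).
print(rendered_size(612, 792, 72))    # (612, 792)
print(rendered_size(612, 792, 200))   # (1700, 2200)
print(rendered_size(612, 792, 300))   # (2550, 3300)
```

The 200 DPI result, 1700x2200, is the image size used in the Performance table.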

Output: ConversionResult (frozen dataclass)

Field	Type	Description
`page_count`	`int`	Total page count of the source document, even if `pages` was a subset.
`pages`	`tuple[PageImage, ...]`	Rendered pages in the requested order.

PageImage (frozen dataclass)

Field	Type	Description
`index`	`int`	Original 0-indexed page number in the source document.
`width`	`int`	Pixel width of the rendered image.
`height`	`int`	Pixel height of the rendered image.
`format`	`str`	One of `"webp"`, `"png"`, `"jpeg"`, `"raw"`.
`mode`	`str`	`"RGB"` or `"L"`.
`data`	`bytes`	Encoded image bytes for `webp`/`png`/`jpeg`, or a row-major raw pixel buffer for `"raw"`.
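
With `format="raw"`, `data` has to be interpreted together with `width`, `height`, and `mode`. The sketch below shows one way to read a single pixel from such a buffer. It assumes 8 bits per channel, tightly packed rows with no padding, and rows stored top to bottom; none of these are confirmed by this README, so check them against real output first. `pixel_at` is a hypothetical helper.

```python
CHANNELS = {"RGB": 3, "L": 1}  # bytes per pixel, assuming 8 bits per channel


def pixel_at(data: bytes, width: int, height: int, mode: str, x: int, y: int) -> tuple:
    """Return the channel values of pixel (x, y) from a row-major buffer.

    Hypothetical helper; assumes no row padding.
    """
    channels = CHANNELS[mode]
    expected = width * height * channels
    if len(data) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(data)}")
    if not (0 <= x < width and 0 <= y < height):
        raise IndexError("pixel coordinates out of range")
    offset = (y * width + x) * channels  # row-major: whole rows first, then columns
    return tuple(data[offset:offset + channels])


# A 2x2 RGB image: red, green on the first row; blue, white on the second.
buf = bytes([255, 0, 0,   0, 255, 0,
             0, 0, 255,   255, 255, 255])
print(pixel_at(buf, 2, 2, "RGB", 1, 0))  # (0, 255, 0)
print(pixel_at(buf, 2, 2, "RGB", 0, 1))  # (0, 0, 255)
```

The length check is a cheap way to catch a mismatch between the assumed layout and the actual buffer.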

Exceptions

Exception	When
`EncryptedPDFError`	The PDF is password-protected.
`InvalidPDFError`	The byte buffer cannot be parsed as a PDF, or the path is unreadable.
`PageOutOfRangeError`	`pages` references an index outside the document's range.
`TypeError`	`source` is not a bytes-like object or a path, or `pages` contains a non-int.
`ValueError`	`format` is not one of the four supported values, or `dpi` is non-positive.

The library-specific exceptions above (EncryptedPDFError, InvalidPDFError, PageOutOfRangeError) inherit from PDFRenderError, which can be caught as a single base.

Threading model

Pages are rendered sequentially in the calling thread. The function never spawns processes or threads, so it can be called from sync scripts, async event loops, multi-threaded services, or container workloads without interfering with the caller's own concurrency model. Behaviour is deterministic regardless of the caller's process state.
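
Because the call is synchronous, invoking it directly inside a coroutine keeps the event loop busy until it returns. One option, sketched below, is to hand the call to a single-worker executor so the loop stays responsive while renders still run one at a time. `render_blocking` is a hypothetical stand-in for a synchronous `convert_pdf` call, used so the example runs without paperflight installed.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def render_blocking(pdf_bytes: bytes) -> int:
    """Hypothetical stand-in for a synchronous render call."""
    time.sleep(0.05)  # simulate rendering work
    return len(pdf_bytes)


# One worker: renders never overlap, matching the sequential model above.
executor = ThreadPoolExecutor(max_workers=1)


async def render_async(pdf_bytes: bytes) -> int:
    loop = asyncio.get_running_loop()
    # The event loop keeps serving other tasks while the worker thread renders.
    return await loop.run_in_executor(executor, render_blocking, pdf_bytes)


async def main() -> list:
    # gather() returns results in the order the awaitables were passed in.
    return await asyncio.gather(render_async(b"a" * 10), render_async(b"b" * 20))


results = asyncio.run(main())
executor.shutdown()
print(results)  # [10, 20]
```

A single worker is a conservative choice; whether more workers are safe depends on the renderer, which this sketch does not assume.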

3. `paperflight.blank.detect_blank`

from paperflight.blank import detect_blank, DetectionResult, DetectionConfig

result = detect_blank(image_bytes)

Input

Param	Type	Default	Notes
`image_input`	`bytes` / `bytearray` / `memoryview`	--	Any image format Pillow can decode (PNG, WEBP, JPEG, BMP, TIFF, ...).
`config`	`DetectionConfig`	`DEFAULT_CONFIG`	Optional override of the cascade thresholds.

Output: DetectionResult (frozen dataclass)

Field	Type	Description
`is_blank`	`bool`	Final verdict.
`has_content`	`bool`	Convenience property; `not is_blank`.
`confidence`	`float`	0.00 to 1.00, rounded to 2 decimals.
`signals`	`tuple[DetectionSignal, ...]`	The signals that fired, in cascade order (1 to 4 entries, because the cascade short-circuits).
`reasoning`	`str`	One-line plain-English explanation.

DetectionConfig (frozen dataclass, all defaults calibrated for invoice PDFs at 200 DPI)

Field	Default	Controls
`std_threshold`	`5.0`	Std-dev below this -> blank by Signal 1.
`content_pixel_ratio_threshold`	`0.0001`	Non-background pixel ratio below this -> blank by Signal 2.
`edge_density_threshold`	`2.0`	Mean Sobel magnitude below this -> blank by Signal 3.
`connected_components_threshold`	`5`	Number of dark components below this -> blank by Signal 4.
`background_tolerance`	`10`	Pixels within +/- this value of the dominant background colour count as background.
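
To make the first two thresholds concrete, the sketch below computes both measurements on a flat list of grayscale values (0 to 255) and compares them with the defaults from the table. It uses only the standard library for illustration, whereas the library itself is described as pure NumPy; the function names are hypothetical and the library's exact formulas may differ.

```python
from collections import Counter
from statistics import pstdev

STD_THRESHOLD = 5.0                     # std_threshold default
CONTENT_PIXEL_RATIO_THRESHOLD = 0.0001  # content_pixel_ratio_threshold default
BACKGROUND_TOLERANCE = 10               # background_tolerance default


def low_variance(pixels: list) -> bool:
    """Signal 1 style check: population std-dev below the threshold."""
    return pstdev(pixels) < STD_THRESHOLD


def content_ratio(pixels: list) -> float:
    """Signal 2 style measure: share of pixels far from the dominant value."""
    background = Counter(pixels).most_common(1)[0][0]  # most frequent gray level
    non_background = sum(
        1 for p in pixels if abs(p - background) > BACKGROUND_TOLERANCE
    )
    return non_background / len(pixels)


blank_page = [255] * 10_000                  # pure white
noisy_page = [250] * 5_000 + [255] * 5_000   # scanner noise near white
inked_page = [255] * 9_000 + [0] * 1_000     # 10% black pixels

for name, page in [("blank", blank_page), ("noisy", noisy_page), ("inked", inked_page)]:
    ratio = content_ratio(page)
    print(name, low_variance(page), ratio < CONTENT_PIXEL_RATIO_THRESHOLD)
```

On these inputs both checks vote blank for the white and noisy pages and vote content for the inked page.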

Exceptions

Exception	When
`TypeError`	`image_input` is not bytes-like.
`InvalidImageError`	Pillow cannot decode the byte buffer.

Cascade

The classifier short-circuits at each step. The four signals run in order:

1. Standard deviation of grayscale pixels.
2. Non-background pixel ratio against the dominant colour.
3. Edge density via a vectorised integer Sobel operator.
4. Connected components count via pure-NumPy iterative label propagation. Only fires when signals 1-3 disagree.

99% of pages that are clearly blank or clearly contain content are classified after Signal 1 alone.
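
Iterative label propagation, the technique named for Signal 4, can be sketched in a few lines: every dark pixel starts with a unique label and repeatedly adopts the smallest label among its neighbours until nothing changes. The version below is a standard-library illustration that assumes 4-connectivity and a boolean mask as input; the library's NumPy implementation and its connectivity rule are not shown in this README and may differ.

```python
def count_components(mask: list) -> int:
    """Count 4-connected groups of True cells via iterative label propagation.

    Hypothetical helper for illustration; `mask` is a list of rows of bools.
    """
    if not mask or not mask[0]:
        return 0
    height, width = len(mask), len(mask[0])
    # Unique positive label per foreground cell, 0 for background.
    labels = [
        [(y * width + x + 1) if mask[y][x] else 0 for x in range(width)]
        for y in range(height)
    ]
    changed = True
    while changed:  # repeat sweeps until no label changes
        changed = False
        for y in range(height):
            for x in range(width):
                if labels[y][x] == 0:
                    continue
                best = labels[y][x]
                for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < height and 0 <= nx < width and labels[ny][nx]:
                        best = min(best, labels[ny][nx])
                if best != labels[y][x]:
                    labels[y][x] = best
                    changed = True
    # Each component has converged to its own minimum label.
    return len({label for row in labels for label in row if label})


dark = [
    [True,  False, False, True],
    [True,  False, False, False],
    [False, False, True,  True],
]
components = count_components(dark)
print(components)       # 3
print(components < 5)   # True: below the default connected_components_threshold
```

Convergence can take several sweeps on long, winding shapes, which is one reason this check is reserved for the last step of the cascade.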

End-to-End Example

Same as Quick Start, with annotations:

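from paperflight.integrity import check_pdf
from paperflight.render import convert_pdf
from paperflight.blank import detect_blank

# Read the whole PDF into memory; the calls below all accept bytes.
with open("document.pdf", "rb") as f:
    pdf_bytes = f.read()

# Step 1: validate. A bad PDF comes back as a result object, not an exception.
integrity = check_pdf(pdf_bytes)
if not integrity.is_valid:
    # skip() is a placeholder for your own handling of rejected files.
    skip(reason=integrity.error_code.value)
else:
    # Step 2: render every page to WebP at 200 DPI.
    rendered = convert_pdf(pdf_bytes, dpi=200, format="webp")
    for page in rendered.pages:
        # Step 3: screen each rendered page for content.
        verdict = detect_blank(page.data)
        if verdict.has_content:
            # forward_to_next_stage() is a placeholder for your pipeline's next step.
            forward_to_next_stage(page.data)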

example.py in the repository root contains a runnable version that prints the result objects from each call.

Performance

Measured with pytest -m perf on macOS (Apple Silicon). Linux numbers will typically be similar or slightly faster.

Operation	Workload	p95 latency budget
`check_pdf`	1-page PDF	< 10 ms
`check_pdf`	10-page PDF	< 25 ms
`convert_pdf`	single page @ 200 DPI, webp	< 250 ms
`detect_blank`	1700x2200 PNG, full pipeline	< 150 ms
`detect_blank`	1700x2200 WebP, full pipeline	< 150 ms

The blank detector cascade itself is roughly 1 ms when the image is already in memory; the bulk of the budget above is the image decode.
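
A p95 budget means that 95% of measured calls must finish within the stated time. The sketch below shows one way such a check could be written with the standard library. It is not the repository's test code: `percentile_95` and `measure_ms` are hypothetical helpers, and the timed workload is a trivial stand-in rather than a paperflight call.

```python
import statistics
import time


def percentile_95(samples) -> float:
    """Return the 95th percentile of `samples`.

    quantiles(n=20) yields 19 cut points at 5% steps; the last one is p95.
    """
    return statistics.quantiles(samples, n=20)[-1]


def measure_ms(func, runs: int = 40) -> list:
    """Time `func` repeatedly and return per-call latencies in milliseconds."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


# Stand-in workload; a real check would time check_pdf, convert_pdf or detect_blank.
latencies = measure_ms(lambda: sum(range(1000)))
budget_ms = 150.0
print(percentile_95(latencies) < budget_ms)
```

Using a percentile rather than the mean keeps a single slow outlier from failing the budget while still catching a consistently slow tail.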

Sequential multi-page rendering at 200 DPI / WebP, measured on macOS (Apple Silicon):

Pages	Total time	Per page
5	~1.0 s	~200 ms
10	~1.7 s	~170 ms
20	~3.4 s	~170 ms
100	~17 s	~170 ms

Project Structure

paperflight/
├── src/
│   └── paperflight/
│       ├── __init__.py
│       ├── integrity/
│       │   ├── core.py            # check_pdf
│       │   └── models.py          # PDFIntegrityResult, ErrorCode
│       ├── render/
│       │   ├── core.py            # convert_pdf
│       │   └── models.py          # ConversionResult, PageImage, exceptions
│       └── blank/
│           ├── core.py            # detect_blank, cascade orchestration
│           ├── signals.py         # 4 cascade signals (pure NumPy)
│           ├── config.py          # DetectionConfig + DEFAULT_CONFIG
│           └── models.py          # DetectionResult, DetectionSignal, InvalidImageError
├── tests/
│   ├── conftest.py
│   ├── fixtures.py                # programmatic PDF generators (reportlab)
│   ├── fixtures/external/         # 13 vendored test PDFs (CC-BY-SA-4.0 + MIT)
│   ├── test_integrity.py          # 29 tests
│   ├── test_render.py             # 44 tests (smoke + edge)
│   ├── test_blank.py              # 35 tests
│   ├── test_perf_budgets.py       # 6 latency budget tests
│   └── test_integration.py        # 23 end-to-end tests on vendored fixtures
├── example.py                     # runnable annotated demo
├── Makefile                       # setup, test, format, build, clean targets
├── pyproject.toml                 # hatchling build, requires-python >= 3.10
├── LICENSE
└── README.md

Makefile Targets

make setup              Create venv with uv (run once)
make install            Install package + dev deps in editable mode
make install-runtime    Install runtime deps only

make test               Run the full test suite
make test-smoke         Run smoke tests only
make test-edge          Run edge case tests only
make test-perf          Run perf budget tests only
make test-integration   Run end-to-end tests on vendored fixtures
make test-fast          Run tests in parallel across CPU cores

make format             isort + black + ruff fix + ruff format
make build              Build the wheel and sdist
make verify-wheel       Build and install the wheel into a fresh venv
make clean              Remove caches and build artifacts
make help               Print this list

Tests

make test

137 tests covering smoke, edge cases, performance budgets, and end-to-end pipeline runs against vendored fixtures. The test suite typically runs in under 60 seconds.

The vendored test PDFs in tests/fixtures/external/ come from py-pdf/sample-files (CC-BY-SA-4.0) and ArturT/Test-PDF-Files (MIT). Licence texts and source mapping are in tests/fixtures/external/.

License

MIT License - see LICENSE file for details.

Author

Sheroz Shaikh - GitHub | LinkedIn
