Preflight checks for document pipelines: validate, render, and screen PDFs entirely in memory.
Three independent modules, all in-memory, distributed as a single pure-Python wheel:
- paperflight.integrity -- check whether a PDF is parseable, encrypted, truncated, or empty.
- paperflight.render -- convert a PDF to image bytes (webp, png, jpeg, or a raw pixel buffer).
- paperflight.blank -- classify an image as blank or containing content using a 4-signal pure-NumPy cascade.
The three modules can be used independently or chained together. Nothing is written to disk and no native binaries are required.
| Layer | Technology |
|---|---|
| PDF parsing | pypdf |
| PDF rendering | pypdfium2 |
| Image I/O | Pillow |
| Numerics | NumPy |
| Build / packaging | hatchling |
| Tests | pytest |
Runtime dependencies: 4. No native binaries. Pure-Python wheel.
Requires Python 3.10 or newer.
Install the wheel directly from the release URL -- no download step:

    pip install https://github.com/sherozshaikh/paperflight/releases/download/v1.0.0/paperflight-1.0.0-py3-none-any.whl

Pin the same URL in requirements.txt (or pyproject.toml PEP 508 dependencies) for reproducible installs:

    paperflight @ https://github.com/sherozshaikh/paperflight/releases/download/v1.0.0/paperflight-1.0.0-py3-none-any.whl

Download paperflight-1.0.0-py3-none-any.whl from the Releases page and install it locally:

    pip install ./paperflight-1.0.0-py3-none-any.whl

Install from the Git tag; pip will build the package from source using hatchling:

    pip install git+https://github.com/sherozshaikh/paperflight.git@v1.0.0

For a development checkout, clone the repository and use the Makefile:

    git clone https://github.com/sherozshaikh/paperflight.git
    cd paperflight
    make setup
    make install

After installing by any of the methods above, all three modules import the same way:

    from paperflight.integrity import check_pdf
    from paperflight.render import convert_pdf
    from paperflight.blank import detect_blank

    import paperflight; print(paperflight.version)

Quick Start

    from paperflight.integrity import check_pdf
    from paperflight.render import convert_pdf
    from paperflight.blank import detect_blank

    with open("invoice.pdf", "rb") as f:
        pdf_bytes = f.read()

    integrity = check_pdf(pdf_bytes)
    if integrity.is_valid:
        rendered = convert_pdf(pdf_bytes, dpi=200, format="webp")
        for page in rendered.pages:
            if not detect_blank(page.data).is_blank:
                process(page.data)  # process() stands for your own downstream handler

A runnable, annotated version lives in example.py.

paperflight.integrity

    from paperflight.integrity import check_pdf, PDFIntegrityResult, ErrorCode

    result = check_pdf(pdf_bytes)

Input
| Param | Type | Required | Default | Notes |
|---|---|---|---|---|
| pdf_bytes | bytes / bytearray / memoryview | yes | -- | The raw PDF byte buffer. |
Output: PDFIntegrityResult (frozen dataclass, all fields immutable)
| Field | Type | Description |
|---|---|---|
| is_valid | bool | True if the PDF can be passed to convert_pdf. |
| is_encrypted | bool | True if the PDF declares an encryption dictionary. |
| needs_password | bool | True if encrypted and an empty password did not unlock it. |
| page_count | int \| None | Page count, or None if unreadable. |
| error_code | ErrorCode | One of the 8 enum values below. |
| reason | str | Human-readable one-line explanation. |
ErrorCode values
| Value | When |
|---|---|
| OK | Parsed cleanly, has at least one page. |
| EMPTY | Input was 0 bytes. |
| NOT_PDF | Missing %PDF- magic header. |
| TRUNCATED | %%EOF marker not found in the last 1024 bytes. |
| MALFORMED | pypdf opened the file but the page tree is unreadable. |
| ENCRYPTED | Encrypted, unlocked with an empty password (still usable). |
| NEEDS_PASSWORD | Encrypted, empty password failed. |
| NO_PAGES | Parsed but contains zero pages. |
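
Three of the outcomes above (EMPTY, NOT_PDF, TRUNCATED) depend only on the raw bytes, so they can be illustrated without a PDF parser. The following is a stdlib-only sketch under stated assumptions, not paperflight's actual code: `Precheck` and `precheck` are hypothetical names, and the sketch assumes the `%PDF-` header must sit at offset 0, which the table does not specify.

```python
from enum import Enum


class Precheck(Enum):
    """Hypothetical subset of outcomes that need no PDF parser."""
    OK = "ok"
    EMPTY = "empty"
    NOT_PDF = "not_pdf"
    TRUNCATED = "truncated"


def precheck(pdf_bytes) -> Precheck:
    """Apply the byte-level rules from the ErrorCode table, in order."""
    if not isinstance(pdf_bytes, (bytes, bytearray, memoryview)):
        raise TypeError("pdf_bytes must be a bytes-like object")
    data = bytes(pdf_bytes)
    if len(data) == 0:
        return Precheck.EMPTY
    # Assumption: the magic header is expected at offset 0.
    if not data.startswith(b"%PDF-"):
        return Precheck.NOT_PDF
    # The end-of-file marker must appear somewhere in the last 1024 bytes.
    if b"%%EOF" not in data[-1024:]:
        return Precheck.TRUNCATED
    return Precheck.OK
```

A buffer that passes all three checks can still be MALFORMED, ENCRYPTED, NEEDS_PASSWORD, or NO_PAGES; those outcomes require actually parsing the document.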
Exceptions
| Exception | When |
|---|---|
| TypeError | pdf_bytes is not a bytes-like object. |
No other exceptions are raised. Every "bad PDF" outcome is returned as a result, not raised.
paperflight.render

    from paperflight.render import convert_pdf, ConversionResult, PageImage

    result = convert_pdf(
        pdf_bytes_or_path,
        dpi=200,
        pages=None,
        format="webp",
        grayscale=False,
    )

Input
| Param | Type | Default | Notes |
|---|---|---|---|
| source | bytes / bytearray / memoryview / str / Path | -- | The PDF, in memory or on disk. |
| dpi | int | 200 | Render resolution. 72 = thumbnail, 200 = OCR-quality, 300 = print-quality. |
| pages | list[int] \| None | None | Zero-indexed page numbers to render in the requested order. None renders every page. |
| format | "webp" \| "png" \| "jpeg" \| "raw" | "webp" | Output encoding. webp is lossless. raw returns an uncompressed pixel buffer. |
| grayscale | bool | False | If True, returns single-channel grayscale images (PIL mode L). |
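
PDF page sizes are expressed in points (1/72 inch), so the rendered pixel size scales with dpi / 72. The sketch below shows that arithmetic; `pixel_size` is a hypothetical helper, and rounding to the nearest pixel is an assumption (the renderer may round differently by a pixel).

```python
def pixel_size(width_pt: float, height_pt: float, dpi: int) -> tuple[int, int]:
    """Estimate rendered pixel dimensions for a page measured in points."""
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    scale = dpi / 72.0  # 72 points per inch
    # Assumption: round to the nearest whole pixel.
    return round(width_pt * scale), round(height_pt * scale)


# US Letter is 612 x 792 points.
print(pixel_size(612, 792, 200))  # (1700, 2200)
```

At 200 DPI a US Letter page comes out at 1700x2200 pixels, which matches the image size used in the performance table.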
Output: ConversionResult (frozen dataclass)
| Field | Type | Description |
|---|---|---|
| page_count | int | Total page count of the source document, even if pages was a subset. |
| pages | tuple[PageImage, ...] | Rendered pages in the requested order. |
PageImage (frozen dataclass)
| Field | Type | Description |
|---|---|---|
| index | int | Original 0-indexed page number in the source document. |
| width | int | Pixel width of the rendered image. |
| height | int | Pixel height of the rendered image. |
| format | str | One of "webp", "png", "jpeg", "raw". |
| mode | str | "RGB" or "L". |
| data | bytes | Encoded image bytes for webp/png/jpeg, or a row-major raw pixel buffer for "raw". |
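
For format="raw", the width, height, and mode fields are what make the buffer interpretable. The sketch below shows how a pixel could be located in a row-major buffer. It assumes 8 bits per channel, interleaved channels, and no row padding; none of these is confirmed by the table above, and `pixel_at` is a hypothetical helper.

```python
def pixel_at(data: bytes, width: int, height: int, mode: str,
             x: int, y: int) -> tuple[int, ...]:
    """Return the channel values of pixel (x, y) from a row-major buffer."""
    channels = {"RGB": 3, "L": 1}[mode]
    # Assumption: tightly packed rows, one byte per channel.
    if len(data) != width * height * channels:
        raise ValueError("buffer size does not match width * height * channels")
    if not (0 <= x < width and 0 <= y < height):
        raise IndexError("pixel coordinates out of range")
    # Row-major: skip y full rows, then x pixels within the row.
    offset = (y * width + x) * channels
    return tuple(data[offset:offset + channels])
```

Under the same assumptions, a 1700x2200 RGB page occupies 1700 * 2200 * 3 = 11,220,000 bytes, which is worth keeping in mind when holding many raw pages in memory.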
Exceptions
| Exception | When |
|---|---|
| EncryptedPDFError | The PDF is password-protected. |
| InvalidPDFError | The byte buffer cannot be parsed as a PDF, or the path is unreadable. |
| PageOutOfRangeError | pages references an index outside the document's range. |
| TypeError | source is not a bytes-like object or a path, or pages contains a non-int. |
| ValueError | format is not one of the four supported values, or dpi is non-positive. |
The library-specific exceptions listed above (EncryptedPDFError, InvalidPDFError, PageOutOfRangeError) inherit from PDFRenderError, which can be caught as a single base.
Threading model
Pages are rendered sequentially in the calling thread. The function never spawns processes or threads, so it is safe to call from any caller -- sync scripts, async event loops, multi-threaded services, or container workloads. Behaviour is fully deterministic regardless of the caller's process state.
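
Because rendering happens in the calling thread, the call blocks until every requested page is done. An async caller that wants to keep its event loop responsive could hand the call to a worker thread. The sketch below shows that pattern with `asyncio.to_thread`; `render_blocking` is a stand-in that only sleeps, not paperflight's convert_pdf, and whether several renders may safely run in parallel threads depends on the underlying renderer and is not established here.

```python
import asyncio
import time


def render_blocking(pdf_bytes: bytes) -> int:
    """Stand-in for a blocking render call; NOT paperflight's convert_pdf."""
    time.sleep(0.05)  # simulate rendering work
    return len(pdf_bytes)


async def render_async(pdf_bytes: bytes) -> int:
    # Run the blocking call in a worker thread so the event loop
    # can keep serving other tasks while it waits.
    return await asyncio.to_thread(render_blocking, pdf_bytes)


if __name__ == "__main__":
    print(asyncio.run(render_async(b"abc")))
```

In real code, the stand-in would be replaced by a call such as `convert_pdf(pdf_bytes, dpi=200)`, awaited one document at a time unless the renderer is known to tolerate concurrent use.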
paperflight.blank

    from paperflight.blank import detect_blank, DetectionResult, DetectionConfig

    result = detect_blank(image_bytes)

Input
| Param | Type | Default | Notes |
|---|---|---|---|
| image_input | bytes / bytearray / memoryview | -- | Any image format Pillow can decode (PNG, WEBP, JPEG, BMP, TIFF, ...). |
| config | DetectionConfig | DEFAULT_CONFIG | Optional override of the cascade thresholds. |
Output: DetectionResult (frozen dataclass)
| Field | Type | Description |
|---|---|---|
| is_blank | bool | Final verdict. |
| has_content | bool | Convenience property; not is_blank. |
| confidence | float | 0.00 to 1.00, rounded to 2 decimals. |
| signals | tuple[DetectionSignal, ...] | Ordered list of signals that fired (1, 2, 3, or 4 entries due to cascade short-circuiting). |
| reasoning | str | One-line plain-English explanation. |
DetectionConfig (frozen dataclass, all defaults calibrated for invoice PDFs at 200 DPI)
| Field | Default | Controls |
|---|---|---|
| std_threshold | 5.0 | Std-dev below this -> blank by Signal 1. |
| content_pixel_ratio_threshold | 0.0001 | Non-background pixel ratio below this -> blank by Signal 2. |
| edge_density_threshold | 2.0 | Mean Sobel magnitude below this -> blank by Signal 3. |
| connected_components_threshold | 5 | Number of dark components below this -> blank by Signal 4. |
| background_tolerance | 10 | Pixels within +/- 10 of the dominant background colour count as background. |
Exceptions
| Exception | When |
|---|---|
| TypeError | image_input is not bytes-like. |
| InvalidImageError | Pillow cannot decode the byte buffer. |
Cascade
The classifier short-circuits at each step. The four signals run in order:
- Standard deviation of grayscale pixels.
- Non-background pixel ratio against the dominant colour.
- Edge density via a vectorised integer Sobel operator.
- Connected components count via pure-NumPy iterative label propagation. Only fires when signals 1-3 disagree.
About 99% of pages that are clearly blank or clearly contain content are classified after Signal 1 alone.
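
The short-circuit idea can be sketched with the first two signals only. This is a simplified, dependency-free illustration, not the library's implementation: it uses plain Python lists instead of NumPy, omits Signals 3 and 4, and the rule for when a signal declares "content" is an assumption. `classify` is a hypothetical name; the default thresholds mirror the DetectionConfig table.

```python
from collections import Counter
from statistics import pstdev


def classify(pixels: list[int],
             std_threshold: float = 5.0,
             ratio_threshold: float = 0.0001,
             background_tolerance: int = 10) -> tuple[bool, str]:
    """Return (is_blank, reason) for a flat list of 0-255 grayscale values."""
    if not pixels:
        raise ValueError("pixels must not be empty")
    # Signal 1: a near-uniform page has a very small standard deviation.
    if pstdev(pixels) < std_threshold:
        return True, "signal 1: low standard deviation"
    # Signal 2: share of pixels that differ from the dominant (background) value.
    background = Counter(pixels).most_common(1)[0][0]
    content = sum(1 for p in pixels if abs(p - background) > background_tolerance)
    if content / len(pixels) < ratio_threshold:
        return True, "signal 2: too few non-background pixels"
    # Simplification: the real cascade would continue with Signals 3 and 4.
    return False, "signals 1-2 found content"
```

A uniformly white page returns at Signal 1 without ever computing the background ratio, which is the reason the cheap check comes first.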
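Same as Quick Start, with annotations:

    from paperflight.integrity import check_pdf
    from paperflight.render import convert_pdf
    from paperflight.blank import detect_blank

    # Read the whole PDF into memory; nothing is written back to disk.
    with open("document.pdf", "rb") as f:
        pdf_bytes = f.read()

    # Step 1: structural check. Bad PDFs are reported in the result, not raised.
    integrity = check_pdf(pdf_bytes)
    if not integrity.is_valid:
        skip(reason=integrity.error_code.value)  # skip() stands for your own handler
    else:
        # Step 2: render every page to WebP at 200 DPI.
        rendered = convert_pdf(pdf_bytes, dpi=200, format="webp")
        for page in rendered.pages:
            # Step 3: forward only pages that contain content.
            verdict = detect_blank(page.data)
            if verdict.has_content:
                forward_to_next_stage(page.data)  # stands for your own handler

example.py in the repository root contains a runnable version that prints the result objects from each call.
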
Measured with pytest -m perf on macOS (Apple Silicon). Linux numbers will typically be similar or slightly faster.
| Operation | Workload | p95 latency budget |
|---|---|---|
| check_pdf | 1-page PDF | < 10 ms |
| check_pdf | 10-page PDF | < 25 ms |
| convert_pdf | single page @ 200 DPI, webp | < 250 ms |
| detect_blank | 1700x2200 PNG, full pipeline | < 150 ms |
| detect_blank | 1700x2200 WebP, full pipeline | < 150 ms |
The blank detector cascade itself is roughly 1 ms when the image is already in memory; the bulk of the budget above is the image decode.
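
A p95 budget means that 95% of measured calls must finish within the stated time. How the repository's perf tests compute this is not shown here; the sketch below is one plausible way using the nearest-rank percentile, with hypothetical helper names `p95` and `measure_ms`.

```python
import time


def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile of a list of timings."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # 1-based rank = ceil(0.95 * n), computed with integers to avoid float error.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def measure_ms(fn, runs: int = 50) -> list[float]:
    """Call fn() repeatedly and return per-call wall-clock times in milliseconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append((time.perf_counter() - start) * 1000.0)
    return timings
```

A budget check would then read something like `assert p95(measure_ms(lambda: check_pdf(pdf_bytes))) < 10`, with the workload and limit taken from the table above.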
Sequential multi-page rendering at 200 DPI / WebP, measured on macOS (Apple Silicon):
| Pages | Total time | Per page |
|---|---|---|
| 5 | ~1.0 s | ~200 ms |
| 10 | ~1.7 s | ~170 ms |
| 20 | ~3.4 s | ~170 ms |
| 100 | ~17 s | ~170 ms |

    paperflight/
    ├── src/
    │   └── paperflight/
    │       ├── __init__.py
    │       ├── integrity/
    │       │   ├── core.py          # check_pdf
    │       │   └── models.py        # PDFIntegrityResult, ErrorCode
    │       ├── render/
    │       │   ├── core.py          # convert_pdf
    │       │   └── models.py        # ConversionResult, PageImage, exceptions
    │       └── blank/
    │           ├── core.py          # detect_blank, cascade orchestration
    │           ├── signals.py       # 4 cascade signals (pure NumPy)
    │           ├── config.py        # DetectionConfig + DEFAULT_CONFIG
    │           └── models.py        # DetectionResult, DetectionSignal, InvalidImageError
    ├── tests/
    │   ├── conftest.py
    │   ├── fixtures.py              # programmatic PDF generators (reportlab)
    │   ├── fixtures/external/       # 13 vendored test PDFs (CC-BY-SA-4.0 + MIT)
    │   ├── test_integrity.py        # 29 tests
    │   ├── test_render.py           # 44 tests (smoke + edge)
    │   ├── test_blank.py            # 35 tests
    │   ├── test_perf_budgets.py     # 6 latency budget tests
    │   └── test_integration.py      # 23 end-to-end tests on vendored fixtures
    ├── example.py                   # runnable annotated demo
    ├── Makefile                     # setup, test, format, build, clean targets
    ├── pyproject.toml               # hatchling build, requires-python >= 3.10
    ├── LICENSE
    └── README.md

Makefile targets:

    make setup              Create venv with uv (run once)
    make install            Install package + dev deps in editable mode
    make install-runtime    Install runtime deps only
    make test               Run the full test suite
    make test-smoke         Run smoke tests only
    make test-edge          Run edge case tests only
    make test-perf          Run perf budget tests only
    make test-integration   Run end-to-end tests on vendored fixtures
    make test-fast          Run tests in parallel across CPU cores
    make format             isort + black + ruff fix + ruff format
    make build              Build the wheel and sdist
    make verify-wheel       Build and install the wheel into a fresh venv
    make clean              Remove caches and build artifacts
    make help               Print this list

Run the full suite with make test: 137 tests covering smoke, edge cases, performance budgets, and end-to-end pipeline runs against vendored fixtures. The suite typically runs in under 60 seconds.
The vendored test PDFs in tests/fixtures/external/ come from py-pdf/sample-files (CC-BY-SA-4.0) and ArturT/Test-PDF-Files (MIT). Licence texts and source mapping are in tests/fixtures/external/.
MIT License - see LICENSE file for details.