GitHub - fw7th/redact: Document redaction microservice.

redact-logo

🔒 A fast, containerized, and scalable API service for automated OCR and PII redaction on document batches.

Redact is a production-ready microservice built with FastAPI that handles the secure and asynchronous processing of data redaction tasks. Designed to be easily deployed using Docker and scaled via a worker-based architecture (Redis/RQ setup).

Features

RESTful API: Clear endpoints for submitting and retrieving redaction tasks.
Asynchronous Processing: Long-running redaction tasks are handled by a dedicated worker pool, keeping the API fast and responsive.
Persistent Storage: Uses SQLAlchemy for task metadata and Redis for task queuing.
Containerized: Built for easy deployment with a Dockerfile.
Benchmarked: Includes load testing scripts using Locust and benchmarks for performance analysis.

Random image I found on the internet vs. Redacted. PS: If it's your image, I'm sorry 😅

Performance

API Benchmarking Table

Endpoint	Operation	Payload Size	Concurrent Users	Requests/sec	Avg Latency (ms)	P95 Latency (ms)	Error Rate	Notes
`POST /predict`	Create	100KB image	2	0.67	34.95	56	0%	Includes file validation, disk write, Redis, ML model inference
`GET /download/{id}`	GET	N/A	10	3.58	10.38	32	0%	Retrieve processed document from server
`GET /check/{id}`	Read	N/A	10	3.52	9.94	28	0%	Fetches job status from Redis
`DELETE /drop/{id}`	Delete	N/A	10	3.47	10.43	30		Delete a batch and all related files from DB

Legend:

Payload Size: Size of file or JSON sent in the request.

Concurrent Users: Simulated users (e.g., in Locust).

Requests/sec: Throughput under load.

Latency: Time from request to response (P95 = 95th percentile).

Model Inference Benchmark Table

[Models]

[NER: GLiNER (Medium v2.1 {Remember to cite urchade and the GLiNER paper})]
[OCR: Tesseract 5.5.1 w/ py-tesseract]

Model Name	Input Size	Avg Inference Time (ms)	Device	Notes
`GLiNER v2.1`	200 lines of text	790	CPU	Model has a long cold start 20secs-ish
`Tesseract`	512x512 image	0.96	CPU	Tesseract?

Legend:

Inference Time: Time to run prediction (ms).

Throughput: How many predictions/sec the model can handle.

Tests performed in a subprocess.

Quickstart

Clone the repo

git clone https://github.com/fw7th/redact.git
cd redact

Create and activate virtualenv (optional)

python -m venv venv
source venv/bin/activate  # or venv\Scripts\activate on Windows

Install dependencies

pip install -r requirements.txt

Copy example env and configure

cp .env.example .env

Start the app

uvicorn app.main:app --reload

Client request example

curl -X POST http://localhost:8000/predict \
  -F "files=@cat.jpg"

Check job status

curl http://localhost:8000/predict/abc123

Running tests

pytest tests/

Optionally benchmark

./scripts/benchmark

API Documentation

[Link to /docs once deployed]

When the server is running locally, visit:

http://localhost:8000/docs — Swagger UI
http://localhost:8000/redoc — ReDoc

These provide interactive documentation of all available endpoints with live testing.

License

This project is licensed under the MIT License — see the LICENSE file for details.

Citations

GLiNER was pivotal in the development of the PII redaction module.

@misc{stepanov2024glinermultitaskgeneralistlightweight,
      title={GLiNER multi-task: Generalist Lightweight Model for Various Information Extraction Tasks}, 
      author={Ihor Stepanov and Mykhailo Shtopko},
      year={2024},
      eprint={2406.12925},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2406.12925}, 
}

Tesseract, the Open Source OCR engine, performs text extraction in this project.

@Manual{,
  title = {tesseract: Open Source OCR Engine},
  author = {Jeroen Ooms},
  year = {2026},
  note = {R package version 5.2.5},
  url = {https://docs.ropensci.org/tesseract/},
}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Features

Performance

API Benchmarking Table

Model Inference Benchmark Table

Quickstart

Clone the repo

Create and activate virtualenv (optional)

Install dependencies

Copy example env and configure

Start the app

Client request example

Check job status

Running tests

Optionally benchmark

API Documentation

License

Citations

About

Uh oh!

Releases

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 87 Commits
.github		.github
app		app
assets/favicon_io		assets/favicon_io
benchmarks		benchmarks
client		client
redact		redact
scripts		scripts
tests		tests
.env.example		.env.example
.gitignore		.gitignore
Dockerfile		Dockerfile
LICENSE		LICENSE
LOCAL.md		LOCAL.md
PII		PII
README.md		README.md
convert_to_onnx.py		convert_to_onnx.py
convert_to_onnx.sh		convert_to_onnx.sh
modal-requirements.txt		modal-requirements.txt
requirements.txt		requirements.txt
v2		v2

License

fw7th/redact

Folders and files

Latest commit

History

Repository files navigation

Features

Performance

API Benchmarking Table

Model Inference Benchmark Table

Quickstart

Clone the repo

Create and activate virtualenv (optional)

Install dependencies

Copy example env and configure

Start the app

Client request example

Check job status

Running tests

Optionally benchmark

API Documentation

License

Citations

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Languages