OCR-FORGE

A small web page that converts a non-searchable PDF into a searchable one.

OCR·FORGE is a fully self-contained HTML file that you open directly in your browser. No setup. No server. No uploads.

What it does

OCR·FORGE turns scanned or image-based PDFs into searchable PDFs with selectable text layered invisibly beneath the page image.

How the pipeline works

PDF.js renders each page into a high-resolution <canvas> at 2× or 3× depending on the quality setting.
Tesseract.js runs OCR on that canvas and returns every word with its exact bounding box.
jsPDF builds the output PDF, and the drawing order matters:
- first, it writes the OCR text in white so it stays invisible,
- then it places the page image on top.
The text stays underneath the image, but it remains searchable and selectable in a PDF viewer.
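
The drawing order above can be sketched as a small function. This is a minimal sketch under stated assumptions, not the project's actual code: `addSearchablePage` and its parameter names are hypothetical, `doc` is assumed to be a jsPDF document created with `pt` units (only `setTextColor`, `setFontSize`, `text`, and `addImage` are called on it), and each item in `words` is assumed to carry its text plus a position already converted to PDF points.

```javascript
// Hypothetical helper: writes one page's OCR text, then covers it with the scan.
// `doc` is assumed to be a jsPDF document using 'pt' units; `words` items are
// assumed to look like { text, x, y, height }, measured in PDF points from the
// top-left corner of the page.
function addSearchablePage(doc, imageData, pageWidthPt, pageHeightPt, words) {
  // 1. Text first, in white, so it ends up beneath the image.
  doc.setTextColor(255, 255, 255);
  for (const word of words) {
    if (!word.text.trim()) continue; // skip empty OCR tokens
    // Approximate the font size from the box height.
    doc.setFontSize(word.height);
    // jsPDF positions text by its baseline; the bottom of the box is used
    // here as a rough baseline.
    doc.text(word.text, word.x, word.y + word.height);
  }
  // 2. Page image last, so it is drawn over the text and hides it.
  doc.addImage(imageData, 'JPEG', 0, 0, pageWidthPt, pageHeightPt);
}
```

Taking `doc` as a parameter keeps the ordering logic separate from rendering and OCR, so it can be exercised with a stand-in object that simply records the calls it receives.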

Features

9 available languages, including Spanish, English, Portuguese, and French
3 render quality levels
Real-time per-page log with confidence percentages
Thumbnail previews
100% local processing
Nothing is uploaded to any server
Word-level bounding boxes are scaled correctly from canvas pixels to PDF points
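
The pixel-to-point scaling in the last item can be illustrated with a short helper. This is only a sketch: `bboxToPdfPoints` is a hypothetical name, the `{ x0, y0, x1, y1 }` layout is the corner format Tesseract.js reports for words, and the code assumes the canvas was rendered at `renderScale` times the page size in points (a PDF.js viewport at scale 1 measures the page in points).

```javascript
// Converts a word box from canvas pixels to PDF points.
// Assumption: the canvas is `renderScale` times larger than the page measured
// in points, so dividing by the scale undoes the enlargement.
function bboxToPdfPoints(bbox, renderScale) {
  if (!(renderScale > 0)) {
    throw new RangeError('renderScale must be a positive number');
  }
  return {
    x: bbox.x0 / renderScale,
    y: bbox.y0 / renderScale,
    width: (bbox.x1 - bbox.x0) / renderScale,
    height: (bbox.y1 - bbox.y0) / renderScale,
  };
}

// A word found at pixels (200, 300)-(500, 360) on a 2x canvas:
const box = bboxToPdfPoints({ x0: 200, y0: 300, x1: 500, y1: 360 }, 2);
console.log(box); // { x: 100, y: 150, width: 150, height: 30 }
```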

Tech stack

HTML5
CSS3
JavaScript
PDF.js
Tesseract.js
jsPDF

Privacy

Everything runs locally in the browser. Your files stay on your device during the whole process.

AI assistance

This project was created with help from Claude, using the Sonnet 4.6 Adaptive model with Tool Access set to Always Available.

The following prompts were used (translated from the original Spanish):

Next, generate the HTML5, CSS3, and JavaScript code.
For a web page that uses Tesseract.js and PDF.js to convert any PDF the user uploads into a selectable PDF
Continue

Use

Open the HTML file in your browser.
Select a PDF from your device (it is read locally, not sent anywhere).
Choose the language and quality level.
Start OCR processing.
Download the searchable PDF.

Output behavior

The generated PDF keeps the original page appearance while adding a hidden text layer for search and selection support.

Notes

Best results come from clean scans and high-quality source PDFs.
Multi-page documents are processed page by page.
OCR confidence is shown in the live log so you can track recognition quality as it runs.
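
As an illustration of the per-page log, a line could be formatted along these lines. This is a sketch only: `formatPageLog` and the message layout are hypothetical, and `confidence` is assumed to be on the 0 to 100 scale that Tesseract.js reports.

```javascript
// Hypothetical log formatter; the message layout is an assumption, not the
// project's actual output. `confidence` is assumed to be a 0-100 value.
function formatPageLog(pageNumber, pageCount, confidence) {
  // Clamp to the valid range, then round to a whole percentage.
  const pct = Math.round(Math.min(100, Math.max(0, confidence)));
  return `Page ${pageNumber}/${pageCount}: ${pct}% confidence`;
}

console.log(formatPageLog(3, 12, 87.4)); // Page 3/12: 87% confidence
```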
