Scripts for doing OCR (text recognition) on the Cherokee Phoenix

This repository contains CLI scripts and a full Flask web application for processing and viewing OCR results of Cherokee document images.

System Dependencies

Before running the scripts or server, ensure you have the following system packages installed:

- imagemagick
- tesseract

Also make sure the Cherokee training data file chr.traineddata (available from the Tesseract OCR tessdata repository) is placed in your system's tessdata directory.
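As a quick sanity check of the requirements above, the following is a minimal sketch that looks for the two tools on PATH and parses the output of `tesseract --list-langs`. The function names are hypothetical and not part of this repository, and the executable name `convert` for ImageMagick is an assumption (newer ImageMagick releases ship `magick` instead).

```python
import shutil
import subprocess


def missing_tools(tools=("convert", "tesseract")):
    """Return the names from `tools` that are not found on PATH.

    "convert" is assumed to be the ImageMagick executable; adjust as needed.
    """
    return [name for name in tools if shutil.which(name) is None]


def parse_langs(list_langs_output):
    """Extract language codes from `tesseract --list-langs` output.

    The first line is a header ("List of available languages ..."),
    and each following non-empty line is one language code.
    """
    lines = list_langs_output.strip().splitlines()
    return [line.strip() for line in lines[1:] if line.strip()]


def cherokee_available():
    """Ask the installed tesseract whether the `chr` language is present."""
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True, text=True, check=True,
    )
    # Check both streams in case the list is not written to stdout.
    return "chr" in parse_langs(result.stdout or result.stderr)
```

Calling `missing_tools()` should return an empty list on a correctly prepared machine, and `cherokee_available()` should return `True` once chr.traineddata is in place.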

Docker Setup & Requirements

If you want to containerize this application, here is a brief overview of how and where to install the required system libraries, language data, and Python dependencies:

1. Base Image and System Libraries

Use a Python base image (such as python:3.11-slim) and install tesseract-ocr and imagemagick via apt:

```
FROM python:3.11-slim

RUN apt-get update && apt-get install -y \
    imagemagick \
    tesseract-ocr \
    curl \
    && rm -rf /var/lib/apt/lists/*
```

2. Cherokee Language Data (`chr.traineddata`)

Tesseract needs the chr.traineddata file to perform Cherokee OCR. Download it and place it into the tessdata folder of the image. For standard Debian-based slim images, this path is typically /usr/share/tesseract-ocr/5/tessdata/ (or /usr/share/tesseract-ocr/4.00/tessdata/ depending on the Tesseract version):

```
RUN mkdir -p /usr/share/tesseract-ocr/5/tessdata/ \
    && curl -L -o /usr/share/tesseract-ocr/5/tessdata/chr.traineddata \
    https://github.com/tesseract-ocr/tessdata/raw/main/chr.traineddata
```
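Because the tessdata path depends on the Tesseract version, it can help to verify where the file actually ended up. The sketch below checks the two Debian paths mentioned above, plus the `TESSDATA_PREFIX` environment variable if it is set. `find_traineddata` is a hypothetical helper, and treating `TESSDATA_PREFIX` as the tessdata directory itself is an assumption that matches recent Tesseract releases.

```python
import os
from pathlib import Path

# The two Debian locations named in this README.
DEFAULT_CANDIDATES = (
    "/usr/share/tesseract-ocr/5/tessdata",
    "/usr/share/tesseract-ocr/4.00/tessdata",
)


def find_traineddata(lang="chr", candidates=DEFAULT_CANDIDATES, environ=None):
    """Return the first existing `<lang>.traineddata` path, or None."""
    environ = os.environ if environ is None else environ
    dirs = list(candidates)
    # Assumption: TESSDATA_PREFIX, when set, points at the tessdata directory.
    prefix = environ.get("TESSDATA_PREFIX")
    if prefix:
        dirs.insert(0, prefix)
    for directory in dirs:
        path = Path(directory) / f"{lang}.traineddata"
        if path.is_file():
            return path
    return None
```

Inside the container, `find_traineddata()` returning `None` would suggest the download step wrote to a directory that the installed Tesseract version does not use.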

3. Application Dependencies & Run

Copy the application source code, install Python packages, and set up the startup command:

```
WORKDIR /app
COPY server/requirements.txt ./server/requirements.txt
RUN pip install --no-cache-dir -r server/requirements.txt

COPY . .

ENV PORT=5001
EXPOSE 5001

CMD ["python", "server/app.py"]
```

1. Web OCR Server & Dashboard

The web app provides a visual UI to upload PNG files, clean them automatically, perform OCR, and view the results in an interactive overlay.

Setup and Running Local Server

Create and Activate Virtual Environment:

```
python3 -m venv .venv
source .venv/bin/activate
```

Install Dependencies:
```
pip install -r server/requirements.txt
```
Start the Flask Server:
```
PORT=5001 python server/app.py
```
Open your browser and go to: http://localhost:5001

Configuration

You can customize the server by setting environment variables or creating a .env file:

PORT: Server port (default: 5000)
UPLOAD_DIR: Path where uploads and results are stored (default: <project-root>/uploads)
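The two settings above could be read roughly as follows. This is a minimal sketch under stated assumptions, not the code in server/app.py: `load_config` is a hypothetical name, the project root defaults to the current working directory, and loading a .env file is not handled here.

```python
import os
from pathlib import Path


def load_config(environ=None, project_root=None):
    """Read PORT and UPLOAD_DIR, falling back to the documented defaults."""
    environ = os.environ if environ is None else environ
    # Assumption: the current directory stands in for <project-root>.
    root = Path(project_root) if project_root is not None else Path.cwd()
    port = int(environ.get("PORT", "5000"))
    upload_dir = Path(environ.get("UPLOAD_DIR", str(root / "uploads")))
    return {"port": port, "upload_dir": upload_dir}
```

With no variables set, this yields port 5000 and an `uploads` directory under the project root; `PORT=5001` in the environment overrides the port, as in the start command shown earlier.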

2. CLI Scripts

You can also run image processing and OCR directly from your terminal.

Clean an image file

```
./scripts/clean-img ./path/to/image.png
ls ./path/to/image.png-out.png
```

Run OCR on an image file

```
./scripts/call-tesseract ./path/to/image.png-out.png
```
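To run both steps in one go, the two commands can be chained from Python. The sketch below assumes, based only on the examples above, that clean-img writes its result to `<input>-out.png`. The helper names are hypothetical, and what call-tesseract prints or writes is not assumed.

```python
import subprocess


def cleaned_path(image_path):
    """Output name of clean-img as shown above: `<input>-out.png`."""
    return f"{image_path}-out.png"


def build_commands(image_path, scripts_dir="./scripts"):
    """Return the clean and OCR commands, in order, as argument lists."""
    return [
        [f"{scripts_dir}/clean-img", image_path],
        [f"{scripts_dir}/call-tesseract", cleaned_path(image_path)],
    ]


def clean_and_ocr(image_path, scripts_dir="./scripts"):
    """Run both scripts, raising CalledProcessError on the first failure."""
    for command in build_commands(image_path, scripts_dir):
        subprocess.run(command, check=True)
```

For example, `clean_and_ocr("./path/to/image.png")` runs from the repository root and stops early if the cleaning step fails, so OCR is never attempted on a missing file.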
