This repository contains CLI scripts and a full Flask web application for processing and viewing OCR results of Cherokee document images.
Before running the scripts or server, ensure you have the following system packages installed:
- imagemagick
- tesseract
  - Ensure the Cherokee training data file chr.traineddata (available from the Tesseract OCR Tessdata Repository) is placed in your system's tessdata directory.
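The prerequisites above can be checked before running anything. The following is a minimal sketch, not part of this repository: the helper names `parse_langs` and `find_problems` are hypothetical, and it assumes the `tesseract` binary is on your PATH and that ImageMagick is installed as either `magick` (version 7) or `convert` (version 6). It relies on `tesseract --list-langs`, which prints a header line followed by one language code per line.

```python
import shutil
import subprocess


def parse_langs(output: str) -> set[str]:
    """Extract language codes from `tesseract --list-langs` output."""
    langs = set()
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the "List of available languages ..." header.
        if not line or line.lower().startswith("list of"):
            continue
        langs.add(line)
    return langs


def find_problems() -> list[str]:
    """Return the names of missing prerequisites (empty list if all found)."""
    problems = []
    # Assumption: ImageMagick 7 provides `magick`, ImageMagick 6 provides `convert`.
    if not (shutil.which("magick") or shutil.which("convert")):
        problems.append("imagemagick")
    if not shutil.which("tesseract"):
        problems.append("tesseract")
        return problems  # cannot query languages without the binary
    result = subprocess.run(
        ["tesseract", "--list-langs"], capture_output=True, text=True
    )
    # Some Tesseract versions print the list to stderr, so check both streams.
    if "chr" not in parse_langs(result.stdout + "\n" + result.stderr):
        problems.append("chr.traineddata")
    return problems


if __name__ == "__main__":
    for problem in find_problems():
        print("missing:", problem)
```

If `chr.traineddata` is reported missing even though you downloaded it, the file is probably in a different tessdata directory than the one Tesseract is reading; the header line of `tesseract --list-langs` shows the directory in use.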
If you want to containerize this application, here is a brief overview of how and where to install the required system libraries, language data, and Python dependencies:
Use a Python base image (such as python:3.11-slim) and install tesseract-ocr and imagemagick via apt:
FROM python:3.11-slim
RUN apt-get update && apt-get install -y \
    imagemagick \
    tesseract-ocr \
    curl \
    && rm -rf /var/lib/apt/lists/*

Tesseract needs the chr.traineddata file to perform Cherokee OCR. Download it and place it in the image's tessdata folder. For standard Debian-based slim images, this path is typically /usr/share/tesseract-ocr/5/tessdata/ (or /usr/share/tesseract-ocr/4.00/tessdata/, depending on the Tesseract version):
RUN mkdir -p /usr/share/tesseract-ocr/5/tessdata/ \
    && curl -L -o /usr/share/tesseract-ocr/5/tessdata/chr.traineddata \
    https://github.com/tesseract-ocr/tessdata/raw/main/chr.traineddata

Copy the application source code, install Python packages, and set up the startup command:
WORKDIR /app
COPY server/requirements.txt ./server/requirements.txt
RUN pip install --no-cache-dir -r server/requirements.txt
COPY . .
ENV PORT=5001
EXPOSE 5001
CMD ["python", "server/app.py"]

The web app provides a visual UI to upload PNG files, clean them automatically, perform OCR, and view the results in an interactive overlay.
- Create and activate a virtual environment:
  python3 -m venv .venv
  source .venv/bin/activate
- Install dependencies:
  pip install -r server/requirements.txt
- Start the Flask server:
  PORT=5001 python server/app.py
- Open your browser and go to:
  http://localhost:5001
You can customize the server by setting environment variables or creating a .env file:
- PORT: Server port (default: 5000)
- UPLOAD_DIR: Path where uploads and results are stored (default: <project-root>/uploads)
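To make the defaults above concrete, here is a rough sketch of how the two variables could be resolved in Python. It is an illustration only and is not taken from server/app.py: the function name `load_config` and its parameters are hypothetical, and it does not cover loading a .env file.

```python
import os
from pathlib import Path


def load_config(env=None, project_root=None):
    """Resolve PORT and UPLOAD_DIR using the documented defaults."""
    # Assumption: settings come from the process environment unless a
    # mapping is passed in explicitly (handy for testing).
    env = os.environ if env is None else env
    root = Path(project_root) if project_root else Path.cwd()
    # PORT defaults to 5000 when unset.
    port = int(env.get("PORT", "5000"))
    # UPLOAD_DIR defaults to <project-root>/uploads when unset or empty.
    upload_dir = Path(env.get("UPLOAD_DIR") or root / "uploads")
    return {"port": port, "upload_dir": upload_dir}
```

For example, `load_config({"PORT": "5001"}, "/srv/app")` would yield port 5001 with uploads stored under /srv/app/uploads.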
You can also run image processing and OCR directly from your terminal.
./scripts/clean-img ./path/to/image.png
ls ./path/to/image.png-out.png
./scripts/call-tesseract ./path/to/image.png-out.png
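If you want to run both steps for many images, the two commands can be chained from Python. This is a minimal sketch under stated assumptions: the helper names (`cleaned_path`, `build_commands`, `run_pipeline`) are hypothetical, the "<input>-out.png" naming is inferred from the example above, and nothing is assumed about where call-tesseract writes its results.

```python
import subprocess


def cleaned_path(image: str) -> str:
    """Path of the cleaned image, assuming clean-img writes "<input>-out.png"."""
    return f"{image}-out.png"


def build_commands(image: str, scripts_dir: str = "./scripts") -> list[list[str]]:
    """Build the two commands in the order they must run."""
    return [
        [f"{scripts_dir}/clean-img", image],
        [f"{scripts_dir}/call-tesseract", cleaned_path(image)],
    ]


def run_pipeline(image: str, scripts_dir: str = "./scripts") -> str:
    """Clean the image, then OCR the cleaned copy; raise if either step fails."""
    for command in build_commands(image, scripts_dir):
        # check=True stops the pipeline on the first non-zero exit status.
        subprocess.run(command, check=True)
    return cleaned_path(image)
```

Calling `run_pipeline("./path/to/image.png")` from the repository root would run the same two scripts as the commands above and return the cleaned image's path.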