CharlieMcVicker/chr-ocr

Scripts for doing OCR (text recognition) on the Cherokee Phoenix

This repository contains CLI scripts and a full Flask web application for processing and viewing OCR results of Cherokee document images.

System Dependencies

Before running the scripts or server, ensure you have the following system packages installed:

  • imagemagick
  • tesseract
  • chr.traineddata, the Cherokee trained-data file (available from the Tesseract OCR Tessdata Repository), placed in your system's tessdata directory
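
A quick way to confirm the two system packages are installed is to look for their executables on PATH. The following is a minimal sketch, not part of this repository; the helper names are hypothetical, and it assumes ImageMagick is exposed as either `magick` (ImageMagick 7) or `convert` (ImageMagick 6), which depends on your installation.

```python
import shutil


def missing_tools(names):
    """Return the executable names from `names` that are not found on PATH."""
    return [name for name in names if shutil.which(name) is None]


def has_imagemagick():
    """Return True if an ImageMagick entry point is on PATH.

    Assumption: ImageMagick 7 installs `magick`, ImageMagick 6 installs `convert`.
    """
    return any(shutil.which(name) is not None for name in ("magick", "convert"))


if __name__ == "__main__":
    missing = missing_tools(["tesseract"])
    if not has_imagemagick():
        missing.append("imagemagick")
    if missing:
        print("Missing system dependencies:", ", ".join(missing))
    else:
        print("tesseract and imagemagick were found on PATH")
```

This only checks that the executables exist; it does not verify that chr.traineddata is installed.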

Docker Setup & Requirements

To containerize this application, install the required system libraries, language data, and Python dependencies in the image as outlined below:

1. Base Image and System Libraries

Use a Python base image (such as python:3.11-slim) and install tesseract-ocr and imagemagick via apt:

FROM python:3.11-slim

RUN apt-get update && apt-get install -y \
    imagemagick \
    tesseract-ocr \
    curl \
    && rm -rf /var/lib/apt/lists/*

2. Cherokee Language Data (chr.traineddata)

Tesseract needs the chr.traineddata file to perform Cherokee OCR. Download it and place it in the image's tessdata folder. On standard Debian-based slim images, this path is typically /usr/share/tesseract-ocr/5/tessdata/ (or /usr/share/tesseract-ocr/4.00/tessdata/, depending on the Tesseract version). The command below uses the Tesseract 5 path; adjust it if your image ships a different version:

RUN mkdir -p /usr/share/tesseract-ocr/5/tessdata/ \
    && curl -L -o /usr/share/tesseract-ocr/5/tessdata/chr.traineddata \
    https://github.com/tesseract-ocr/tessdata/raw/main/chr.traineddata
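
Because the tessdata path varies with the Tesseract version, it can help to check where chr.traineddata actually ended up. The sketch below is an illustration, not code from this repository: `find_traineddata` is a hypothetical helper that looks in the directory named by the `TESSDATA_PREFIX` environment variable (if set) and then in the two paths mentioned above.

```python
import os
from pathlib import Path

# The two Debian paths named in this README; other systems may differ.
DEFAULT_TESSDATA_DIRS = (
    "/usr/share/tesseract-ocr/5/tessdata",
    "/usr/share/tesseract-ocr/4.00/tessdata",
)


def find_traineddata(lang="chr", env=None, candidates=DEFAULT_TESSDATA_DIRS):
    """Return the Path of <lang>.traineddata if found, else None."""
    env = os.environ if env is None else env
    dirs = []
    prefix = env.get("TESSDATA_PREFIX")
    if prefix:
        # Assumption: recent Tesseract versions expect TESSDATA_PREFIX to be
        # the tessdata directory itself; older ones expected its parent.
        # Checking both covers either convention.
        dirs.append(Path(prefix))
        dirs.append(Path(prefix) / "tessdata")
    dirs.extend(Path(d) for d in candidates)
    for directory in dirs:
        path = directory / f"{lang}.traineddata"
        if path.is_file():
            return path
    return None


if __name__ == "__main__":
    found = find_traineddata()
    print(found if found else "chr.traineddata not found in the checked directories")
```

If this prints a "not found" message inside the container, the download step likely targeted a directory that the installed Tesseract version does not use.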

3. Application Dependencies & Run

Copy the application source code, install Python packages, and set up the startup command:

WORKDIR /app
COPY server/requirements.txt ./server/requirements.txt
RUN pip install --no-cache-dir -r server/requirements.txt

COPY . .

ENV PORT=5001
EXPOSE 5001

CMD ["python", "server/app.py"]

1. Web OCR Server & Dashboard

The web app provides a visual UI to upload PNG files, clean them automatically, perform OCR, and view the results in an interactive overlay.

Setting Up and Running the Local Server

  1. Create and Activate Virtual Environment:

    python3 -m venv .venv
    source .venv/bin/activate
  2. Install Dependencies:

    pip install -r server/requirements.txt
  3. Start the Flask Server:

    PORT=5001 python server/app.py
  4. Open your browser and go to: http://localhost:5001

Configuration

You can customize the server by setting environment variables or creating a .env file:

  • PORT: Server port (default: 5000)
  • UPLOAD_DIR: Path where uploads and results are stored (default: <project-root>/uploads)
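
The defaults listed above can be sketched as follows. This is an illustration of how such settings are commonly resolved, not the actual code in server/app.py; `load_config` and its parameters are hypothetical, and loading values from a .env file is not shown.

```python
import os
from pathlib import Path


def load_config(env=None, project_root=None):
    """Resolve PORT and UPLOAD_DIR, falling back to the documented defaults."""
    env = os.environ if env is None else env
    root = Path(project_root) if project_root is not None else Path.cwd()
    # PORT defaults to 5000 when unset.
    port = int(env.get("PORT", "5000"))
    # UPLOAD_DIR defaults to <project-root>/uploads when unset or empty.
    upload_dir = Path(env.get("UPLOAD_DIR") or root / "uploads")
    return {"port": port, "upload_dir": upload_dir}


if __name__ == "__main__":
    print(load_config())
```

Passing `env` explicitly, for example `load_config(env={"PORT": "5001"})`, makes the behavior easy to check without touching the real environment.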

2. CLI Scripts

You can also run image processing and OCR directly from your terminal.

Clean an image file

./scripts/clean-img ./path/to/image.png
# The cleaned image is written next to the input, with -out.png appended to its name
ls ./path/to/image.png-out.png

Run OCR on an image file

./scripts/call-tesseract ./path/to/image.png-out.png
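
The two steps above can be chained from Python. The sketch below assumes it is run from the repository root and that clean-img writes its output to the input path with -out.png appended, as in the example above; `cleaned_path`, `pipeline_commands`, and `run_pipeline` are hypothetical helpers, not part of this repository.

```python
import subprocess


def cleaned_path(image_path):
    """Return the output path that clean-img is expected to produce."""
    return f"{image_path}-out.png"


def pipeline_commands(image_path, scripts_dir="./scripts"):
    """Build the clean-then-OCR command lines for one image."""
    return [
        [f"{scripts_dir}/clean-img", image_path],
        [f"{scripts_dir}/call-tesseract", cleaned_path(image_path)],
    ]


def run_pipeline(image_path, scripts_dir="./scripts"):
    """Run both scripts in order, stopping if either exits with an error."""
    for command in pipeline_commands(image_path, scripts_dir):
        subprocess.run(command, check=True)
```

For example, `run_pipeline("./path/to/image.png")` cleans the image and then runs OCR on the cleaned file. With `check=True`, a failing cleaning step raises `subprocess.CalledProcessError` before OCR is attempted.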
