AlphaNet is an end-to-end pipeline that takes images containing text and outputs a textual representation of that content. The project is divided into six stages, each responsible for a specific part of the image-to-text process, and orchestrated by a single Main_Pipeline.py script.
- Overview
- Pipeline Stages
- Installation
- Usage
- Roadmap
- System Interfaces and Visual Representation
- License
AlphaNet is designed to convert images into accurate text by breaking down the recognition task into multiple stages:
- Segmenting images into words
- Extracting characters from these words
- Classifying each character
- Converting those classifications into text
- Correcting the recognized text using a Large Language Model (LLM)
- Outputting the final result
This modular approach makes it easy to replace or improve individual components without affecting the rest of the pipeline. AlphaNet combines this modularity with LLM-powered correction, which is intended to improve accuracy on both structured and unstructured text.
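The modular design described above can be pictured as a list of named, replaceable stages. The following is only a minimal sketch: the names run_pipeline and replace_stage and the toy string functions standing in for the stages are assumptions for illustration, not the repository's actual API.

```python
from typing import Any, Callable

# A stage is any callable that consumes the previous stage's output.
Stage = Callable[[Any], Any]
Pipeline = list[tuple[str, Stage]]


def run_pipeline(stages: Pipeline, data: Any) -> Any:
    """Pass data through every stage in order and return the final result."""
    for _name, stage in stages:
        data = stage(data)
    return data


def replace_stage(stages: Pipeline, name: str, new_stage: Stage) -> Pipeline:
    """Return a copy of the pipeline with one named stage swapped out."""
    return [(n, new_stage if n == name else s) for n, s in stages]


# Toy stand-ins (hypothetical): real stages would work on images and models.
stages: Pipeline = [
    ("segment", list),                                       # "image" -> characters
    ("classify", lambda chars: [c.lower() for c in chars]),  # normalise each character
    ("convert", "".join),                                    # characters -> text
    ("correct", lambda text: text),                          # no-op correction
]

print(run_pipeline(stages, "HeLLo"))    # hello

# Swap only the correction stage; the other stages are untouched.
improved = replace_stage(stages, "correct", str.capitalize)
print(run_pipeline(improved, "HeLLo"))  # Hello
```

Because each stage only depends on the output of the one before it, a stage can be swapped without touching the others, as long as its input and output types stay the same.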
The following diagram illustrates the workflow of the model:
- Segmentation Module: Takes a .png input, segments the characters in the image, and outputs them sequentially.
- Classification Module: Processes the segmented characters and generates their corresponding numeric representations (e.g., [4, 23, 0, 11, 15, 11, 4]).
- Conversion Module: Converts the numeric representations into their text equivalent, identifying and marking any errors.
- Correction Module: Corrects errors in the text (e.g., replacing incorrect characters, as shown with the red N).
- Output: Produces the corrected text, ensuring it matches the expected input.
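To make the conversion step concrete, here is a small sketch of how numeric labels could be turned into text. It assumes that label 0 maps to "a" and label 25 to "z", and it uses "?" as a hypothetical marker for labels outside that range; the actual label scheme and error marker used by the project may differ.

```python
import string

ALPHABET = string.ascii_lowercase  # assumption: label 0 -> "a", ..., 25 -> "z"
UNKNOWN = "?"                      # hypothetical marker for out-of-range labels


def labels_to_text(labels: list[int]) -> str:
    """Map each class index to a letter, marking invalid indices with UNKNOWN."""
    return "".join(
        ALPHABET[i] if 0 <= i < len(ALPHABET) else UNKNOWN for i in labels
    )


print(labels_to_text([4, 23, 0, 11, 15, 11, 4]))  # exalple
print(labels_to_text([7, 8, 99]))                 # hi?
```

Under this assumed mapping, the example labels decode to "exalple", which is one character away from "example". That is the kind of residual error the correction module is meant to fix.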
- File/Module: stage_1
  - Purpose: Takes sentence images, breaks them into words, and further segments these into single characters for the next stage.
- File/Module: stage_2
  - Purpose: Classifies each character image (from Stage 1) and maps it to the corresponding textual character.
- File/Module: stage_3
  - Purpose: Converts the classification outputs (which may be in vector/label format) into text strings.
- File/Module: stage_4
  - Purpose: Refines and corrects the recognized text using a Large Language Model (LLM) to ensure higher accuracy and resolve ambiguities.
- File/Module: stage_5
  - Purpose: Finalizes the text (e.g., formatting, post-processing) and provides it as output (console, file, or other desired format).
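The README does not say how Stage 1 splits a sentence into words and characters. One common baseline is a vertical projection profile: columns with no ink separate characters, and wider blank gaps separate words. The sketch below illustrates that idea on a tiny binary image (1 = ink); the function names and the word_gap threshold are assumptions, and the approach does not handle touching characters.

```python
def column_runs(image: list[list[int]]) -> list[tuple[int, int]]:
    """Return (start, end) column ranges, end exclusive, that contain ink."""
    width = len(image[0]) if image else 0
    has_ink = [any(row[x] for row in image) for x in range(width)]
    runs: list[tuple[int, int]] = []
    start = None
    for x, ink in enumerate(has_ink):
        if ink and start is None:
            start = x                 # a character begins
        elif not ink and start is not None:
            runs.append((start, x))   # a character ends at the first blank column
            start = None
    if start is not None:
        runs.append((start, width))   # character touching the right edge
    return runs


def group_words(
    runs: list[tuple[int, int]], word_gap: int = 3
) -> list[list[tuple[int, int]]]:
    """Group character runs into words; word_gap or more blank columns start a new word."""
    words: list[list[tuple[int, int]]] = []
    for run in runs:
        if words and run[0] - words[-1][-1][1] < word_gap:
            words[-1].append(run)
        else:
            words.append([run])
    return words


image = [
    [1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0],
    [1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0],
]
runs = column_runs(image)
print(runs)               # [(0, 2), (3, 4), (8, 11)]
print(group_words(runs))  # [[(0, 2), (3, 4)], [(8, 11)]]
```

Each column range can then be cropped from the image and passed to the classifier as a single character.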
- Python 3.12 or later.
- pip (Python package manager).
- Internet connection for downloading LLM models.
- Clone this repository:

    git clone https://github.com/your-username/AlphaNet.git
    cd AlphaNet
    git lfs pull  # if not already done automatically

- Download the classification dataset. Example dataset: preprocessed datasets are available for download.

    unzip Downloaded/path/DataSet.zip
    mv Downloaded/path/Dataset Stage_2_Classification_Module/DataSet  # or path/to/AlphaNet/Stage_2_Classification_Module

- Create and activate a virtual environment:

    python -m venv venv
    source venv/bin/activate  # Linux/macOS
    venv\Scripts\activate     # Windows

- Install the requirements:

    pip install -r requirements.txt

- Install Ollama:

    curl -fsSL https://ollama.com/install.sh | sh  # for Linux

  Alternatively, download the installer from Ollama's website.

- Install Llama 2:

    ollama serve &
    ollama pull llama2

- Install the ViT model. The model is available for download.

    mv Downloaded/path/best_model_ViT_2025-01-01.pth Stage_2_Classification_Module/models/vit_models/best_model_ViT_2025-01-01.pth  # or path/to/AlphaNet/Stage_2_Classification_Module
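Once ollama serve is running and llama2 has been pulled, a correction request can be sent to Ollama's local HTTP API. The sketch below is an assumption about how such a call might look, not the code used in stage_4: the prompt wording and the function names build_request and correct_text are hypothetical, while the endpoint and JSON fields are those of Ollama's generate API on its default port.

```python
import json
import urllib.request

# Ollama's default local endpoint for single-prompt generation.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(text: str, model: str = "llama2") -> dict:
    """Build the JSON payload for a non-streaming generate call."""
    prompt = (
        "Correct the spelling errors in the following OCR output. "
        "Reply with the corrected text only.\n\n" + text
    )
    return {"model": model, "prompt": prompt, "stream": False}


def correct_text(text: str, timeout: float = 60.0) -> str:
    """Send text to the local Ollama server and return the model's reply."""
    body = json.dumps(build_request(text)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # With "stream": False the server returns one JSON object whose
        # "response" field holds the generated text.
        return json.loads(response.read())["response"].strip()
```

Calling correct_text("exalple") with the server running returns whatever the model suggests. The reply is not deterministic, so it is worth checking it before treating it as the final text.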
To start the graphical user interface, run the following command in your terminal:

    python Project_Main/run_gui.py

This command starts a local server that you can access from your browser at:

    http://127.0.0.1:7860

You can also run the pipeline without the GUI:

    python Project_Main/Main_Pipeline.py

- Create an OCR pipeline for converting full sentences to text
- Support punctuation marks
- Improve segmentation
- Adjust the pipeline to the IAM Handwritten Forms Dataset
- Multi-language support
  - Chinese
  - Spanish
The AlphaNet system incorporates several user interfaces, designed to optimize user experience and operational efficiency. Below is a detailed description of each interface, accompanied by visual representations.
The Upload and Process Interface serves as the initial interaction point for users. It supports:
- Drag-and-Drop Functionality: Enables seamless image upload without navigating file directories.
- Real-Time Image Preview: Provides immediate feedback to verify uploaded content.
- Instant Image-to-Text Option: A single button runs the whole pipeline and displays a visual output.
This interface exemplifies usability by combining simplicity with efficiency.
The Generate and Process Interface creates images from text and converts them back to text. Key features include:
- Custom Text Generation: Users can input custom text to generate corresponding images.
- Font Customization: A variety of font styles are available to enhance personalization.
- Integrated Processing: Generated images can be processed instantly within the interface.
- Clean Images: Removes the files left over from previous uses of the system, keeping the workspace clean.
This component bridges creative input with functional output.
The Directory Management Interface is designed for efficient system maintenance, providing tools to manage operational directories. It includes:
- Reset Options: Enables users to restore directories to their default state.
- Real-Time Feedback: Displays the current status of directories for enhanced oversight.
This interface ensures smooth backend operation, critical for system stability.
The About Section offers an overview of the system’s core functionalities in a structured and accessible manner. It highlights:
- System Capabilities: Upload, text generation, processing, and directory management tools.
- Key Features: Modular design, deep learning integration, and GUI accessibility.
This section functions as a comprehensive introduction to AlphaNet.
These visual interfaces exemplify the modular and user-centric design of AlphaNet. Each interface addresses specific stages of the image-to-text conversion pipeline, ensuring both ease of use and operational efficiency. Together, they form an integral part of the system's accessibility and functionality.
This project is licensed under the MIT License. See the LICENSE file for details.




