Real Time Sign Language Translator with Mediapipe

This is a Kazakh Sign Language translator that detects hand keypoints with Google's MediaPipe framework and converts gestures into words.

Run

Clone the repository:

git clone https://github.com/qonstant/SignLanguageMediapipe.git

Install the dependencies used in this project (developed with Python 3.11.8):

pip3 install "opencv-python==4.9.0.*"
pip3 install numpy==1.26.4
pip3 install mediapipe==0.9.2.1

Introduction:

In Kazakhstan, the absence of a Kazakh Sign Language translator app leaves the deaf and hard of hearing community without a vital tool for communication. This project fills a crucial gap by providing a means to bridge linguistic barriers, ensuring inclusivity and equal access to communication for all. Its importance lies not only in addressing an urgent need but also in pioneering innovation where none exists.

What is Mediapipe?

The MediaPipe Gesture Recognizer makes it possible to build machine learning models that work from the positions of the hand keypoints rather than from the entire image. This focused input shortens training, reduces memory consumption, and speeds up model development. Because the models train on a table of finger coordinates instead of raw frames, they train quickly, recognize hand gestures in real time, and are easy to connect to the corresponding application features.

Data:

I have used my own dataset.

To save the positions of the keypoints, press the designated key together with the ID of the gesture being recorded. The ID is stored in the first column of the row, and the keypoint coordinates are stored in the columns that follow.
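The row layout described above can be sketched with the standard csv module. This is a minimal illustration, not the project's own logging code: the function name log_keypoints and the file name used in the usage note are hypothetical.

```python
import csv
from pathlib import Path


def log_keypoints(csv_path, label_id, landmark_points):
    """Append one sample: the label ID first, then x0, y0, x1, y1, ..."""
    row = [label_id]
    for x, y in landmark_points:
        row.extend([x, y])
    # Append mode keeps the samples that were recorded earlier.
    with Path(csv_path).open("a", newline="") as f:
        csv.writer(f).writerow(row)
    return row
```

Calling log_keypoints("keypoint.csv", 3, [[10, 20], [30, 40]]) would append the row 3,10,20,30,40, so the first column can later be read back as the class label.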

This is how the coordinates of the landmarks (hand keypoints) are obtained:

def calc_landmark_list(image, landmarks):
    # image.shape is (height, width, channels)
    image_height, image_width = image.shape[0], image.shape[1]

    landmark_point = []

    # MediaPipe reports x and y normalized to the frame size, so scale them
    # to pixels and cap the result at the last valid column/row index.
    for landmark in landmarks.landmark:
        landmark_x = min(int(landmark.x * image_width), image_width - 1)
        landmark_y = min(int(landmark.y * image_height), image_height - 1)

        landmark_point.append([landmark_x, landmark_y])

    return landmark_point
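To see what the function returns without a camera, it can be called with stand-in objects. The sketch below repeats the function so that it runs on its own; fake_image and fake_hand are hypothetical stand-ins built with SimpleNamespace that only mimic the attributes the function reads (shape, landmark, x, y). They are not MediaPipe objects.

```python
from types import SimpleNamespace


def calc_landmark_list(image, landmarks):
    image_width, image_height = image.shape[1], image.shape[0]
    landmark_point = []
    for landmark in landmarks.landmark:
        landmark_x = min(int(landmark.x * image_width), image_width - 1)
        landmark_y = min(int(landmark.y * image_height), image_height - 1)
        landmark_point.append([landmark_x, landmark_y])
    return landmark_point


# Stand-ins (hypothetical): a 640x480 frame and two normalized landmarks.
fake_image = SimpleNamespace(shape=(480, 640, 3))
fake_hand = SimpleNamespace(landmark=[
    SimpleNamespace(x=0.5, y=0.25),
    SimpleNamespace(x=1.0, y=1.0),  # exactly on the right/bottom edge
])

points = calc_landmark_list(fake_image, fake_hand)
print(points)  # [[320, 120], [639, 479]]
```

The second point shows the clamping: a normalized value of 1.0 would map to pixel 640, which is outside a 640-pixel-wide frame, so it is capped at 639. Only the upper bound is capped; a coordinate below 0 would pass through unchanged.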

Datasets structure

So far the dataset covers only 14 static words and 4 movements: 3456 rows for the static words and 5296 rows for the actions.
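Row counts like these can be checked per class by reading the label from the first column. The following is a small sketch under the assumption that the data is stored as comma-separated rows with an integer ID first; the function name count_rows_per_label and the sample rows are made up for illustration.

```python
import csv
from collections import Counter


def count_rows_per_label(lines):
    """Count samples per class ID, reading the ID from the first column."""
    counts = Counter()
    for row in csv.reader(lines):
        if row:  # skip blank lines
            counts[int(row[0])] += 1
    return counts


# Made-up sample rows: label ID, then keypoint coordinates.
sample = ["0,10,20,30,40", "0,11,21,31,41", "2,5,6,7,8"]
print(count_rows_per_label(sample))  # Counter({0: 2, 2: 1})
```

The same function accepts an open file object, since csv.reader works on any iterable of text lines. A strongly uneven count between classes is worth knowing about before training.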

Static words save

Model building

There are two models: one for actions, which uses an LSTM (Long Short-Term Memory) layer, and one for static words, which does not.

First one:

import tensorflow as tf

# TIME_STEPS, DIMENSION and NUM_CLASSES must be defined before this point.
use_lstm = False

if use_lstm:
    model = tf.keras.models.Sequential([
        tf.keras.layers.InputLayer(input_shape=(TIME_STEPS * DIMENSION, )),
        # Turn the flat input into a (TIME_STEPS, DIMENSION) sequence for the LSTM
        tf.keras.layers.Reshape((TIME_STEPS, DIMENSION)),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.LSTM(16),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(10, activation='relu'),
        tf.keras.layers.Dense(NUM_CLASSES, activation='softmax')
    ])
else:
    # Plain fully connected network on the flat input
    model = tf.keras.models.Sequential([
        tf.keras.layers.InputLayer(input_shape=(TIME_STEPS * DIMENSION, )),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(24, activation='relu'),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(10, activation='relu'),
        tf.keras.layers.Dense(NUM_CLASSES, activation='softmax')
    ])

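When use_lstm is True, the model reshapes the flat input into a sequence of TIME_STEPS steps with DIMENSION values each and passes it through an LSTM layer, which suits sequence data.
When use_lstm is False, the model is a simpler stack of fully connected layers without an LSTM. As written, the snippet sets use_lstm to False, so this dense variant is the one that gets built.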
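The Reshape step in the LSTM branch only regroups the flat input vector into consecutive time steps; it does not change any values. A plain-Python sketch of that regrouping is shown here. The helper reshape_history is hypothetical, and the TIME_STEPS and DIMENSION values are small illustrative numbers, not the ones used in the project.

```python
def reshape_history(flat, time_steps, dimension):
    """Split a flat list of length time_steps * dimension into time steps."""
    if len(flat) != time_steps * dimension:
        raise ValueError("expected %d values, got %d"
                         % (time_steps * dimension, len(flat)))
    # Row-major regrouping: the first `dimension` values form step 0, and so on.
    return [flat[i * dimension:(i + 1) * dimension] for i in range(time_steps)]


TIME_STEPS = 4  # illustrative value only
DIMENSION = 2   # one (x, y) pair per time step

flat = [0, 0, 1, 2, 3, 5, 6, 9]
seq = reshape_history(flat, TIME_STEPS, DIMENSION)
print(seq)  # [[0, 0], [1, 2], [3, 5], [6, 9]]
```

Each inner list is one time step, which is the per-step input an LSTM layer consumes in order.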

Second one:

model = tf.keras.models.Sequential([
    tf.keras.layers.Input((21 * 2, )),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(20, activation='relu'),
    tf.keras.layers.Dropout(0.4),
    tf.keras.layers.Dense(10, activation='relu'),
    tf.keras.layers.Dense(NUM_CLASSES, activation='softmax')
])

Input Layer: Accepts input data with a shape of 42 (21 * 2), which represents the 21 hand keypoints with 2 coordinates (x, y) each.
Dropout Layer (0.2): Applies dropout regularization to randomly deactivate 20% of the input units during training to prevent overfitting.
Dense Layer (20 units): A fully connected layer with 20 units, applying the Rectified Linear Unit (ReLU) activation function for non-linearity.
Dropout Layer (0.4): Another dropout layer, this time with a rate of 40%.
Dense Layer (10 units): Another fully connected layer with 10 units and ReLU activation.
Output Layer: Produces the final classification output with NUM_CLASSES units and softmax activation, suitable for multi-class classification tasks.
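The softmax output can be read as one probability per class, and the predicted word is the class with the highest probability. The sketch below shows that step in plain Python; the helper names softmax and predict_class and the example scores are assumptions for illustration and are not taken from the project code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_class(logits):
    """Return (class index, probability) of the most likely class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]


# Made-up raw scores for three classes.
class_id, confidence = predict_class([0.1, 2.0, -1.0])
print(class_id)  # 1
```

In an application, the returned probability could also serve as a confidence threshold, so that low-confidence frames are ignored rather than translated into a word.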

Model Training

Training Results

Model for recognition of actions:

Model for recognition of static words:

Results
