Real Time Sign Language Translator with Mediapipe

This is a Kazakh Sign Language translator that detects hand keypoints with Google's MediaPipe framework and converts gestures into words.

Run

  1. Clone the repository:
    git clone https://github.com/qonstant/SignLanguageMediapipe.git
  2. Install the dependencies (the project was developed with Python 3.11.8):
    pip3 install "opencv-python==4.9.0.*"
    pip3 install numpy==1.26.4
    pip3 install mediapipe==0.9.2.1
    

Introduction:

Kazakhstan has no Kazakh Sign Language translator app, which leaves the deaf and hard-of-hearing community without an important communication tool. This project aims to fill that gap and give everyone more equal access to communication.

What is Mediapipe?

The MediaPipe Gesture Recognizer makes it possible to build machine learning models that work only with the position of the hand rather than the entire picture. Training on hand coordinates instead of full images saves training time, reduces memory consumption, and speeds up model development. The models here are trained on a table (an Excel-style sheet) containing the coordinates of the fingers, which keeps them small enough to recognize hand gestures in real time and makes them easy to integrate into an application.

[Figure: Mediapipe]

Data:

I have used my own dataset.

[Figure: Dataset1]

To save the position of the key points, press a specific key together with the ID of the word being recorded. The ID is stored in the first column of the row, and the coordinates of the keypoints are stored in the columns that follow.
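As a rough illustration of that row layout, the sketch below appends one sample to a CSV file: the class ID first, then the flattened x, y coordinates. The helper names `append_sample` and `read_rows`, the file name, and the choice of the `csv` module are assumptions made for this example; the repository may log its data differently.

```python
import csv
import os
import tempfile


def append_sample(csv_path, class_id, landmark_points):
    """Append one row: class ID first, then x0, y0, x1, y1, ... (hypothetical helper)."""
    # Flatten [[x0, y0], [x1, y1], ...] into [x0, y0, x1, y1, ...].
    flat = [coord for point in landmark_points for coord in point]
    with open(csv_path, "a", newline="") as f:
        csv.writer(f).writerow([class_id, *flat])


def read_rows(csv_path):
    """Read all rows back as lists of strings (hypothetical helper)."""
    with open(csv_path, newline="") as f:
        return list(csv.reader(f))


# Demo with two tiny made-up samples (two keypoints each instead of 21).
demo_path = os.path.join(tempfile.mkdtemp(), "keypoints.csv")
append_sample(demo_path, 3, [[10, 20], [30, 40]])
append_sample(demo_path, 0, [[1, 2], [3, 4]])
print(read_rows(demo_path))
```

With 21 keypoints per hand, a real row would hold 1 ID column followed by 42 coordinate columns.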

The coordinates of the landmarks (hand key points) are extracted as follows:

def calc_landmark_list(image, landmarks):
    """Convert normalized hand landmarks to pixel coordinates."""
    image_height, image_width = image.shape[0], image.shape[1]

    landmark_point = []

    # MediaPipe reports x and y normalized by the image width and height;
    # scale them to pixels and clamp to the last valid column/row.
    for landmark in landmarks.landmark:
        landmark_x = min(int(landmark.x * image_width), image_width - 1)
        landmark_y = min(int(landmark.y * image_height), image_height - 1)

        landmark_point.append([landmark_x, landmark_y])

    return landmark_point
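To see what the function above returns without a camera, it can be called on stand-in objects. In this sketch the function is repeated so the block runs on its own, and `fake_image` and `fake_landmarks` are hypothetical placeholders that only mimic the attributes the function reads (`shape`, `landmark`, `x`, `y`); they are not real MediaPipe objects.

```python
from types import SimpleNamespace


def calc_landmark_list(image, landmarks):
    image_height, image_width = image.shape[0], image.shape[1]
    landmark_point = []
    for landmark in landmarks.landmark:
        landmark_x = min(int(landmark.x * image_width), image_width - 1)
        landmark_y = min(int(landmark.y * image_height), image_height - 1)
        landmark_point.append([landmark_x, landmark_y])
    return landmark_point


# Stand-in for a 640x480 frame: shape is (height, width, channels).
fake_image = SimpleNamespace(shape=(480, 640, 3))

# Stand-in for a landmark list with two normalized points.
fake_landmarks = SimpleNamespace(landmark=[
    SimpleNamespace(x=0.5, y=0.25),  # inside the frame
    SimpleNamespace(x=1.0, y=1.0),   # on the bottom-right edge, gets clamped
])

points = calc_landmark_list(fake_image, fake_landmarks)
print(points)  # [[320, 120], [639, 479]]
```

The second point shows why the `min(...)` clamp is there: a normalized value of 1.0 would otherwise map to an index one past the last pixel.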

Datasets structure

So far the dataset covers only 14 static words and 4 movements: 3456 rows for static words and 5296 rows for actions.

Static words save

[Figure: Dataset2]

Model building

There are two models: one for actions, which can use an LSTM (Long Short-Term Memory) layer, and one for static words, which has no LSTM.

First model (actions):

import tensorflow as tf

# TIME_STEPS, DIMENSION and NUM_CLASSES must be defined before this point.
use_lstm = False
model = None

if use_lstm:
    model = tf.keras.models.Sequential([
        tf.keras.layers.InputLayer(input_shape=(TIME_STEPS * DIMENSION, )),
        # Turn the flat vector back into a (TIME_STEPS, DIMENSION) sequence.
        tf.keras.layers.Reshape((TIME_STEPS, DIMENSION)),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.LSTM(16),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(10, activation='relu'),
        tf.keras.layers.Dense(NUM_CLASSES, activation='softmax')
    ])
else:
    # Plain fully connected network on the flattened sequence.
    model = tf.keras.models.Sequential([
        tf.keras.layers.InputLayer(input_shape=(TIME_STEPS * DIMENSION, )),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(24, activation='relu'),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(10, activation='relu'),
        tf.keras.layers.Dense(NUM_CLASSES, activation='softmax')
    ])
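  • When use_lstm is True, the model reshapes the flat input into a (TIME_STEPS, DIMENSION) sequence and passes it through a single LSTM layer with 16 units, which suits sequence data.
  • When use_lstm is False, the model is a simpler fully connected network that works directly on the flattened input, without an LSTM layer.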
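The reshape step in the LSTM branch can be pictured in plain Python. This is only a sketch of the row-major regrouping that a reshape from `(TIME_STEPS * DIMENSION,)` to `(TIME_STEPS, DIMENSION)` performs; the helper name `reshape_flat_sequence` and the values `TIME_STEPS = 3` and `DIMENSION = 2` are assumptions chosen for the example, since the real values are not given here.

```python
def reshape_flat_sequence(flat, time_steps, dimension):
    """Split a flat vector into time_steps rows of dimension values each."""
    if len(flat) != time_steps * dimension:
        raise ValueError(
            f"expected {time_steps * dimension} values, got {len(flat)}"
        )
    # Row-major grouping: the first `dimension` values form time step 0, etc.
    return [flat[t * dimension:(t + 1) * dimension] for t in range(time_steps)]


TIME_STEPS = 3  # illustrative value, not taken from the project
DIMENSION = 2   # illustrative value, e.g. one (x, y) pair per time step

flat = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
sequence = reshape_flat_sequence(flat, TIME_STEPS, DIMENSION)
print(sequence)  # [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
```

Each inner list is what the LSTM would see at one time step, which is why the dense-only branch can skip the reshape and consume the flat vector directly.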

Second model (static words):

model = tf.keras.models.Sequential([
    tf.keras.layers.Input((21 * 2, )),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(20, activation='relu'),
    tf.keras.layers.Dropout(0.4),
    tf.keras.layers.Dense(10, activation='relu'),
    tf.keras.layers.Dense(NUM_CLASSES, activation='softmax')
])
  • Input Layer: Accepts input data with a shape of 42 (21 * 2), i.e. 21 hand keypoints with 2 coordinates (x, y) each.
  • Dropout Layer (0.2): Applies dropout regularization to randomly deactivate 20% of the input units during training to prevent overfitting.
  • Dense Layer (20 units): A fully connected layer with 20 units, applying the Rectified Linear Unit (ReLU) activation function for non-linearity.
  • Dropout Layer (0.4): Another dropout layer, this time with a rate of 40%.
  • Dense Layer (10 units): Another fully connected layer with 10 units and ReLU activation.
  • Output Layer: Produces the final classification output with NUM_CLASSES units and softmax activation, suitable for multi-class classification tasks.
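To give a feel for how small this network is, the sketch below counts its trainable parameters by hand. A dense layer has `inputs * units` weights plus `units` biases, and dropout layers add none. The helper names are hypothetical, and `NUM_CLASSES = 14` is an assumption based on the 14 static words mentioned earlier; the actual value used in training may differ.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of one fully connected layer."""
    return n_inputs * n_units + n_units


def count_params(input_size, layer_units):
    """Sum the parameters of a stack of dense layers (dropout adds none)."""
    total = 0
    n_inputs = input_size
    for n_units in layer_units:
        total += dense_params(n_inputs, n_units)
        n_inputs = n_units  # this layer's output feeds the next layer
    return total


NUM_CLASSES = 14  # assumption: one class per static word

total = count_params(21 * 2, [20, 10, NUM_CLASSES])
print(total)  # 860 + 210 + 154 = 1224
```

Under that assumption the whole classifier has on the order of a thousand parameters, which is consistent with the goal of fast training and real-time inference.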

Model Training

[Figure: Training]

Training Results

Model for recognition of actions:

[Figures: action_loss, cm_action]

Model for recognition of static words:

[Figures: static_loss, cm_static]

Results

[Figure: Result1]
