
NCUE-EE-AIAL/Two-step-Authentication-Multi-biometric-System


IET · arXiv

A two-step biometric authentication system combining face recognition (VGG16 + MTCNN) and voice recognition (ResNet + Triplet Loss) for robust identity verification.

Key Features · Architecture · Getting Started · Results


Key Features

👤 Face Recognition

Fine-tuned VGG16 with MTCNN face detection, data augmentation, and two-stage training (frozen → unfrozen layers)

🎙️ Voice Preprocessing

VAD (Voice Activity Detection) and Fbank feature extraction, with FLAC-to-WAV conversion for LibriSpeech

⚡ Real-Time Inference

Webcam-based face capture and live speaker verification with confidence thresholds

🔊 Voice Recognition

Custom ResNet architecture with triplet loss and cosine similarity for speaker verification

🔄 Two-Stage Training

Random-batch pre-training followed by selected-batch refinement for optimized convergence

📊 Comprehensive Evaluation

Accuracy, EER, precision, recall, and F-measure metrics, with training-curve visualization
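The voice-verification objective above (triplet loss scored with cosine similarity) can be sketched in plain NumPy. The margin value below is an illustrative assumption, not the value used in `src/triplet_loss.py`:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Cosine-similarity triplet loss: push the anchor-negative similarity
    at least `margin` below the anchor-positive similarity."""
    sim_ap = cosine_similarity(anchor, positive)
    sim_an = cosine_similarity(anchor, negative)
    return max(0.0, sim_an - sim_ap + margin)

# Toy embeddings: the positive is close to the anchor, the negative is not.
anchor   = np.array([1.0, 0.0, 0.0])
positive = np.array([0.9, 0.1, 0.0])
negative = np.array([0.0, 1.0, 0.0])
print(triplet_loss(anchor, positive, negative))  # 0.0 -> triplet already satisfied
```

During training, minimizing this quantity over many triplets pulls same-speaker embeddings together and pushes different-speaker embeddings apart.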


Architecture

System Prototype

(Figure: two-step authentication workflow)
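The two-step flow can be read as a simple gate: the voice check only runs after the face check passes. A minimal sketch, in which the function name and both threshold values are illustrative assumptions rather than the system's actual parameters:

```python
def authenticate(face_confidence, voice_similarity,
                 face_threshold=0.90, voice_threshold=0.80):
    """Two-step decision: voice verification only matters once the face
    step passes. Thresholds here are illustrative, not the paper's."""
    if face_confidence < face_threshold:
        return "rejected at face step"
    if voice_similarity < voice_threshold:
        return "rejected at voice step"
    return "authenticated"

print(authenticate(0.97, 0.85))  # authenticated
```

Gating the expensive voice step behind the face step keeps rejected attempts cheap while requiring both modalities for acceptance.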

Face Recognition Model

(Figure: face recognition model architecture)

Voice Recognition Model

graph TD
    subgraph CNNs
        A2[Input Layer] --> B2[ResNet Block: filter=64]
        B2 --> C2[ResNet Block: filter=128]
        C2 --> D2[ResNet Block: filter=256]
        D2 --> G2[ResNet Block: filter=512]

        G2 --> N2[Reshape & Mean]
        N2 --> P2[Dense 512]
        P2 --> Q2[Output Layer]
    end

    subgraph ResNet block
        A3[Input Tensor] --> B3[Conv2D Layer: kernel_size=5]
        B3 --> C3[BatchNormalization]
        C3 --> D3[Clipped ReLU]
        D3 --> E3[Identity Block * 3]
        E3 --> F3[Output Tensor]
    end

    subgraph Identity Block
        A[Input Tensor] --> B[Conv2D Layer: kernel_size=1]
        A[Input Tensor] --> J[+]
        B --> C[BatchNorm -> Clipped ReLU]
        C --> E[Conv2D Layer: kernel_size=3]
        E --> F[BatchNorm -> Clipped ReLU]
        F --> H[Conv2D Layer: kernel_size=1]
        H --> I[BatchNormalization]
        I --> J
        J --> K[Clipped ReLU]
        K --> L[Output Tensor]
    end
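A minimal `tf.keras` sketch of the identity block in the diagram (1×1 → 3×3 → 1×1 convolutions with a skip connection and clipped ReLU activations). The clip value of 20 and the filter/shape choices are assumptions for illustration, not values read from `src/models.py`:

```python
import tensorflow as tf
from tensorflow.keras import layers

def clipped_relu(x, max_value=20.0):  # clip value is an assumption
    return layers.ReLU(max_value=max_value)(x)

def identity_block(x, filters):
    """Bottleneck identity block from the diagram: 1x1 -> 3x3 -> 1x1
    convolutions, batch norm after each, plus a skip connection."""
    shortcut = x
    y = layers.Conv2D(filters, 1, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = clipped_relu(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = clipped_relu(y)
    y = layers.Conv2D(filters, 1, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.Add()([shortcut, y])   # residual connection
    return clipped_relu(y)

inputs = tf.keras.Input(shape=(64, 64, 64))
outputs = identity_block(inputs, 64)
model = tf.keras.Model(inputs, outputs)
print(model.output_shape)  # (None, 64, 64, 64)
```

Because the block uses `padding="same"` and matching filter counts, its output shape equals its input shape, which is what lets the skip connection add the two tensors directly.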

Dataset

| Modality | Source | Details |
|----------|--------|---------|
| Face | Custom dataset | Collected from EE's class students (see `Dataset/` folder) |
| Voice | LibriSpeech | `train-clean-360` for training, `test-clean` for evaluation |

Getting Started

Project Structure

.
├── image_preprocessing.py   # MTCNN face detection & cropping
├── train_face.ipynb         # VGG16 fine-tuning for face recognition
├── test_face.py             # Webcam-based face verification
├── voice_preprocessing.py   # FLAC→WAV, VAD, Fbank extraction
├── train_voice.py           # Two-stage voice model training
├── test_voice.py            # Speaker verification evaluation
├── src/
│   ├── models.py            # Model architectures
│   ├── triplet_loss.py      # Triplet loss implementation
│   ├── random_batch.py      # Random batch sampling
│   ├── select_batch.py      # Selected batch sampling
│   ├── silence_detector.py  # Audio silence detection
│   ├── constants.py         # Configuration constants
│   └── utils.py             # Utility functions
├── eval/
│   └── eval_metrics.py      # Evaluation metrics (EER, F-measure, etc.)
├── Dataset/                 # Face image dataset
├── doc/                     # Architecture diagrams & result graphs
└── checkpoints_sample/      # Sample model checkpoints

Usage

1. Face Recognition Pipeline

# Step 1: Preprocess face images (detect & crop faces)
python image_preprocessing.py

# Step 2: Train the face recognition model (open in Jupyter)
jupyter notebook train_face.ipynb

# Step 3: Test with webcam
python test_face.py
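The cropping in step 1 can be sketched as follows. `crop_face` is a hypothetical helper, and the bounding box it expects is the `(x, y, width, height)` format that MTCNN's `detect_faces()` returns (e.g. `detector.detect_faces(image)[0]['box']`):

```python
import numpy as np

def crop_face(image, box, margin=0):
    """Crop a face region from an H x W x 3 image, clamping the box
    (plus an optional margin) to the image bounds."""
    x, y, w, h = box
    x0 = max(0, x - margin)
    y0 = max(0, y - margin)
    x1 = min(image.shape[1], x + w + margin)
    y1 = min(image.shape[0], y + h + margin)
    return image[y0:y1, x0:x1]

# Synthetic 100x100 RGB image with a fake detection box.
img = np.zeros((100, 100, 3), dtype=np.uint8)
face = crop_face(img, (30, 20, 40, 50))
print(face.shape)  # (50, 40, 3)
```

Clamping to the image bounds matters because MTCNN boxes near an image edge (or a positive margin) can otherwise index outside the array.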

2. Voice Recognition Pipeline

# Step 1: Preprocess voice data (FLAC→WAV, VAD, Fbank)
python voice_preprocessing.py

# Step 2: Train the voice recognition model
python train_voice.py

# Step 3: Evaluate speaker verification
python test_voice.py
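The VAD in step 1 can be approximated by a frame-energy gate. This is a simplified stand-in for `src/silence_detector.py`, with the frame length and threshold chosen for illustration:

```python
import numpy as np

def energy_vad(signal, frame_len=400, threshold=0.01):
    """Mark each frame as speech (True) or silence (False) by comparing
    its mean energy against a fixed threshold."""
    n_frames = len(signal) // frame_len
    flags = []
    for i in range(n_frames):
        frame = signal[i * frame_len:(i + 1) * frame_len]
        flags.append(float(np.mean(frame ** 2)) > threshold)
    return flags

# 1 s of silence followed by 1 s of a 440 Hz tone, at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
signal = np.concatenate([np.zeros(sr), 0.5 * np.sin(2 * np.pi * 440 * t)])
flags = energy_vad(signal)
print(flags[:2], flags[-2:])  # [False, False] [True, True]
```

Dropping the `False` frames before Fbank extraction keeps silent stretches of LibriSpeech recordings from diluting the speaker embeddings.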

Script Details

| Script | Description |
|--------|-------------|
| `image_preprocessing.py` | Detects and crops faces from images using MTCNN; reads paths from a CSV and saves cropped faces, preserving the directory structure |
| `train_face.ipynb` | Fine-tunes VGG16: freezes conv layers → trains custom FC layers → unfreezes and fine-tunes with a lower learning rate |
| `test_face.py` | Captures a webcam frame, crops the face via MTCNN, runs model prediction, and outputs the matched label if confidence exceeds the threshold |
| `voice_preprocessing.py` | Processes LibriSpeech files (speaker-name format): converts FLAC→WAV, then applies VAD and Fbank feature extraction |
| `train_voice.py` | Two-stage training: random batches for initial convergence, then selected batches for refinement, with per-epoch validation |
| `test_voice.py` | Evaluates speaker verification with the triplet-loss model and cosine similarity; reports accuracy, EER, precision, recall, and F-measure |
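The EER reported by `test_voice.py` is the operating point where the false-acceptance and false-rejection rates meet as a decision threshold sweeps the similarity scores. A minimal NumPy sketch of that idea (not the repository's `eval/eval_metrics.py` implementation):

```python
import numpy as np

def compute_eer(scores, labels):
    """Equal Error Rate: sweep a threshold over the scores and return the
    mean of FAR and FRR at the threshold where they are closest."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    best_diff, eer = np.inf, 1.0
    for thr in np.sort(np.unique(scores)):
        accept = scores >= thr
        far = np.mean(accept[~labels])   # impostor pairs accepted
        frr = np.mean(~accept[labels])   # genuine pairs rejected
        if abs(far - frr) < best_diff:
            best_diff, eer = abs(far - frr), (far + frr) / 2
    return eer

# Genuine pairs (label 1) score high, impostor pairs score low -> EER is 0.
scores = [0.9, 0.8, 0.85, 0.2, 0.1, 0.3]
labels = [1, 1, 1, 0, 0, 0]
print(compute_eer(scores, labels))  # 0.0
```

A lower EER means the genuine and impostor score distributions overlap less, which is why it is the headline metric for speaker verification.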

Results

| System | Accuracy | Equal Error Rate | Precision | Recall |
|--------|----------|------------------|-----------|--------|
| Face Recognition | 95.135% | - | 96.317% | 95.153% |
| Voice Recognition | 99.1% | 3.456% | 86.48% | 88.65% |


Training and validation curves for face recognition: (a) Accuracy (b) Loss


Training and validation curves for voice recognition: (a) EER (b) Loss


Acknowledgement

This research is supported by TEEP (Taiwan Experience Education Program) at National Changhua University of Education.


About

[ICETA 2024] A real-time, two-step multi-biometric authentication system combining VGG16-based facial recognition and ResNet-based voice verification.
