A two-step biometric authentication system combining face recognition (VGG16 + MTCNN) and voice recognition (ResNet + Triplet Loss) for robust identity verification.
- Fine-tuned VGG16 with MTCNN face detection, data augmentation, and two-stage training (frozen → unfrozen layers)
- VAD (Voice Activity Detection) and Fbank feature extraction with FLAC-to-WAV conversion for LibriSpeech
- Webcam-based face capture and live speaker verification with confidence thresholds
- Custom ResNet architecture with triplet loss and cosine similarity for speaker verification
- Random batch pre-training followed by selected batch refinement for optimized convergence
- Accuracy, EER, precision, recall, and F-measure metrics with training curve visualization
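The VAD step above can be illustrated with a minimal energy-based sketch (numpy only; the frame sizes and the `-35 dB` threshold are illustrative assumptions, not the parameters used by `voice_preprocessing.py`):

```python
import numpy as np

def energy_vad(signal, sample_rate=16000, frame_ms=25, hop_ms=10, threshold_db=-35.0):
    """Energy-based voice activity detection: returns a boolean mask per frame,
    True where the frame's RMS energy (relative to the signal peak) exceeds the
    threshold. A simplified stand-in for the project's silence detection."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    mask = np.zeros(n_frames, dtype=bool)
    ref = np.max(np.abs(signal)) + 1e-12  # peak reference for dB scaling
    for i in range(n_frames):
        frame = signal[i * hop_len : i * hop_len + frame_len]
        rms = np.sqrt(np.mean(frame ** 2)) + 1e-12
        mask[i] = 20 * np.log10(rms / ref) > threshold_db
    return mask
```

Frames flagged `False` (silence) would be dropped before Fbank feature extraction.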
graph TD
subgraph CNNs
A2[Input Layer] --> B2[ResNet Block: filter=64]
B2 --> C2[ResNet Block: filter=128]
C2 --> D2[ResNet Block: filter=256]
D2 --> G2[ResNet Block: filter=512]
G2 --> N2[Reshape & Mean]
N2 --> P2[Dense 512]
P2 --> Q2[Output Layer]
end
subgraph ResNet block
A3[Input Tensor] --> B3[Conv2D Layer: kernel_size=5]
B3 --> C3[BatchNormalization]
C3 --> D3[Clipped ReLU]
D3 --> E3[Identity Block * 3]
E3 --> F3[Output Tensor]
end
subgraph Identity Block
A[Input Tensor] --> B[Conv2D Layer: kernel_size=1]
A[Input Tensor] --> J[+]
B --> C[BatchNorm -> Clipped ReLU]
C --> E[Conv2D Layer: kernel_size=3]
E --> F[BatchNorm -> Clipped ReLU]
F --> H[Conv2D Layer: kernel_size=1]
H --> I[BatchNormalization]
I --> J
J --> K[Clipped ReLU]
K --> L[Output Tensor]
end
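The non-standard activation in the diagram is the clipped ReLU, and the identity block ends by adding the conv branch back onto the unmodified input before that activation. A minimal numpy sketch of those two pieces (the cap of 20 is an assumed value, common in Deep Speaker-style models, not confirmed by this repo):

```python
import numpy as np

def clipped_relu(x, clip=20.0):
    # The "Clipped ReLU" in the diagram: ReLU with activations capped at `clip`.
    return np.minimum(np.maximum(x, 0.0), clip)

def residual_merge(shortcut, branch, clip=20.0):
    # The final "+" then Clipped ReLU of the identity block: the conv branch
    # output is added to the untouched input tensor, then activated.
    return clipped_relu(shortcut + branch, clip)
```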
| Modality | Source | Details |
|---|---|---|
| Face | Custom dataset | Collected from EE's class students (see Dataset/ folder) |
| Voice | LibriSpeech | train-clean-360 for training, test-clean for evaluation |
.
├── image_preprocessing.py   # MTCNN face detection & cropping
├── train_face.ipynb         # VGG16 fine-tuning for face recognition
├── test_face.py             # Webcam-based face verification
├── voice_preprocessing.py   # FLAC→WAV, VAD, Fbank extraction
├── train_voice.py           # Two-stage voice model training
├── test_voice.py            # Speaker verification evaluation
├── src/
│   ├── models.py            # Model architectures
│   ├── triplet_loss.py      # Triplet loss implementation
│   ├── random_batch.py      # Random batch sampling
│   ├── select_batch.py      # Selected batch sampling
│   ├── silence_detector.py  # Audio silence detection
│   ├── constants.py         # Configuration constants
│   └── utils.py             # Utility functions
├── eval/
│   └── eval_metrics.py      # Evaluation metrics (EER, F-measure, etc.)
├── Dataset/                 # Face image dataset
├── doc/                     # Architecture diagrams & result graphs
└── checkpoints_sample/      # Sample model checkpoints
# Step 1: Preprocess face images (detect & crop faces)
python image_preprocessing.py

# Step 2: Train the face recognition model (open in Jupyter)
jupyter notebook train_face.ipynb

# Step 3: Test with webcam
python test_face.py

# Step 1: Preprocess voice data (FLAC→WAV, VAD, Fbank)
python voice_preprocessing.py

# Step 2: Train the voice recognition model
python train_voice.py

# Step 3: Evaluate speaker verification
python test_voice.py

| Script | Description |
|---|---|
| `image_preprocessing.py` | Detects and crops faces from images using MTCNN, reads paths from CSV, saves cropped faces maintaining directory structure |
| `train_face.ipynb` | Fine-tunes VGG16: freezes conv layers → trains custom FC layers → unfreezes and fine-tunes with lower learning rate |
| `test_face.py` | Captures webcam frame, crops face via MTCNN, runs model prediction, outputs match label if confidence exceeds threshold |
| `voice_preprocessing.py` | Processes LibriSpeech files (speaker-name format), converts FLAC→WAV, applies VAD and Fbank feature extraction |
| `train_voice.py` | Two-stage training: random batches for initial convergence, then selected batches for refinement, with per-epoch validation |
| `test_voice.py` | Evaluates speaker verification using triplet loss model with cosine similarity; reports accuracy, EER, precision, recall, F-measure |
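The triplet objective that `train_voice.py` and `test_voice.py` rely on can be sketched in numpy (the margin of 0.1 is an illustrative assumption, not necessarily the value in `src/triplet_loss.py`):

```python
import numpy as np

def cosine_similarity(a, b):
    # Row-wise cosine similarity between two batches of embeddings.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return np.sum(a * b, axis=1)

def triplet_loss(anchor, positive, negative, margin=0.1):
    # Push sim(anchor, positive) above sim(anchor, negative) by at least
    # `margin`; zero loss once the triplet is satisfied.
    sim_ap = cosine_similarity(anchor, positive)
    sim_an = cosine_similarity(anchor, negative)
    return np.mean(np.maximum(sim_an - sim_ap + margin, 0.0))
```

At verification time the same cosine similarity is compared against a threshold to accept or reject a claimed identity.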
| System | Accuracy | Equal Error Rate | Precision | Recall |
|---|---|---|---|---|
| Face Recognition | 95.135% | - | 96.317% | 95.153% |
| Voice Recognition | 99.1% | 3.456% | 86.48% | 88.65% |
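The Equal Error Rate reported above is the operating point where the false accept rate equals the false reject rate. A minimal threshold-scan sketch in numpy (the implementation in `eval/eval_metrics.py` may differ, e.g. by interpolating the ROC curve):

```python
import numpy as np

def equal_error_rate(scores, labels):
    """EER from similarity scores; labels: 1 = same speaker, 0 = different.
    Scans candidate thresholds and returns the mean of FAR and FRR at the
    point where they are closest."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best_gap, eer = np.inf, 1.0
    for t in np.unique(scores):
        far = np.mean(scores[labels == 0] >= t)  # impostors accepted
        frr = np.mean(scores[labels == 1] < t)   # genuine pairs rejected
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer
```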
Training and validation curves for face recognition: (a) Accuracy (b) Loss
Training and validation curves for voice recognition: (a) EER (b) Loss
This research is supported by TEEP (Taiwan Experience Education Program) at National Changhua University of Education.