
kamalmahmud/RWF2000_TSM


🥊 RWF2000 Violence Detection with TSM + MobileNet

A deep learning pipeline for binary violence detection in videos using the RWF-2000 dataset and a Temporal Shift Module (TSM) applied on top of a lightweight MobileNet backbone.


📌 Overview

This project tackles the task of automatically detecting violent behavior in surveillance-style video clips. Each video is classified into one of two categories:

| Label | Class |
|---|---|
| 0 | NonFight |
| 1 | Fight |

The model combines the efficiency of MobileNet with the temporal modeling power of TSM, enabling it to reason across video frames without the heavy computation of 3D convolutions.


📂 Project Structure

RWF2000_TSM/
├── config.py        # All hyperparameters and paths (CFG class)
├── train.py         # Training and evaluation loop
├── data/            # Dataset loading (RWF2000DatasetJPEG)
├── models/          # TSMMobileNet model definition
├── util/            # Frame extraction utilities
└── README.md

🧠 Model Architecture

TSMMobileNet — Temporal Shift Module wrapped around MobileNet:

  • Backbone: MobileNet (pretrained on ImageNet)
  • Temporal Module: TSM shifts a portion of channels along the time dimension, allowing the 2D backbone to implicitly capture motion across frames — zero extra parameters
  • Input format: (B, T×3, H, W) — segments are channel-stacked
  • Output: raw 2-class logits (Fight / NonFight); the softmax is applied inside CrossEntropyLoss during training
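The following minimal PyTorch sketch shows how the channel-stacked input and the shift fit together. It is not the repository's TSMMobileNet code. The helper names `unstack_segments` and `temporal_shift` are hypothetical. The sketch also assumes the default setting from the original TSM paper: 1/8 of the channels move one frame backward in time, another 1/8 move one frame forward, and the rest stay in place (`fold_div=8`).

```python
import torch


def unstack_segments(x: torch.Tensor, num_segments: int) -> torch.Tensor:
    """(B, T*3, H, W) channel-stacked clips -> (B*T, 3, H, W) frames for a 2D CNN."""
    b, tc, h, w = x.shape
    return x.reshape(b * num_segments, tc // num_segments, h, w)


def temporal_shift(x: torch.Tensor, num_segments: int, fold_div: int = 8) -> torch.Tensor:
    """Shift part of the channels along time; x has shape (B*T, C, H, W)."""
    bt, c, h, w = x.shape
    x = x.reshape(bt // num_segments, num_segments, c, h, w)
    fold = c // fold_div
    out = torch.zeros_like(x)  # channels shifted past the clip edge become zeros
    out[:, :-1, :fold] = x[:, 1:, :fold]                  # frame t receives these channels from t+1
    out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]  # frame t receives these channels from t-1
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]             # remaining channels are unchanged
    return out.reshape(bt, c, h, w)
```

The shift only moves memory, so it adds no weights. In the TSM paper it sits inside the residual branch of a block, which lets the identity path keep the unshifted features. This README does not say where the repository inserts it inside MobileNet.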

⚙️ Configuration

All settings live in config.py under the CFG class:

| Parameter | Value | Description |
|---|---|---|
| NUM_SEGMENTS | 8 | Frames sampled per video (T) |
| IMG_SIZE | 224 | Input spatial resolution (pixels) |
| BATCH_SIZE | 8 | Training batch size |
| EPOCHS | 30 | Number of training epochs |
| LR | 1e-3 | Initial learning rate (AdamW) |
| LR_STEPS | [5, 15] | Epoch milestones for LR decay |
| LR_GAMMA | 0.1 | LR decay factor |
| WEIGHT_DECAY | 1e-3 | AdamW weight decay |
| NUM_WORKERS | 2 | DataLoader worker processes |
| SEED | 42 | Random seed for reproducibility |
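The table maps onto a configuration class roughly like the one below. This is a sketch and not the repository's `config.py`. `DATA_ROOT` and the other paths are left out because the README gives only an elided Kaggle path. The helper `lr_at_epoch` is hypothetical. It assumes `LR_STEPS` and `LR_GAMMA` implement plain step decay, where the learning rate is multiplied by `LR_GAMMA` at each milestone epoch.

```python
from bisect import bisect_right


class CFG:
    # Values copied from the configuration table; paths are intentionally omitted.
    NUM_SEGMENTS = 8
    IMG_SIZE = 224
    BATCH_SIZE = 8
    EPOCHS = 30
    LR = 1e-3
    LR_STEPS = [5, 15]
    LR_GAMMA = 0.1
    WEIGHT_DECAY = 1e-3
    NUM_WORKERS = 2
    SEED = 42


def lr_at_epoch(epoch: int) -> float:
    """Learning rate in effect during a 0-indexed epoch under step decay (assumed)."""
    milestones_passed = bisect_right(CFG.LR_STEPS, epoch)
    return CFG.LR * CFG.LR_GAMMA ** milestones_passed
```

Under that reading, the learning rate is 1e-3 for epochs 0 to 4, 1e-4 for epochs 5 to 14, and 1e-5 from epoch 15 onward.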

📦 Dataset

RWF-2000 — A large-scale video dataset for violence detection:

  • 2,000 video clips collected from surveillance cameras
  • 50/50 split between fight and non-fight clips
  • Train/validation split provided by the dataset

The pipeline pre-extracts frames to JPEG before training for faster I/O:

extract_frames_to_jpeg(
    data_root=CFG.DATA_ROOT,
    out_root=FRAME_ROOT,
    num_segments=CFG.NUM_SEGMENTS,
    img_size=CFG.IMG_SIZE + 32,  # 256px → RandomCrop(224)
    quality=95,
)
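The README does not show the body of `extract_frames_to_jpeg`. A per-video extractor in the same spirit might look like the OpenCV sketch below. The function names, the center-of-segment sampling rule, the square resize to `img_size`, and the output file naming are all assumptions, not the repository's code. OpenCV is imported inside the extractor, so the index helper works without it.

```python
import os


def segment_indices(num_frames: int, num_segments: int) -> list[int]:
    """Pick the center frame of each of `num_segments` equal-length segments."""
    if num_frames <= 0:
        raise ValueError("video has no frames")
    seg_len = num_frames / num_segments
    return [min(int(seg_len * i + seg_len / 2), num_frames - 1) for i in range(num_segments)]


def extract_video(video_path: str, out_dir: str, num_segments: int = 8,
                  img_size: int = 256, quality: int = 95) -> list[str]:
    """Hypothetical per-video extractor: save `num_segments` JPEG frames."""
    import cv2  # only needed for the actual decoding/encoding

    cap = cv2.VideoCapture(video_path)
    try:
        num_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        indices = segment_indices(num_frames, num_segments)
        os.makedirs(out_dir, exist_ok=True)
        paths = []
        for k, idx in enumerate(indices):
            cap.set(cv2.CAP_PROP_POS_FRAMES, idx)  # seek to the chosen frame
            ok, frame = cap.read()
            if not ok:
                break  # stop at the first unreadable frame
            frame = cv2.resize(frame, (img_size, img_size))  # square resize (assumed)
            path = os.path.join(out_dir, f"{k:02d}.jpg")
            cv2.imwrite(path, frame, [cv2.IMWRITE_JPEG_QUALITY, quality])
            paths.append(path)
        return paths
    finally:
        cap.release()
```

With `num_segments=8` and `img_size=256`, each video becomes eight 256×256 JPEGs, matching the arguments in the call above.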

Note: The dataset path is configured for Kaggle: /kaggle/input/.../RWF-2000
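The `+ 32` in the extraction call leaves room for a 224-pixel random crop at training time. The sketch below shows one way a dataset such as `RWF2000DatasetJPEG` could turn the decoded frames of a clip into the channel-stacked `(T×3, H, W)` input. Several details are assumptions, because the README does not spell out its augmentation: the function name `clip_to_input`, the crop and flip shared across all frames, the 0.5 flip probability, and the ImageNet mean/std normalization. Reading the JPEGs from disk is omitted.

```python
import torch

# ImageNet channel statistics: an assumption, chosen because the backbone is
# ImageNet-pretrained; the README does not state the normalization it uses.
MEAN = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
STD = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)


def clip_to_input(frames: torch.Tensor, crop: int = 224, train: bool = True) -> torch.Tensor:
    """Map T frames (T, 3, H, W) with values in [0, 1] to one (T*3, crop, crop) tensor.

    One crop offset and one flip decision are drawn per clip and applied to
    every frame, so augmentation does not introduce fake motion between frames.
    """
    t, c, h, w = frames.shape
    if train:
        top = torch.randint(0, h - crop + 1, (1,)).item()
        left = torch.randint(0, w - crop + 1, (1,)).item()
    else:
        # Deterministic center crop for validation.
        top, left = (h - crop) // 2, (w - crop) // 2
    out = frames[:, :, top:top + crop, left:left + crop]
    if train and torch.rand(1).item() < 0.5:
        out = out.flip(-1)  # horizontal flip, same for all T frames
    out = (out - MEAN) / STD
    # Channel-stack the frames: [frame0 RGB, frame1 RGB, ...].
    return out.reshape(t * c, crop, crop)
```

With 256-pixel frames, a training crop can start anywhere in a 33×33 range of offsets. Validation always takes the central 224×224 window.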


🚀 Training

python train.py

The training loop uses:

  • Loss: CrossEntropyLoss
  • Optimizer: AdamW (lr=1e-3, weight_decay=1e-3)
  • Epochs: 40 (the value used in train.py; this differs from CFG.EPOCHS = 30 in the configuration table)
  • Hardware: CUDA GPU (falls back to CPU automatically)

Training progress is printed every 100 batches. After each epoch, validation accuracy is computed over the full val set.
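The loop described above can be sketched as follows. The names `make_optimizer`, `train_one_epoch`, `evaluate`, and `fit` are hypothetical. Wiring `LR_STEPS` and `LR_GAMMA` into PyTorch's `MultiStepLR` is an assumption based on the configuration table, since the training list itself does not mention a scheduler.

```python
import torch
import torch.nn as nn


def make_optimizer(model, lr=1e-3, weight_decay=1e-3, lr_steps=(5, 15), gamma=0.1):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
    # Step decay at the milestone epochs (assumed to mirror LR_STEPS / LR_GAMMA).
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=list(lr_steps), gamma=gamma)
    return optimizer, scheduler


def train_one_epoch(model, loader, optimizer, device, log_every=100):
    criterion = nn.CrossEntropyLoss()  # takes raw logits
    model.train()
    for i, (clips, labels) in enumerate(loader, start=1):
        clips, labels = clips.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(clips), labels)
        loss.backward()
        optimizer.step()
        if i % log_every == 0:
            print(f"batch {i}: loss {loss.item():.4f}")


@torch.no_grad()
def evaluate(model, loader, device):
    """Accuracy over the whole validation set."""
    model.eval()
    correct = total = 0
    for clips, labels in loader:
        preds = model(clips.to(device)).argmax(dim=1)
        correct += (preds == labels.to(device)).sum().item()
        total += labels.numel()
    return correct / max(total, 1)


def fit(model, train_loader, val_loader, epochs, **opt_kwargs):
    # Use the GPU when present, otherwise fall back to the CPU.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    optimizer, scheduler = make_optimizer(model, **opt_kwargs)
    history = []
    for epoch in range(epochs):
        train_one_epoch(model, train_loader, optimizer, device)
        scheduler.step()  # once per epoch, after the optimizer steps
        acc = evaluate(model, val_loader, device)
        print(f"epoch {epoch + 1}/{epochs}: val acc {acc:.3f}")
        history.append(acc)
    return history
```

Calling `scheduler.step()` once per epoch is what makes the milestones count epochs rather than batches.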


📊 Results

| Metric | Value |
|---|---|
| Validation accuracy | 72% |
| Dataset | RWF-2000 |
| Model | TSM + MobileNet |
| Segments (T) | 8 |
| Image size | 224×224 |

🔧 Requirements

torch
torchvision
numpy
opencv-python   # for frame extraction

Install with:

pip install torch torchvision numpy opencv-python

🗂️ Running on Kaggle

This project is designed to run on Kaggle with GPU acceleration:

  1. Add the RWF-2000 dataset to your Kaggle notebook
  2. Update CFG.DATA_ROOT in config.py if needed (default points to the Kaggle input path)
  3. Run train.py — frames will be extracted to /kaggle/working/rwf2000_frames/
  4. The best model checkpoint is saved to /kaggle/working/tsm_mobilenet_best.pth
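Saving the best checkpoint and then reusing it on a single clip could look like the sketch below. The helpers `save_if_best`, `load_best`, and `predict` are hypothetical. The sketch assumes "best" means highest validation accuracy and that only the model's state dict is saved. The default path and the label mapping come from this README.

```python
import torch
import torch.nn as nn

BEST_PATH = "/kaggle/working/tsm_mobilenet_best.pth"  # path given in this README
LABELS = {0: "NonFight", 1: "Fight"}


def save_if_best(model: nn.Module, val_acc: float, best_acc: float, path: str = BEST_PATH) -> float:
    """Overwrite the checkpoint only when validation accuracy improves."""
    if val_acc > best_acc:
        torch.save(model.state_dict(), path)
        return val_acc
    return best_acc


def load_best(model: nn.Module, path: str = BEST_PATH, device: str = "cpu") -> nn.Module:
    state = torch.load(path, map_location=device)
    model.load_state_dict(state)
    return model.to(device)


@torch.no_grad()
def predict(model: nn.Module, clip: torch.Tensor) -> str:
    """clip: one channel-stacked (T*3, H, W) tensor; returns the class name."""
    model.eval()
    device = next(model.parameters()).device
    logits = model(clip.unsqueeze(0).to(device))  # add a batch dimension
    return LABELS[logits.argmax(dim=1).item()]
```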


About

Video violence detection with TSM + MobileNet on the RWF-2000 dataset. A PyTorch training pipeline with frame extraction and data augmentation, reaching 72% validation accuracy.
