A deep learning pipeline for binary violence detection in videos using the RWF-2000 dataset and a Temporal Shift Module (TSM) applied on top of a lightweight MobileNet backbone.
This project tackles the task of automatically detecting violent behavior in surveillance-style video clips. Each video is classified into one of two categories:
| Label | Class |
|---|---|
| `0` | `NonFight` |
| `1` | `Fight` |
The model combines the efficiency of MobileNet with the temporal modeling power of TSM, enabling it to reason across video frames without the heavy computation of 3D convolutions.
```
RWF2000_TSM/
├── config.py   # All hyperparameters and paths (CFG class)
├── train.py    # Training and evaluation loop
├── data/       # Dataset loading (RWF2000DatasetJPEG)
├── models/     # TSMMobileNet model definition
├── util/       # Frame extraction utilities
└── README.md
```
TSMMobileNet — Temporal Shift Module wrapped around MobileNet:
- Backbone: MobileNet (pretrained on ImageNet)
- Temporal Module: TSM shifts a portion of channels along the time dimension, allowing the 2D backbone to implicitly capture motion across frames — zero extra parameters
- Input format: `(B, T×3, H, W)`, where the T sampled frames (segments) are stacked along the channel axis
- Output: 2-class logits (Fight / NonFight)
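
The channel shift described above can be illustrated with a small NumPy sketch. It is not the project's `TSMMobileNet` code: the function names `unstack_segments` and `temporal_shift`, the frame-major channel ordering, and `shift_div=8` (one eighth of the channels shifted in each direction) are assumptions chosen for illustration.

```python
import numpy as np


def unstack_segments(clip, num_segments):
    """Turn a channel-stacked clip (B, T*3, H, W) into per-frame images (B*T, 3, H, W).

    Assumes the channels are ordered frame by frame (frame 0 RGB, frame 1 RGB, ...).
    """
    b, tc, h, w = clip.shape
    return clip.reshape(b, num_segments, tc // num_segments, h, w).reshape(-1, tc // num_segments, h, w)


def temporal_shift(x, num_segments, shift_div=8):
    """Shift a fraction of channels along time. x has shape (B*T, C, H, W)."""
    bt, c, h, w = x.shape
    x = x.reshape(bt // num_segments, num_segments, c, h, w)
    fold = c // shift_div  # number of channels shifted in each direction

    out = np.zeros_like(x)
    # First `fold` channels: frame t receives features from frame t+1.
    out[:, :-1, :fold] = x[:, 1:, :fold]
    # Next `fold` channels: frame t receives features from frame t-1.
    out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]
    # Remaining channels are left in place.
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]
    return out.reshape(bt, c, h, w)
```

The shift only moves existing values between neighbouring frames, which is why it adds no learnable parameters. Frames at the start and end of the clip receive zeros in the shifted channels.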
All settings live in config.py under the CFG class:
| Parameter | Value | Description |
|---|---|---|
| `NUM_SEGMENTS` | `8` | Frames sampled per video (T) |
| `IMG_SIZE` | `224` | Input spatial resolution |
| `BATCH_SIZE` | `8` | Training batch size |
| `EPOCHS` | `30` | Number of training epochs |
| `LR` | `1e-3` | Initial learning rate (AdamW) |
| `LR_STEPS` | `[5, 15]` | Epoch milestones for LR decay |
| `LR_GAMMA` | `0.1` | LR decay factor |
| `WEIGHT_DECAY` | `1e-3` | AdamW weight decay |
| `NUM_WORKERS` | `2` | DataLoader worker processes |
| `SEED` | `42` | Reproducibility seed |
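
To make the interaction between `LR`, `LR_STEPS`, and `LR_GAMMA` concrete, here is a minimal sketch of the resulting schedule. It assumes step decay in the style of PyTorch's `MultiStepLR` (the learning rate is multiplied by `LR_GAMMA` at each milestone epoch, with epochs counted from 0); the README does not name the scheduler, and `lr_at_epoch` is a hypothetical helper, not part of the project.

```python
from bisect import bisect_right


class CFG:
    # Subset of the settings in the table above.
    EPOCHS = 30
    LR = 1e-3
    LR_STEPS = [5, 15]
    LR_GAMMA = 0.1


def lr_at_epoch(epoch, base_lr=CFG.LR, milestones=CFG.LR_STEPS, gamma=CFG.LR_GAMMA):
    """Learning rate in effect during `epoch` (0-based) under step decay."""
    # bisect_right counts how many milestones have already been reached.
    return base_lr * gamma ** bisect_right(milestones, epoch)


if __name__ == "__main__":
    for epoch in (0, 4, 5, 14, 15, CFG.EPOCHS - 1):
        print(f"epoch {epoch:2d}: lr = {lr_at_epoch(epoch):.0e}")
```

Under these assumptions the rate stays at 1e-3 for epochs 0 to 4, drops to 1e-4 for epochs 5 to 14, and is 1e-5 from epoch 15 onward.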
RWF-2000 — A large-scale video dataset for violence detection:
- 2,000 video clips collected from surveillance cameras
- 50/50 split between fight and non-fight clips
- Train/validation split provided by the dataset
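
As a rough illustration of how the provided split maps onto the labels, the sketch below collects `(path, label)` pairs for one split. The directory layout (`<root>/<split>/<Fight|NonFight>/*.avi`), the split names, and the helper `list_clips` are assumptions for illustration; the project's own loader is `RWF2000DatasetJPEG`, which reads pre-extracted JPEG frames instead.

```python
from pathlib import Path

# Label mapping used by this project: NonFight = 0, Fight = 1.
LABELS = {"NonFight": 0, "Fight": 1}


def list_clips(data_root, split):
    """Return a list of (video_path, label) for one split, e.g. "train" or "val".

    Assumes one sub-folder per class under each split folder.
    """
    root = Path(data_root) / split
    samples = []
    for class_name, label in LABELS.items():
        for path in sorted((root / class_name).glob("*.avi")):
            samples.append((path, label))
    return samples
```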
The pipeline pre-extracts frames to JPEG before training for faster I/O:
```python
extract_frames_to_jpeg(
    data_root=CFG.DATA_ROOT,
    out_root=FRAME_ROOT,
    num_segments=CFG.NUM_SEGMENTS,
    img_size=CFG.IMG_SIZE + 32,  # 256px → RandomCrop(224)
    quality=95,
)
```

Note: The dataset path is configured for Kaggle: `/kaggle/input/.../RWF-2000`
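
The README does not say how `extract_frames_to_jpeg` chooses which frames to keep. One common choice, shown here purely as an assumption, is to split the clip into `num_segments` equal parts and take the middle frame of each part. `segment_indices` is a hypothetical helper, not a function from `util/`.

```python
def segment_indices(num_frames, num_segments=8):
    """Pick one frame index per segment, taken from the centre of each segment."""
    if num_frames <= 0:
        raise ValueError("video has no frames")
    # Segment i covers [i * n / T, (i + 1) * n / T); take its midpoint.
    return [int((i + 0.5) * num_frames / num_segments) for i in range(num_segments)]


if __name__ == "__main__":
    # Example: a 150-frame clip sampled at T = 8.
    print(segment_indices(150, 8))  # [9, 28, 46, 65, 84, 103, 121, 140]
```

For clips shorter than `num_segments` frames this scheme repeats indices rather than failing, so every clip still yields exactly T frames.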
```
python train.py
```

The training loop uses:
- Loss: `CrossEntropyLoss`
- Optimizer: `AdamW(lr=1e-3, weight_decay=1e-3)`
- Epochs: 40 (as run in `train.py`; this differs from the `EPOCHS` value of 30 listed for `CFG`)
- Hardware: CUDA GPU (falls back to CPU automatically)
Training progress is printed every 100 batches. After each epoch, validation accuracy is computed over the full val set.
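
Computing accuracy over the full validation set means accumulating correct and total counts across batches, rather than averaging per-batch accuracies (the two differ when the last batch is smaller). The following is a minimal framework-free sketch of that bookkeeping; `validation_accuracy` and the plain-list batch format are assumptions for illustration, not the code in `train.py`.

```python
def argmax(values):
    """Index of the largest value in a sequence."""
    return max(range(len(values)), key=values.__getitem__)


def validation_accuracy(batches):
    """Accuracy over all samples.

    `batches` yields (logits, labels) pairs, where `logits` is a list of
    per-sample [NonFight, Fight] scores and `labels` is a list of 0/1 ints.
    """
    correct = 0
    total = 0
    for logits, labels in batches:
        preds = [argmax(row) for row in logits]
        correct += sum(int(p == y) for p, y in zip(preds, labels))
        total += len(labels)
    return correct / total if total else 0.0


if __name__ == "__main__":
    batches = [
        ([[2.0, 0.1], [0.2, 1.5]], [0, 1]),  # both correct
        ([[0.3, 0.9]], [0]),                 # smaller last batch, incorrect
    ]
    print(validation_accuracy(batches))  # 2 of 3 samples correct
```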
| Metric | Value |
|---|---|
| Validation Accuracy | 72% |
| Dataset | RWF-2000 |
| Model | TSM + MobileNet |
| Segments (T) | 8 |
| Image Size | 224×224 |
```
torch
torchvision
numpy
opencv-python  # for frame extraction
```

Install with:

```
pip install torch torchvision numpy opencv-python
```

This project is designed to run on Kaggle with GPU acceleration:
- Add the RWF-2000 dataset to your Kaggle notebook
- Update `CFG.DATA_ROOT` in `config.py` if needed (default points to the Kaggle input path)
- Run `train.py`; frames will be extracted to `/kaggle/working/rwf2000_frames/`
- The best model checkpoint is saved to `/kaggle/working/tsm_mobilenet_best.pth`