Temporal Action Localization in Untrimmed Videos via Multi-stage CNNs
Abstract
This paper exploits the effectiveness of deep networks in temporal action localization via three segment-based 3D ConvNets:
(1) a proposal network -- identifies candidate segments in a long video that may contain actions
(2) a classification network (very important for training) -- serves as initialization for the localization network
(3) a localization network -- fine-tunes the learned classification network to localize each action instance
Detailed descriptions of Segment-CNN

Multi-scale segment generation

Each frame is resized to 171 x 128 pixels.
For each untrimmed video X, the paper slides temporal windows of varied lengths (16, 32, 64, 128, 256, and 512 frames) with 75% overlap. Each window is then sampled down to 16 frames, so every segment has the same length regardless of window size.
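The window generation above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, the stride is derived from the 75% overlap (stride = length / 4), and evenly spaced sampling of the 16 frames is an assumption.

```python
def sliding_windows(num_frames, lengths=(16, 32, 64, 128, 256, 512), overlap=0.75):
    """Return (start, end) frame ranges, end exclusive, for every window length."""
    windows = []
    for length in lengths:
        # 75% overlap between consecutive windows means stride = length / 4.
        stride = max(1, int(length * (1 - overlap)))
        for start in range(0, num_frames - length + 1, stride):
            windows.append((start, start + length))
    return windows


def sample_frames(start, end, num_samples=16):
    """Pick num_samples evenly spaced frame indices from [start, end) (assumed scheme)."""
    step = (end - start) / num_samples
    return [start + int(i * step) for i in range(num_samples)]


if __name__ == "__main__":
    for window in sliding_windows(64):
        print(window, sample_frames(*window)[:4], "...")
```

Windows longer than the video are simply skipped in this sketch; how the paper handles very short videos is not recorded in these notes.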
Network architecture
The deep networks use C3D as the basic architecture in all stages.
conv1a(64) - pool1(1,1) -
-conv2a(128) - pool2(2,2) -
-conv3a(256) - conv3b(256) - pool3(2,2) -
-conv4a(512) - conv4b(512) - pool4(2,2) -
-conv5a(512) - conv5b(512) - pool5(2,2) -
-fc6(4096) - fc7(4096) - fc8(K+1)
Each input to this deep network is a segment s of dimension 171 x 128 x 16.
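As a quick check on the layer list, the temporal length of a 16-frame segment can be traced through the pooling layers. This sketch assumes that the two numbers in pool(a,b) are the temporal kernel size and temporal stride, and that the convolution layers preserve temporal length through padding; neither assumption is stated in these notes.

```python
# (name, temporal kernel, temporal stride): this reading of pool(a,b) is an assumption.
POOLS = [
    ("pool1", 1, 1),
    ("pool2", 2, 2),
    ("pool3", 2, 2),
    ("pool4", 2, 2),
    ("pool5", 2, 2),
]


def temporal_lengths(num_frames=16, pools=POOLS):
    """Return the temporal length after each pooling layer (no padding in pooling)."""
    lengths = {}
    t = num_frames
    for name, kernel, stride in pools:
        t = (t - kernel) // stride + 1
        lengths[name] = t
    return lengths


if __name__ == "__main__":
    print(temporal_lengths())
```

Under these assumptions the 16 input frames collapse to a single temporal step before fc6, which is consistent with feeding fixed-length 16-frame segments to every stage.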
Training procedure and impact of individual networks
Compare S-CNN / S-CNN(w/o proposal) / S-CNN(w/o classification) / S-CNN(w/o localization)
1) The proposal network

label k:{0,1}
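For each segment of a trimmed video, set its label as positive.
For candidate segments from an untrimmed video, assign a label according to the overlap (IoU) with the ground-truth instances: positive if the IoU with some ground-truth instance is above 0.7, or if the segment has the largest IoU with a ground-truth instance and that IoU is above 0.5; negative if the IoU is below 0.3.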
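The labeling rule for untrimmed videos can be sketched as below. The function names are hypothetical, segments and ground truths are assumed to be (start, end) pairs, and leaving segments that match neither rule unlabeled (None) is an assumption of this sketch rather than something these notes state.

```python
def temporal_iou(seg, gt):
    """Intersection over union of two (start, end) intervals."""
    inter = max(0.0, min(seg[1], gt[1]) - max(seg[0], gt[0]))
    union = (seg[1] - seg[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def proposal_labels(segments, ground_truths, hi=0.7, mid=0.5, lo=0.3):
    """Return 1 (action), 0 (background) or None (unused) for each segment."""
    ious = [[temporal_iou(s, g) for g in ground_truths] for s in segments]
    labels = []
    for row in ious:
        best = max(row, default=0.0)
        # Positive if this segment is the best match for some ground truth
        # and that best IoU is above the middle threshold.
        is_top = any(
            row[j] > mid and row[j] == max(r[j] for r in ious)
            for j in range(len(ground_truths))
        )
        if best > hi or is_top:
            labels.append(1)
        elif best < lo:
            labels.append(0)
        else:
            labels.append(None)  # assumed: ambiguous segments are not used
    return labels


if __name__ == "__main__":
    segs = [(0, 16), (8, 24), (100, 116), (208, 240)]
    gts = [(0, 20), (200, 232)]
    print(proposal_labels(segs, gts))
```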

Reduces the number of operations conducted on background segments.
2) The classification network

label: background and action k:{1,...,K}
In order to balance the number of training data for each class, this paper reduces the number of background instances to

Better performance
3) The localization network

The localization network is trained with a new loss function, which takes the IoU with the ground-truth instance into consideration.
The new loss function is formed by combining L(softmax) and L(overlap):
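These notes do not record the formula itself, so the following is only a sketch of how a softmax term and an overlap term might be combined. The weighted sum L = L(softmax) + lam * L(overlap), the exact form of the overlap term, and the default values of lam and alpha are all assumptions made for illustration; the paper should be consulted for the actual definition.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def combined_loss(scores, labels, overlaps, lam=1.0, alpha=0.25):
    """Mean of softmax loss plus lam times an overlap term (assumed form).

    scores:   per-segment class scores, index 0 = background
    labels:   true class per segment (0 = background, 1..K = action)
    overlaps: IoU of each segment with its ground-truth instance
    """
    n = len(labels)
    l_softmax = 0.0
    l_overlap = 0.0
    for s, k, v in zip(scores, labels, overlaps):
        p = softmax(s)
        l_softmax += -math.log(p[k])
        if k > 0:
            # Assumed overlap term: a high score on a low-IoU segment costs more.
            l_overlap += 0.5 * (p[k] ** 2) / (v ** alpha) - 1.0
    return (l_softmax + lam * l_overlap) / n


if __name__ == "__main__":
    scores = [[0.2, 2.0, -1.0], [1.5, 0.1, 0.3]]
    print(combined_loss(scores, labels=[1, 0], overlaps=[0.8, 0.0]))
```

In this sketch the overlap term only applies to action segments, so background segments contribute the plain softmax loss.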


Better performance