[S-CNN] Temporal Action Localization in Untrimmed Videos via Multi-stage CNNs (2018-03-29)

# Temporal Action Localization in Untrimmed Videos via Multi-stage CNNs

# Abstract

This paper exploits the effectiveness of deep networks in temporal action localization via three segment-based 3D ConvNets:

(1) a proposal network -- identifies candidate segments in a long video that may contain actions

(2) a classification network (very important for training) -- serves as initialization for the localization network

(3) a localization network -- fine-tunes the learned classification network to localize each action instance

# Detailed descriptions of Segment-CNN

![model](https://user-images.githubusercontent.com/30540446/38159754-98adabd6-34e1-11e8-985d-115dc4a83888.png)


## Multi-scale segment generation

![multi-scale segment generation](https://user-images.githubusercontent.com/30540446/38159760-b23e7c56-34e1-11e8-93e6-1076646b65dc.png)


Each frame is resized to 171 × 128 pixels.

For an untrimmed video X, the paper slides temporal windows of varied lengths (16, 32, 64, 128, 256, and 512 frames) with 75% overlap, and samples 16 frames from each window.
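The window generation described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names are hypothetical, the stride is derived from the 75% overlap (a quarter of the window length), and the 16 frames are assumed to be sampled uniformly within each window.

```python
from typing import List, Tuple

WINDOW_LENGTHS = (16, 32, 64, 128, 256, 512)  # window lengths from the notes
OVERLAP = 0.75                                 # 75% overlap between neighbours
FRAMES_PER_SEGMENT = 16                        # frames kept per window


def sliding_windows(num_frames: int) -> List[Tuple[int, int]]:
    """Return (start, end) frame ranges, end exclusive, for every window length."""
    windows = []
    for length in WINDOW_LENGTHS:
        stride = int(length * (1 - OVERLAP))  # e.g. a 16-frame window moves by 4
        for start in range(0, num_frames - length + 1, stride):
            windows.append((start, start + length))
    return windows


def sample_frames(start: int, end: int, n: int = FRAMES_PER_SEGMENT) -> List[int]:
    """Pick n uniformly spaced frame indices from [start, end) (assumed scheme)."""
    step = (end - start) / n
    return [start + int(i * step) for i in range(n)]
```

With this scheme every window, whatever its length, becomes a 16-frame segment, so all scales can be fed to the same network input size.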

## Network architecture

The networks use C3D as the basic architecture in all stages:

conv1a(64) - pool1(1,1) -

-conv2a(128) - pool2(2,2) -

-conv3a(256) - conv3b(256) - pool3(2,2) -

-conv4a(512) - conv4b(512) - pool4(2,2) - 

-conv5a(512) - conv5b(512) - pool5(2,2) - 

-fc6(4096) - fc7(4096) - fc8(K+1)

Each input to the network is a segment s of dimension 171 × 128 × 16.
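The pooling notation in the layer list is not explained in these notes. Assuming each pair means (temporal kernel size, temporal stride), the short sketch below traces how the 16-frame temporal axis shrinks through the pooling layers. The `POOLS` table and the function name are illustrative only.

```python
# (temporal kernel, temporal stride) per pooling layer, read from the layer list.
# Interpreting the pair this way is an assumption.
POOLS = {
    "pool1": (1, 1),
    "pool2": (2, 2),
    "pool3": (2, 2),
    "pool4": (2, 2),
    "pool5": (2, 2),
}


def temporal_lengths(num_frames: int = 16) -> dict:
    """Temporal length after each pooling layer (no padding assumed)."""
    lengths = {}
    t = num_frames
    for name, (kernel, stride) in POOLS.items():
        t = (t - kernel) // stride + 1  # standard pooling output-size formula
        lengths[name] = t
    return lengths
```

Under this reading, a 16-frame segment is reduced to a single temporal step by pool5, before the fully connected layers fc6 to fc8.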

# Training procedure and impact of individual networks

## Compare S-CNN / S-CNN(w/o proposal) / S-CNN(w/o classification) / S-CNN(w/o localization)

## 1) The proposal network

![propose](https://user-images.githubusercontent.com/30540446/38159774-145f2f34-34e2-11e8-8dc0-89b5018b7b9c.png)


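Label k ∈ {0, 1}

For each segment of a trimmed video, set its label as positive.

For candidate segments from an untrimmed video, assign a label by comparing each segment with each ground-truth instance using IoU: positive if IoU > 0.7, or if the segment has the largest IoU with that instance and IoU > 0.5; negative (background) if IoU < 0.3.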
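The labeling rule above can be illustrated with a small sketch. It follows one reading of the notes: a segment is negative only if its IoU with every ground-truth instance is below 0.3, and segments that satisfy neither rule are left unlabeled (`None`). The function names and that tie-handling choice are assumptions for illustration.

```python
def temporal_iou(seg, gt):
    """IoU of two temporal intervals given as (start, end)."""
    inter = max(0.0, min(seg[1], gt[1]) - max(seg[0], gt[0]))
    union = (seg[1] - seg[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def label_segments(segments, ground_truths):
    """Return 1 (action), 0 (background) or None (unused) per segment."""
    ious = [[temporal_iou(s, g) for g in ground_truths] for s in segments]
    labels = [None] * len(segments)

    # Rule 1 and rule 3: threshold on the best IoU over all ground truths.
    for i, row in enumerate(ious):
        best = max(row, default=0.0)
        if best > 0.7:
            labels[i] = 1
        elif best < 0.3:
            labels[i] = 0

    # Rule 2: the segment with the largest IoU for a ground truth is
    # positive as long as that IoU exceeds 0.5.
    for j in range(len(ground_truths)):
        if not segments:
            break
        i_best = max(range(len(segments)), key=lambda i: ious[i][j])
        if ious[i_best][j] > 0.5:
            labels[i_best] = 1
    return labels
```

For example, with ground truths `(0, 20)` and `(200, 230)`, the segment `(205, 225)` has IoU of about 0.67, below 0.7, but is still labeled positive because it is the best match for the second instance.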

![scnn propose](https://user-images.githubusercontent.com/30540446/38159778-379a39ee-34e2-11e8-9dc2-ff4ae32faa3b.png)


The proposal network reduces the number of operations spent on background segments.

## 2) The classification network


![classification](https://user-images.githubusercontent.com/30540446/38159781-4f6546e0-34e2-11e8-865f-46855dfb442a.png)

Labels: background and action classes k ∈ {1, ..., K}

To balance the amount of training data for each class, the paper reduces the number of background instances.


![scnn 2](https://user-images.githubusercontent.com/30540446/38159784-6a107942-34e2-11e8-85e3-d37bbc126c97.png)

Including the classification network gives better performance than S-CNN (w/o classification).

## 3) The localization network 


![location](https://user-images.githubusercontent.com/30540446/38159787-77261524-34e2-11e8-8d24-aee475fea25f.png)


The localization network is trained with a new loss function that takes the IoU with the ground-truth instance into consideration.

The new loss function combines L(softmax) and L(overlap):



![loss](https://user-images.githubusercontent.com/30540446/38159790-9d75cefe-34e2-11e8-9be6-fa03ad7effd7.png)
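The exact expression is in the figure above. As a rough sketch only, assume the combined loss is L = L(softmax) + λ · L(overlap), where the overlap term for an action segment is ½ · (p² / v^α − 1), with p the predicted probability of the true class and v the segment's IoU with its ground truth, and background segments contribute nothing. That form, the defaults `lam=1.0` and `alpha=0.25`, and the function name are assumptions that should be checked against the figure and the paper.

```python
import math


def scnn_loss(probs, labels, overlaps, lam=1.0, alpha=0.25):
    """Average softmax loss plus an IoU-aware overlap term (assumed form).

    probs:    per-segment class probabilities, index 0 = background
    labels:   true class index per segment (0 = background)
    overlaps: IoU of each segment with its ground-truth instance
    """
    n = len(labels)
    l_softmax = 0.0
    l_overlap = 0.0
    for p, k, v in zip(probs, labels, overlaps):
        p_true = p[k]
        l_softmax += -math.log(p_true)
        if k > 0:  # the overlap term applies to action segments only
            l_overlap += 0.5 * (p_true ** 2 / v ** alpha - 1.0)
    return (l_softmax + lam * l_overlap) / n
```

Under this form, two segments with the same predicted probability are penalized differently: the one with lower IoU incurs a larger loss, which pushes the network to score well-aligned segments higher.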

![scnn 2](https://user-images.githubusercontent.com/30540446/38159784-6a107942-34e2-11e8-85e3-d37bbc126c97.png)

Including the localization network gives better performance than S-CNN (w/o localization).








