
[S-CNN] Temporal Action Localization in Untrimmed Videos via Multi-stage CNNs (2018-03-29) #2

@jessiSYJ

Temporal Action Localization in Untrimmed Videos via Multi-stage CNNs

Abstract

This paper exploits the effectiveness of deep networks for temporal action localization via three segment-based 3D ConvNets:

(1) a proposal network -- identifies candidate segments in a long video that may contain actions

(2) a classification network (very important for training) -- serves as the initialization for the localization network

(3) a localization network -- fine-tunes the learned classification network to localize each action instance

Detailed descriptions of Segment-CNN

[Figure: Segment-CNN model]

Multi-scale segment generation

Each frame is resized to 171 × 128 pixels.

For an untrimmed video X, the paper slides temporal windows of lengths 16, 32, 64, 128, 256, and 512 frames with 75% overlap; each window is then uniformly sampled down to 16 frames to form a segment.
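The window generation described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `generate_segments` and `sample_frame_indices` are hypothetical, the stride is assumed to be 25% of the window length (which is what 75% overlap implies), and uniform sampling is assumed to take evenly spaced frame indices.

```python
# Hypothetical helper names; window lengths and overlap come from the notes above.
WINDOW_LENGTHS = (16, 32, 64, 128, 256, 512)
OVERLAP = 0.75
FRAMES_PER_SEGMENT = 16


def generate_segments(num_frames, lengths=WINDOW_LENGTHS, overlap=OVERLAP):
    """Return (start, end) frame-index pairs, with end exclusive."""
    segments = []
    for length in lengths:
        if length > num_frames:
            continue  # window does not fit into the video
        # 75% overlap means consecutive windows are shifted by 25% of their length.
        stride = max(1, int(length * (1 - overlap)))
        for start in range(0, num_frames - length + 1, stride):
            segments.append((start, start + length))
    return segments


def sample_frame_indices(start, end, num_samples=FRAMES_PER_SEGMENT):
    """Pick num_samples evenly spaced frame indices from [start, end)."""
    length = end - start
    return [start + (i * length) // num_samples for i in range(num_samples)]
```

For a 64-frame video this yields 13 windows of length 16, 5 of length 32, and 1 of length 64; a 32-frame window is reduced to every second frame.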

Network architecture

Their deep networks use C3D as the basic architecture in all stages.

conv1a(64) - pool1(1,1) -

-conv2a(128) - pool2(2,2) -

-conv3a(256) - conv3b(256) - pool3(2,2) -

-conv4a(512) - conv4b(512) - pool4(2,2) -

-conv5a(512) - conv5b(512) - pool5(2,2) -

-fc6(4096) - fc7(4096) - fc8(K+1)

Each input to the network is a segment s of dimension 171 × 128 × 16.
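To see how the 16-frame input collapses along the time axis, the pooling chain can be traced with a few lines of arithmetic. This sketch rests on two assumptions that the notes do not state: that pool(d, s) denotes the temporal kernel size and temporal stride, and that the convolution layers preserve temporal length (as with padded 3 × 3 × 3 kernels in C3D). The name `temporal_lengths` is hypothetical.

```python
# (name, temporal kernel size, temporal stride), read from the layer list above
# under the assumption that pool(d, s) describes the temporal axis only.
POOL_LAYERS = [
    ("pool1", 1, 1),
    ("pool2", 2, 2),
    ("pool3", 2, 2),
    ("pool4", 2, 2),
    ("pool5", 2, 2),
]


def temporal_lengths(num_frames=16, pools=POOL_LAYERS):
    """Return the temporal length after each pooling layer."""
    lengths = {}
    t = num_frames
    for name, kernel, stride in pools:
        # Standard unpadded pooling output size along one axis.
        t = (t - kernel) // stride + 1
        lengths[name] = t
    return lengths
```

Under these assumptions the 16 frames shrink to 16, 8, 4, 2, and finally 1 time step before fc6, so each segment produces a single K+1-way prediction at fc8.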

Training procedure and impact of individual networks

The paper compares the full S-CNN against three ablations: S-CNN (w/o proposal), S-CNN (w/o classification), and S-CNN (w/o localization).

1) The proposal network

[Figure: proposal network]

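Labels: k ∈ {0, 1} (background or action)

Every segment from a trimmed video is labeled positive.

Each candidate segment from an untrimmed video is labeled by its IoU with the ground truth instances: positive if the IoU exceeds 0.7, or if the segment has the largest IoU with a ground truth instance and that IoU exceeds 0.5; negative if the IoU is below 0.3.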
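The labeling rule above can be sketched as follows. This is an illustrative reading of the thresholds, not the authors' implementation: the names `temporal_iou` and `assign_proposal_labels` are hypothetical, segments are assumed to be (start, end) frame pairs, "below 0.3" is assumed to mean below 0.3 for every ground truth instance, and segments that match no rule are assumed to be left out of training (label `None`).

```python
def temporal_iou(seg, gt):
    """Intersection over union of two (start, end) intervals."""
    inter = max(0.0, min(seg[1], gt[1]) - max(seg[0], gt[0]))
    union = (seg[1] - seg[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def assign_proposal_labels(segments, ground_truths, hi=0.7, mid=0.5, lo=0.3):
    """Return 1 (action), 0 (background) or None (unused) per segment."""
    ious = [[temporal_iou(s, g) for g in ground_truths] for s in segments]
    labels = [None] * len(segments)
    for i, row in enumerate(ious):
        best = max(row, default=0.0)
        if best > hi:
            labels[i] = 1          # strong overlap with some ground truth
        elif best < lo:
            labels[i] = 0          # weak overlap with every ground truth
    # The best-matching segment of each ground truth is positive if IoU > mid.
    if segments:
        for j in range(len(ground_truths)):
            best_i = max(range(len(segments)), key=lambda i: ious[i][j])
            if ious[best_i][j] > mid:
                labels[best_i] = 1
    return labels
```

With ground truths (0, 100) and (200, 300), the segment (190, 350) has IoU 0.625 with the second instance; it misses the 0.7 threshold but is still positive because it is that instance's best match above 0.5.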

Result: the proposal network reduces the number of operations conducted on background segments.

2) The classification network

[Figure: classification network]

Labels: background plus K action classes, k ∈ {1, ..., K}

In order to balance the number of training instances for each class, this paper reduces the number of background instances to

Result: better performance than S-CNN (w/o classification).

3) The localization network

[Figure: localization network]

The localization network is trained with a new loss function that takes the IoU with the ground truth instance into account.

The new loss function is formed by combining L(softmax) and L(overlap):

[Figure: loss function]
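A minimal sketch of how such a combined loss might be computed. The notes only state that the loss combines L(softmax) and L(overlap); the specific form used here, L = L(softmax) + λ · L(overlap) with an overlap term of 0.5 · (p² / v^α − 1) applied to non-background segments, is an assumption based on the formulation usually quoted for this paper, and the defaults λ = 1 and α = 0.25 are likewise assumptions. The function name `localization_loss` is hypothetical.

```python
import math


def localization_loss(probs, labels, overlaps, lam=1.0, alpha=0.25):
    """Average softmax loss plus a weighted overlap loss.

    probs:    per-segment softmax outputs (lists of K+1 probabilities)
    labels:   true class index per segment, 0 meaning background
    overlaps: IoU of each segment with its ground truth instance
              (ignored for background segments)
    """
    n = len(labels)
    softmax_term = 0.0
    overlap_term = 0.0
    for p, k, v in zip(probs, labels, overlaps):
        p_true = p[k]                      # probability of the true class
        softmax_term += -math.log(p_true)  # standard cross-entropy
        if k > 0:
            # Assumed overlap term: the same score is penalized more
            # when the segment overlaps the ground truth less.
            overlap_term += 0.5 * (p_true ** 2 / v ** alpha - 1.0)
    return softmax_term / n + lam * overlap_term / n
```

The intended effect under this form is that, for equal classification confidence, a segment with lower IoU incurs a higher loss, which pushes prediction scores to reflect how well a segment is localized.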

Result: better performance than S-CNN (w/o localization).
