TALL:Temporal activity localization via language query(2018-03-28)

# Abstrace

![1](https://user-images.githubusercontent.com/30540446/38123512-fb7ea0b4-340d-11e8-860b-c02bc6d9c286.png)


This paper proposes a Cross-modal Temporal Regression Localizer (CTRL) to jointly model text query and video clips output alignment scores and action boundary regression results for candidate clipes.For evaluation,this paper builds Charades-STA based on Charades datasets.,even a more complex sentence queries in Charades-STA for test. 

# Model

![_20180330105210](https://user-images.githubusercontent.com/30540446/38123516-028ff236-340e-11e8-804f-a7d40627fcff.png)



## Visual Encoder

For one video clip , we consider itself ( as the central clip ) and its surrouning clips ( as context clips ) .We uniformly sample n frames from each clip , useing extractor to extract the central clip ,  for the context clips , we use a pooling layer to calculate a pre-context feature and a post-context feature.

## Sentence Encoder

LSTM.  

Off-the-shelf Skip-thought

## Multi-modal processing module

The input dmension of the FC layer is 2*d and the output is d

## Temporal Localization Regression Networks

Temporal localization regression network takes the multi-modal representation as input , and has two sibling output layers ,1) ailgnment score between setence and the video clip , 2) clip location regression offsets—parameterized one and unparameterized one(better performance).

# Training

## Loss function

We design a multi-task loss L 

![qq 20180330110958](https://user-images.githubusercontent.com/30540446/38123528-1905a290-340e-11e8-9587-bde55a2b843b.png)


## Sampling training examples

![sampling](https://user-images.githubusercontent.com/30540446/38123590-93747eac-340e-11e8-92c4-2f7c7d326b1e.png)


We use multi-scale temporal sliding windows with frames and 80% overlap ( at test time we only use coarsely sample clips)
aligning quest: 1) IOU 2) nLOU 3) one to one 


# Charades-STA

![dataset](https://user-images.githubusercontent.com/30540446/38123612-a89c89aa-340e-11e8-8be1-a226c5442926.png)


1) split one sentence to some sub-sentences by a set of conjunctions 

2) keywords maping 

3) human check  

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TALL:Temporal activity localization via language query(2018-03-28) #1

Abstrace

Model

Visual Encoder

Sentence Encoder

Multi-modal processing module

Temporal Localization Regression Networks

Training

Loss function

Sampling training examples

Charades-STA

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

TALL:Temporal activity localization via language query(2018-03-28) #1

Description

Abstrace

Model

Visual Encoder

Sentence Encoder

Multi-modal processing module

Temporal Localization Regression Networks

Training

Loss function

Sampling training examples

Charades-STA

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions