Abstract

This paper proposes a Cross-modal Temporal Regression Localizer (CTRL) that jointly models a text query and video clips, and outputs alignment scores and action boundary regression results for candidate clips. For evaluation, the paper builds Charades-STA on top of the Charades dataset; the test set also includes more complex sentence queries.
Model

Visual Encoder
For a given video clip, we consider the clip itself (the central clip) and its surrounding clips (the context clips). We uniformly sample n frames from each clip and use a feature extractor to compute the central-clip feature; for the context clips, a pooling layer computes a pre-context feature and a post-context feature.
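The context pooling above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes mean pooling, assumes the three features are concatenated, and reuses the central feature at video borders. The names `mean_pool`, `clip_with_context` and `n_ctx` are hypothetical.

```python
from typing import List

Vector = List[float]


def mean_pool(features: List[Vector]) -> Vector:
    """Average a list of equal-length feature vectors element-wise."""
    if not features:
        raise ValueError("need at least one feature vector to pool")
    dim = len(features[0])
    return [sum(f[i] for f in features) / len(features) for i in range(dim)]


def clip_with_context(clip_feats: List[Vector], center: int, n_ctx: int) -> Vector:
    """Concatenate [pre-context, central, post-context] features for one clip.

    clip_feats: per-clip features already produced by some extractor (hypothetical input).
    n_ctx: number of neighbouring clips pooled on each side (assumed hyperparameter).
    At video borders, where no neighbour exists, the central feature is reused (assumption).
    """
    central = clip_feats[center]
    pre = clip_feats[max(0, center - n_ctx):center] or [central]
    post = clip_feats[center + 1:center + 1 + n_ctx] or [central]
    return mean_pool(pre) + central + mean_pool(post)
```

With d-dimensional clip features the result has 3*d entries; a real model would typically project it back to a fixed size before fusing it with the sentence embedding.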
Sentence Encoder
An LSTM, or an off-the-shelf Skip-thought encoder.
Multi-modal processing module
The input dimension of the FC layer is 2*d and the output dimension is d.
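A minimal sketch of such an FC layer, assuming its 2*d input is the concatenation of a d-dimensional visual feature and a d-dimensional sentence feature. The weights are random placeholders, and `fc_layer` and `fuse` are hypothetical names.

```python
import random
from typing import List

Vector = List[float]


def fc_layer(x: Vector, weights: List[Vector], bias: Vector) -> Vector:
    """Plain fully connected layer: y = W x + b (no activation)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def fuse(visual: Vector, sentence: Vector, weights: List[Vector], bias: Vector) -> Vector:
    """Concatenate two d-dimensional vectors (2*d) and project back to d."""
    if len(visual) != len(sentence):
        raise ValueError("visual and sentence features must share dimension d")
    return fc_layer(visual + sentence, weights, bias)


d = 4
rng = random.Random(0)
# Randomly initialised d x 2d weight matrix; a trained model would learn these.
W = [[rng.uniform(-0.1, 0.1) for _ in range(2 * d)] for _ in range(d)]
b = [0.0] * d
fused = fuse([1.0] * d, [0.5] * d, W, b)
print(len(fused))
```

The printed length equals d, matching the stated 2*d to d mapping.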
Temporal Localization Regression Networks
The temporal localization regression network takes the multi-modal representation as input and has two sibling output layers: 1) an alignment score between the sentence and the video clip; 2) clip location regression offsets, in either a parameterized or a non-parameterized form (the non-parameterized form performs better).
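The two offset forms can be illustrated as below. The notes do not define them, so this sketch assumes the common reading: non-parameterized offsets are raw start and end differences, while parameterized offsets are a length-normalised centre shift plus a log length ratio, as in object-detection box regression. All function names are hypothetical.

```python
import math
from typing import Tuple

Span = Tuple[float, float]  # (start, end) in frames or seconds


def encode_nonparam(clip: Span, target: Span) -> Span:
    """Non-parameterized offsets: raw differences of start and end coordinates."""
    return (target[0] - clip[0], target[1] - clip[1])


def decode_nonparam(clip: Span, offsets: Span) -> Span:
    """Apply start/end offsets to the clip boundaries."""
    return (clip[0] + offsets[0], clip[1] + offsets[1])


def encode_param(clip: Span, target: Span) -> Span:
    """Parameterized offsets: centre shift scaled by clip length, and log length ratio."""
    clip_len, tgt_len = clip[1] - clip[0], target[1] - target[0]
    clip_ctr, tgt_ctr = (clip[0] + clip[1]) / 2, (target[0] + target[1]) / 2
    return ((tgt_ctr - clip_ctr) / clip_len, math.log(tgt_len / clip_len))


def decode_param(clip: Span, offsets: Span) -> Span:
    """Invert encode_param: recover (start, end) from centre and length offsets."""
    clip_len = clip[1] - clip[0]
    ctr = (clip[0] + clip[1]) / 2 + offsets[0] * clip_len
    length = clip_len * math.exp(offsets[1])
    return (ctr - length / 2, ctr + length / 2)
```

Both encodings are invertible, so either can serve as a regression target; they differ only in how the target is scaled relative to the clip length.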
Training
Loss function
We design a multi-task loss L that trains the alignment score and the clip location regression jointly.
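The notes do not spell out L, so the following is only one plausible form: a logistic alignment loss over a batch of N clip-sentence pairs (diagonal entries are the aligned pairs) plus a smooth L1 regression loss on the aligned pairs, combined as L = L_aln + alpha * L_reg. The weights `alpha`, `w_pos` and `w_neg` and their defaults are hypothetical.

```python
import math
from typing import List, Tuple


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def smooth_l1(x: float) -> float:
    """Quadratic near zero, linear for |x| >= 1."""
    return 0.5 * x * x if abs(x) < 1.0 else abs(x) - 0.5


def multitask_loss(
    scores: List[List[float]],                # scores[i][j]: clip i vs sentence j
    pred_offsets: List[Tuple[float, float]],  # predicted (start, end) offsets for pair i
    true_offsets: List[Tuple[float, float]],  # ground-truth offsets for pair i
    alpha: float = 1.0,                       # assumed weight of the regression term
    w_pos: float = 1.0,                       # assumed weight for aligned pairs
    w_neg: float = 1.0,                       # assumed weight for misaligned pairs
) -> float:
    n = len(scores)
    align = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:   # aligned pair: push the score up
                align += w_pos * softplus(-scores[i][j])
            else:        # misaligned pair: push the score down
                align += w_neg * softplus(scores[i][j])
    reg = sum(
        smooth_l1(ts - ps) + smooth_l1(te - pe)
        for (ps, pe), (ts, te) in zip(pred_offsets, true_offsets)
    )
    return (align + alpha * reg) / n
```

Only aligned pairs contribute to the regression term, since a boundary offset is undefined for a clip and sentence that do not match.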

Sampling training examples

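We use multi-scale temporal sliding windows, with window lengths measured in frames and 80% overlap between adjacent windows; at test time we use only coarsely sampled clips.
A sliding-window clip is aligned with a sentence according to three criteria: 1) IoU, 2) nIoL (non-Intersection over Length), 3) one-to-one assignment, so that each clip is aligned with only one sentence.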
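The sampling and alignment steps above can be sketched as follows. The thresholds `iou_min` and `niol_max`, the window lengths passed in, and the tie-breaking rule (keep the sentence with the highest IoU) are assumptions for illustration; the notes do not give them.

```python
from typing import Dict, List, Tuple

Span = Tuple[float, float]  # (start, end) in frames


def sliding_windows(num_frames: int, lengths: List[int], overlap: float = 0.8) -> List[Span]:
    """Enumerate windows of each length, adjacent windows sharing `overlap` of their length."""
    windows = []
    for length in lengths:
        stride = max(1, int(round(length * (1.0 - overlap))))
        for start in range(0, num_frames - length + 1, stride):
            windows.append((start, start + length))
    return windows


def intersection(a: Span, b: Span) -> float:
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def iou(a: Span, b: Span) -> float:
    inter = intersection(a, b)
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def niol(window: Span, gt: Span) -> float:
    """Non-Intersection over Length: fraction of the window lying outside the ground truth."""
    length = window[1] - window[0]
    return (length - intersection(window, gt)) / length


def is_aligned(window: Span, gt: Span, iou_min: float = 0.5, niol_max: float = 0.2) -> bool:
    # Threshold values are hypothetical defaults.
    return iou(window, gt) > iou_min and niol(window, gt) < niol_max


def assign_windows(windows: List[Span], sentences: List[Span]) -> Dict[int, int]:
    """One-to-one rule: map each window index to at most one sentence index (highest IoU)."""
    assignment = {}
    for w_idx, w in enumerate(windows):
        candidates = [(iou(w, s), s_idx) for s_idx, s in enumerate(sentences) if is_aligned(w, s)]
        if candidates:
            assignment[w_idx] = max(candidates)[1]
    return assignment
```

The nIoL check complements IoU: it rejects windows that overlap the annotated segment well but also contain a large share of unrelated frames.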
Charades-STA

- split each sentence into sub-sentences at a set of conjunctions
- keyword mapping
- human check
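The first step of this pipeline, splitting a description at conjunctions, might look like the sketch below. The conjunction set is hypothetical, since the notes do not list the one actually used; the keyword mapping and human check steps are not covered here.

```python
import re
from typing import List

# Hypothetical conjunction set; longer phrases come first so "and then" wins over "and".
CONJUNCTIONS = ("and then", "then", "and", "while")

_SPLIT_PATTERN = re.compile(
    r"\s*(?:,\s*)?\b(?:" + "|".join(re.escape(c) for c in CONJUNCTIONS) + r")\b\s*",
    re.IGNORECASE,
)


def split_sentence(sentence: str) -> List[str]:
    """Split a description into sub-sentences at the conjunctions listed above."""
    parts = _SPLIT_PATTERN.split(sentence.strip().rstrip("."))
    return [p.strip(" ,") for p in parts if p.strip(" ,")]
```

For example, `split_sentence("A person opens the door and then sits down.")` yields two sub-sentences, one per activity. Sub-sentences after the first lose their subject, which is one reason a manual check afterwards is useful.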