TALL: Temporal Activity Localization via Language Query (2018-03-28) #1

@jessiSYJ

Abstract

This paper proposes a Cross-modal Temporal Regression Localizer (CTRL) that jointly models the text query and video clips, and outputs alignment scores and action boundary regression results for candidate clips. For evaluation, the paper builds Charades-STA on top of the Charades dataset, and also tests on more complex sentence queries in Charades-STA.

Model

Visual Encoder

For each video clip, we consider the clip itself (the central clip) and its surrounding clips (the context clips). We uniformly sample n frames from each clip and use a feature extractor to compute the central clip's feature. For the context clips, we use a pooling layer to compute a pre-context feature and a post-context feature.
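The context pooling step can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes per-clip features are already extracted, uses mean pooling, and concatenates the three parts. The function name `clip_feature_with_context` and the `n_ctx` parameter are hypothetical.

```python
import numpy as np

def clip_feature_with_context(clip_feats, idx, n_ctx=1):
    """Build [pre-context, central, post-context] features for clip `idx`.

    clip_feats: (num_clips, d) array of per-clip features from some extractor.
    n_ctx: number of neighbouring clips pooled on each side (assumed value).
    """
    d = clip_feats.shape[1]
    pre = clip_feats[max(0, idx - n_ctx):idx]
    post = clip_feats[idx + 1:idx + 1 + n_ctx]
    # Mean pooling over context clips; use zeros when no context exists
    # (first or last clip of the video). Both choices are assumptions.
    pre_feat = pre.mean(axis=0) if len(pre) else np.zeros(d)
    post_feat = post.mean(axis=0) if len(post) else np.zeros(d)
    return np.concatenate([pre_feat, clip_feats[idx], post_feat])

# Toy example: 4 clips with 3-dimensional features.
feats = np.arange(12, dtype=float).reshape(4, 3)
visual = clip_feature_with_context(feats, idx=1, n_ctx=1)
first = clip_feature_with_context(feats, idx=0, n_ctx=1)
```

With `n_ctx=1` the result has dimension 3*d; a larger `n_ctx` changes only how many neighbours are pooled, not the output size.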

Sentence Encoder

Two options: an LSTM, or an off-the-shelf Skip-thought encoder.

Multi-modal processing module

The input dimension of the FC layer is 2*d and the output dimension is d.
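A small sketch of an FC layer with these dimensions, assuming the visual and sentence vectors have each already been projected to dimension d and are concatenated before the layer. The weights here are random placeholders and `fc_fusion` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                      # shared embedding size (toy value)
f_v = rng.standard_normal(d)               # visual feature, assumed already d-dimensional
f_s = rng.standard_normal(d)               # sentence feature, assumed already d-dimensional
W = rng.standard_normal((d, 2 * d)) * 0.1  # FC weights: maps 2*d -> d
b = np.zeros(d)                            # FC bias

def fc_fusion(f_v, f_s, W, b):
    """Concatenate the two modalities (2*d) and map them back to d."""
    return W @ np.concatenate([f_v, f_s]) + b

fused = fc_fusion(f_v, f_s, W, b)
```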

Temporal Localization Regression Networks

The temporal localization regression network takes the multi-modal representation as input and has two sibling output layers: 1) an alignment score between the sentence and the video clip, and 2) clip location regression offsets. The offsets can be parameterized or unparameterized; the unparameterized form performs better.
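To make the two offset styles concrete, here is a hedged sketch. It assumes the unparameterized offsets are plain start and end differences, and the parameterized offsets are a length-normalized center shift plus a log length ratio (the usual box-regression style). All function names are hypothetical.

```python
import math

def unparameterized_offsets(clip, gt):
    """Start and end differences between ground truth and clip."""
    (cs, ce), (gs, ge) = clip, gt
    return gs - cs, ge - ce

def parameterized_offsets(clip, gt):
    """Center shift normalized by clip length, and log length ratio (assumed form)."""
    (cs, ce), (gs, ge) = clip, gt
    c_len, g_len = ce - cs, ge - gs
    c_ctr, g_ctr = (cs + ce) / 2, (gs + ge) / 2
    return (g_ctr - c_ctr) / c_len, math.log(g_len / c_len)

def apply_unparameterized(clip, offsets):
    """Refine clip boundaries with predicted start and end offsets."""
    return clip[0] + offsets[0], clip[1] + offsets[1]

clip, gt = (10.0, 20.0), (12.0, 24.0)
raw = unparameterized_offsets(clip, gt)   # start moves by 2, end moves by 4
par = parameterized_offsets(clip, gt)
refined = apply_unparameterized(clip, raw)
```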

Training

Loss function

We design a multi-task loss L that combines an alignment loss with a location regression loss.
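A loss of this shape can be written as below. This is a sketch from my reading of the paper rather than a verbatim copy: cs_{i,j} denotes the alignment score between clip i and sentence j in a batch of N aligned pairs, t and t* are predicted and ground-truth offsets, R is the smooth L1 function, and the weights α, α_c, α_w are hyperparameters. Treat the exact symbols as assumptions.

```latex
L = L_{aln} + \alpha L_{reg}

L_{aln} = \frac{1}{N} \sum_{i=1}^{N} \Big[ \alpha_c \log\big(1 + \exp(-cs_{i,i})\big)
          + \sum_{j \neq i} \alpha_w \log\big(1 + \exp(cs_{i,j})\big) \Big]

L_{reg} = \frac{1}{N} \sum_{i=1}^{N} \Big[ R\big(t^{*}_{x,i} - t_{x,i}\big) + R\big(t^{*}_{y,i} - t_{y,i}\big) \Big]
```

The alignment term pushes scores of matched pairs up and scores of mismatched pairs within the batch down; the regression term is computed only on matched pairs.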

Sampling training examples

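We use multi-scale temporal sliding windows (window sizes measured in frames) with 80% overlap. At test time we only use coarsely sampled clips.
Criteria for aligning a clip with a sentence: 1) IoU, 2) nIoL (non-Intersection over Length), 3) one-to-one matching (each clip is aligned with only one sentence).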
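The sampling and alignment checks can be sketched like this. The window sizes in the example are hypothetical, and nIoL is taken here to mean the fraction of the clip that falls outside the ground-truth interval; both are assumptions for illustration.

```python
def sliding_windows(num_frames, window_sizes, overlap=0.8):
    """Enumerate (start, end) frame windows for each size with the given overlap."""
    windows = []
    for size in window_sizes:
        if size > num_frames:
            continue  # skip windows longer than the video
        stride = max(1, int(round(size * (1 - overlap))))
        for start in range(0, num_frames - size + 1, stride):
            windows.append((start, start + size))
    return windows

def intersection(a, b):
    """Length of the temporal overlap between two intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def iou(a, b):
    """Intersection over Union of two temporal intervals."""
    inter = intersection(a, b)
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def niol(clip, gt):
    """Fraction of the clip length not covered by the ground-truth interval."""
    length = clip[1] - clip[0]
    return (length - intersection(clip, gt)) / length

# Hypothetical window sizes on a 30-frame toy video.
windows = sliding_windows(30, window_sizes=[10, 20], overlap=0.8)
```

A clip would then count as a positive for a sentence when its IoU is high and its nIoL is low; the thresholds are left out here because the notes above do not state them.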

Charades-STA

  1. Split each sentence into sub-sentences using a set of conjunctions

  2. Keyword mapping

  3. Human check
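Step 1 above can be illustrated with a small regex-based splitter. The conjunction set and the function name `split_by_conjunctions` are hypothetical; the notes do not list the conjunctions actually used.

```python
import re

# Hypothetical conjunction set, for illustration only.
CONJUNCTIONS = ("and then", "then", "while", "and")

def split_by_conjunctions(sentence, conjunctions=CONJUNCTIONS):
    """Split a description into sub-sentences at the given conjunctions."""
    # Try longer conjunctions first so "and then" wins over "and" or "then".
    ordered = sorted(conjunctions, key=len, reverse=True)
    alternatives = "|".join(re.escape(c) for c in ordered)
    pattern = r"\s*,?\s*\b(?:" + alternatives + r")\b\s*"
    parts = re.split(pattern, sentence.strip().rstrip("."), flags=re.IGNORECASE)
    return [p.strip() for p in parts if p.strip()]

subs = split_by_conjunctions(
    "A person opens the door and then walks into the room while holding a cup."
)
# subs == ["A person opens the door", "walks into the room", "holding a cup"]
```

Each sub-sentence would then go through the keyword mapping and human check steps.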
