Actor and Action Video Segmentation from a Sentence [2018-08-02]

# Actor and Action Video Segmentation from a Sentence
https://arxiv.org/pdf/1803.07485.pdf

## Textual Encoder


#### Word2Vec:

uses the word2vec model pretrained on 'GoogleNews'

each word = 300-dimensional vector

each sentence is padded to the same size (e.g. 15x300)
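The padding step can be sketched as follows. This is a minimal illustration, not the authors' code: the notes do not say how padding is done, so zero-padding at the end and truncation of longer sentences are assumptions, and `pad_sentence` and `fake_vectors` are hypothetical names (real vectors would come from the pretrained model).

```python
EMBED_DIM = 300   # word2vec GoogleNews dimensionality, per the notes
MAX_WORDS = 15    # example padded length from the notes

def pad_sentence(word_vectors, max_words=MAX_WORDS, dim=EMBED_DIM):
    """Truncate or zero-pad a list of word vectors to max_words rows.

    Zero-padding and truncation are assumptions; the notes only say
    that sentences are padded to a common size.
    """
    rows = [list(v) for v in word_vectors[:max_words]]
    for v in rows:
        if len(v) != dim:
            raise ValueError("expected %d-dim vectors, got %d" % (dim, len(v)))
    # Append all-zero rows until the sentence has max_words rows.
    rows.extend([0.0] * dim for _ in range(max_words - len(rows)))
    return rows

# Stand-in embeddings; a real pipeline would look these up in the pretrained model.
sentence = "the dog is running".split()
fake_vectors = [[float(i + 1)] * EMBED_DIM for i, _ in enumerate(sentence)]
padded = pad_sentence(fake_vectors)
print(len(padded), len(padded[0]))
```

Every sentence then has the same 15x300 shape, so sentences can be batched regardless of their original length.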

#### CNN:

* details:

  temporal filter size = 2x2
  
  channels = 300 (same as the word2vec representation)

* ablation study:

  51.8 for LSTM
  
  52.1 for bi-LSTM
  
  53.6 for CNN
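A rough sketch of a convolutional sentence encoder of this kind, under stated assumptions: a single 1D convolution that slides over word positions with a width of 2 words, followed by ReLU and max-pooling over time. The ReLU, the max-pooling, and reading the filter size as 2 word positions are my assumptions, not something these notes state. `temporal_conv_encode` is a hypothetical name, and the toy sizes are far smaller than the 300 channels mentioned above.

```python
import random

def temporal_conv_encode(sentence, weights, bias):
    """Encode a sentence with a 1D convolution over word positions.

    sentence: T x D nested list (T words, D-dim embeddings)
    weights:  C x K x D nested list (C output channels, K words per window)
    bias:     list of C floats
    Returns a C-dim vector: max over time of ReLU(conv response).
    """
    T, K = len(sentence), len(weights[0])
    D = len(sentence[0])
    encoded = []
    for w_c, b_c in zip(weights, bias):
        best = 0.0  # starting at 0 applies the ReLU floor to the max over time
        for t in range(T - K + 1):
            resp = b_c + sum(
                w_c[k][d] * sentence[t + k][d]
                for k in range(K) for d in range(D)
            )
            best = max(best, resp)
        encoded.append(best)
    return encoded

# Toy sizes for readability; the notes use 300 channels and e.g. 15 words.
random.seed(0)
T, D, C, K = 5, 4, 4, 2
toy_sentence = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(T)]
toy_weights = [[[random.uniform(-1, 1) for _ in range(D)] for _ in range(K)]
               for _ in range(C)]
toy_bias = [0.0] * C
sentence_vec = temporal_conv_encode(toy_sentence, toy_weights, toy_bias)
print(len(sentence_vec))
```

The output is one fixed-length vector per sentence, independent of how many words the window slid over.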

## Video Encoder

#### I3D Two-stream

* details:

  I3D last max-pooling layer --> average pooling over the temporal dimension --> L2 normalization at each spatial position of the feature map
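The last two steps of that pipeline (temporal average pooling, then per-position L2 normalization) can be sketched as below. This is only an illustration on nested lists: the `T x H x W x C` layout, the `eps` guard against division by zero, and the name `pool_and_normalize` are assumptions, and the I3D network itself is not modeled.

```python
import math

def pool_and_normalize(features, eps=1e-12):
    """Average over time, then L2-normalize the channel vector at each position.

    features: T x H x W x C nested lists (layout is an assumption)
    Returns:  H x W x C nested lists with unit L2 norm per spatial position
              (all-zero positions stay zero thanks to eps).
    """
    T = len(features)
    H, W, C = len(features[0]), len(features[0][0]), len(features[0][0][0])
    out = []
    for y in range(H):
        row = []
        for x in range(W):
            # Temporal average pooling for this spatial position.
            mean = [sum(features[t][y][x][c] for t in range(T)) / T
                    for c in range(C)]
            # L2 normalization across channels.
            norm = math.sqrt(sum(v * v for v in mean))
            row.append([v / max(norm, eps) for v in mean])
        out.append(row)
    return out

# Two frames, a 1x2 spatial grid, two channels.
toy_features = [
    [[[3.0, 0.0], [0.0, 0.0]]],
    [[[3.0, 8.0], [0.0, 0.0]]],
]
pooled = pool_and_normalize(toy_features)
print(pooled[0][0])
```

At position (0, 0) the temporal mean is `[3, 4]`, whose norm is 5, so the normalized vector has unit length.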

* ablation study:

  49.5 for flow_only
  
  53.6 for RGB_only
  
  55.1 for two-stream
  
##### tanh() is better

## Decoding with dynamic filters

![r676tcmbbu 6kuqfd l j i](https://user-images.githubusercontent.com/30540446/43572378-f32f3204-9671-11e8-9feb-36f5c42ccd3d.png)


Bottom-up / top-down?

![fgcts t47 sjj_6i aeoaay](https://user-images.githubusercontent.com/30540446/43572338-dada6e94-9671-11e8-8801-02a7f240fe00.png)
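As a rough illustration of the general dynamic-filter idea (filters are generated from the sentence encoding rather than learned as fixed weights, then convolved with the visual features), here is a minimal sketch. Everything in it is an assumption rather than the paper's exact formulation: a single linear layer with a tanh activation generates one coefficient per visual channel, and the filter is applied as a 1x1 convolution. `make_dynamic_filter` and `apply_dynamic_filter` are hypothetical names.

```python
import math

def make_dynamic_filter(text_vec, weight, bias):
    """Map a sentence vector to one filter coefficient per visual channel.

    Uses tanh(weight @ text_vec + bias); the linear layer and the tanh
    placement are assumptions made for this sketch.
    """
    return [
        math.tanh(b + sum(w * t for w, t in zip(row, text_vec)))
        for row, b in zip(weight, bias)
    ]

def apply_dynamic_filter(visual, dyn_filter):
    """Apply the filter as a 1x1 convolution.

    visual: H x W x C nested lists -> H x W response map.
    """
    return [
        [sum(f * v for f, v in zip(dyn_filter, pixel)) for pixel in row]
        for row in visual
    ]

# Toy example: 2-dim sentence vector, 1x2 feature map with 2 channels.
text_vec = [0.5, -0.5]
weight = [[1.0, 0.0], [0.0, 1.0]]
bias = [0.0, 0.0]
visual = [[[1.0, 0.0], [0.0, 1.0]]]
dyn_filter = make_dynamic_filter(text_vec, weight, bias)
response = apply_dynamic_filter(visual, dyn_filter)
print(response)
```

Because tanh bounds each coefficient to (-1, 1), the response at each position is a bounded weighted sum of the visual channels, and a different sentence yields a different filter for the same video features.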



  
