Actor and Action Video Segmentation from a Sentence
https://arxiv.org/pdf/1803.07485.pdf
Textual Encoder
Word2Vec:
uses the model pretrained on 'GoogleNews'
each word = 300-dimensional vector
each sentence is padded to the same size (e.g. 15x300)
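The padding step above can be sketched as follows. This is a minimal illustration, not the paper's code: `embed_sentence` and `toy_lookup` are hypothetical names, the random vectors stand in for the pretrained GoogleNews model, and zero-padding, truncation, and zero vectors for unknown words are assumptions.

```python
import numpy as np

EMBED_DIM = 300   # word2vec vectors are 300-dimensional (from the notes)
MAX_WORDS = 15    # example sentence length from the notes (15x300)

def embed_sentence(words, lookup, max_words=MAX_WORDS, dim=EMBED_DIM):
    """Stack word vectors and zero-pad (or truncate) to a (max_words, dim) matrix."""
    out = np.zeros((max_words, dim), dtype=np.float32)
    for i, word in enumerate(words[:max_words]):
        # Assumption: out-of-vocabulary words stay as zero vectors.
        out[i] = lookup.get(word, np.zeros(dim, dtype=np.float32))
    return out

# Hypothetical stand-in for the pretrained model: one random vector per word.
rng = np.random.default_rng(0)
toy_lookup = {w: rng.standard_normal(EMBED_DIM).astype(np.float32)
              for w in ["a", "dog", "is", "running"]}

sentence = "a dog is running".split()
matrix = embed_sentence(sentence, toy_lookup)
print(matrix.shape)  # (15, 300)
```

With the real model, `toy_lookup` would be replaced by the loaded word2vec vectors; the padding logic stays the same.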
CNN:
details:
temporal filter size = 2x2
channels = 300 (same as the word2vec representation)
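A rough sketch of such a text CNN, under stated assumptions: the filter is read here as a 1D window over 2 consecutive words, followed by ReLU and max-pooling over time to get one 300-dimensional sentence vector. The function name `text_cnn`, the random weights, and the pooling choice are illustrative, not taken from the paper.

```python
import numpy as np

def text_cnn(sentence, weight, bias):
    """sentence: (T, D) word vectors; weight: (K, D, C); bias: (C,).
    Returns a (C,) sentence vector."""
    T, _ = sentence.shape
    K = weight.shape[0]
    # Slide a window of K consecutive words over the sentence (no padding).
    windows = np.stack([sentence[t:t + K] for t in range(T - K + 1)])  # (T-K+1, K, D)
    conv = np.einsum("tkd,kdc->tc", windows, weight) + bias            # (T-K+1, C)
    relu = np.maximum(conv, 0.0)
    # Assumption: max-pool over the time axis to get one vector per sentence.
    return relu.max(axis=0)

rng = np.random.default_rng(0)
sentence = rng.standard_normal((15, 300)).astype(np.float32)            # padded sentence
weight = (0.01 * rng.standard_normal((2, 300, 300))).astype(np.float32)  # width 2, 300 channels
bias = np.zeros(300, dtype=np.float32)

vec = text_cnn(sentence, weight, bias)
print(vec.shape)  # (300,)
```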
ablation study:
51.8 for LSTM
52.1 for bi-LSTM
53.6 for CNN
Video Encoder
I3D two-stream
last max-pooling layer of I3D --> average pooling over the temporal dimension --> L2 normalization at each spatial position of the feature map
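The pooling and normalization step can be sketched like this. The (T, H, W, C) layout, the toy tensor sizes, and the name `pool_and_normalize` are assumptions for illustration; real I3D activations would replace the random array.

```python
import numpy as np

def pool_and_normalize(features, eps=1e-12):
    """features: (T, H, W, C) activations from the video encoder.
    Returns (H, W, C) with unit L2 norm at every spatial position."""
    pooled = features.mean(axis=0)                          # average over time
    norms = np.linalg.norm(pooled, axis=-1, keepdims=True)  # one norm per (h, w)
    return pooled / np.maximum(norms, eps)                  # eps guards against division by zero

rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 8, 8, 32))  # hypothetical toy shape, not the real I3D size
out = pool_and_normalize(feats)
print(out.shape)  # (8, 8, 32)
```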
ablation study:
49.5 for flow_only
53.6 for RGB_only
55.1 for two-stream
tanh() is better
Decoding with dynamic filters
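A generic sketch of dynamic filtering, not necessarily the exact layer in the paper: a filter is generated from the sentence vector and applied as a 1x1 convolution over the visual feature map, giving one response value per spatial position. The tanh on the generated filter, the single-filter setup, all shapes, and the names `dynamic_filter_response`, `W_f`, and `b_f` are assumptions.

```python
import numpy as np

def dynamic_filter_response(visual, sentence_vec, W_f, b_f):
    """visual: (H, W, C) feature map; sentence_vec: (D,);
    W_f: (C, D) and b_f: (C,) map the sentence to a filter.
    Returns an (H, W) response map."""
    filt = np.tanh(W_f @ sentence_vec + b_f)  # (C,) filter generated from the text
    # A 1x1 convolution is a dot product with the filter at each position.
    return visual @ filt

rng = np.random.default_rng(0)
visual = rng.standard_normal((8, 8, 32))   # hypothetical toy feature map
sentence_vec = rng.standard_normal(300)    # e.g. output of the textual encoder
W_f = 0.05 * rng.standard_normal((32, 300))
b_f = np.zeros(32)

response = dynamic_filter_response(visual, sentence_vec, W_f, b_f)
print(response.shape)  # (8, 8)
```

Because the filter depends on the sentence, the same video features give a different response map for each query.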
bottom up top down?