Text-to-clip Video Retrieval with Early Fusion and Re-Captioning
https://arxiv.org/pdf/1804.05113.pdf
Segment Proposals

Input: video V --> encode all frames of V with C3D --> predict a relative segment R(center, length) --> extract C3D features for R
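A rough sketch of what "relative R(center, length)" can mean in practice. It assumes the R-C3D-style parameterization, where the network predicts offsets against a fixed anchor segment: a center shift scaled by the anchor length, and a log-scale length change. The function names and the anchor values are hypothetical, chosen only for illustration. A temporal IoU helper is included since proposals are usually matched to ground truth this way.

```python
import math


def decode_segment(anchor_center, anchor_length, d_center, d_length):
    """Turn relative (center, length) offsets into an absolute (start, end).

    Assumed parameterization (R-C3D style):
        center = anchor_center + d_center * anchor_length
        length = anchor_length * exp(d_length)
    """
    center = anchor_center + d_center * anchor_length
    length = anchor_length * math.exp(d_length)
    return center - length / 2.0, center + length / 2.0


def temporal_iou(seg_a, seg_b):
    """Intersection over union of two (start, end) segments on the time axis."""
    inter = max(0.0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    union = (seg_a[1] - seg_a[0]) + (seg_b[1] - seg_b[0]) - inter
    return inter / union if union > 0 else 0.0


# Hypothetical anchor centered at t=10 with length 4; the predicted offsets
# shift the center by half an anchor length and keep the length unchanged.
proposal = decode_segment(10.0, 4.0, 0.5, 0.0)
overlap = temporal_iou(proposal, (11.0, 15.0))
```

With these values the proposal covers the interval from 10 to 14, and its overlap with the hypothetical ground-truth segment (11, 15) is 3/5.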
loss function:

Example 1: R-C3D
[R-C3D: Region Convolutional 3D Network for Temporal Activity Detection.
https://arxiv.org/pdf/1703.07814.pdf]

Example 2:
[Jointly Localizing and Describing Events for Dense Video Captioning.
https://arxiv.org/pdf/1804.08274.pdf]

Early Fusion
word-by-word fusion: an LSTM reads the query one word at a time together with the clip feature and returns a similarity score
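A minimal sketch of the word-by-word fusion idea, not the paper's exact architecture. It assumes a single LSTM layer whose input at every step is the word vector concatenated with the clip feature, and a sigmoid on the last hidden state as the similarity. The weights are random and untrained, and `TinyLSTMCell`, `early_fusion_similarity`, and all dimensions are hypothetical names and values used only to show the data flow.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyLSTMCell:
    """Single LSTM cell with randomly initialised (untrained) weights."""

    def __init__(self, input_size, hidden_size, seed=0):
        rng = random.Random(seed)
        self.hidden_size = hidden_size
        cols = input_size + hidden_size
        # One weight matrix and bias per gate: input, forget, output, candidate.
        self.W = {g: [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                      for _ in range(hidden_size)] for g in "ifog"}
        self.b = {g: [0.0] * hidden_size for g in "ifog"}

    def _affine(self, gate, z):
        return [sum(w * v for w, v in zip(row, z)) + b
                for row, b in zip(self.W[gate], self.b[gate])]

    def step(self, x, h, c):
        z = x + h  # concatenate the input with the previous hidden state
        i = [sigmoid(v) for v in self._affine("i", z)]
        f = [sigmoid(v) for v in self._affine("f", z)]
        o = [sigmoid(v) for v in self._affine("o", z)]
        g = [math.tanh(v) for v in self._affine("g", z)]
        c_new = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, g)]
        h_new = [ov * math.tanh(cv) for ov, cv in zip(o, c_new)]
        return h_new, c_new


def early_fusion_similarity(word_vectors, clip_feature, cell, w_out):
    """Feed [word ; clip] at every step and score the final hidden state."""
    h = [0.0] * cell.hidden_size
    c = [0.0] * cell.hidden_size
    for word in word_vectors:
        # Early fusion: the clip feature is seen alongside every word.
        h, c = cell.step(word + clip_feature, h, c)
    return sigmoid(sum(w * v for w, v in zip(w_out, h)))


# Toy inputs: three 2-d word vectors and one 3-d clip feature.
words = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5]]
clip = [0.4, -0.2, 0.7]
cell = TinyLSTMCell(input_size=5, hidden_size=4)
score = early_fusion_similarity(words, clip, cell, w_out=[0.5, -0.5, 0.5, -0.5])
```

The contrast is with late fusion, where the sentence and the clip would be embedded separately and compared only at the end; here the LSTM sees the visual feature while it reads each word.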

Caption loss
