Text-to-clip Video Retrieval with Early Fusion and Re-Captioning[2018-08-01] #3

@jessiSYJ

Text-to-clip Video Retrieval with Early Fusion and Re-Captioning

https://arxiv.org/pdf/1804.05113.pdf

Segment Proposals

Input: video V --> encode all frames in V using C3D --> predict relative segment proposals R (center, length) --> extract C3D features for each R
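
To make the "relative R(center, length)" step concrete, here is a minimal sketch of how relative offsets could be turned into absolute segment boundaries. It assumes the anchor-based parameterization used in R-C3D (center offset scaled by the anchor length, length offset in log space); the function names and the frame-index units are hypothetical and not taken from the paper's code.

```python
import math


def decode_segment(anchor_center, anchor_length, d_center, d_length):
    """Map relative offsets (d_center, d_length) on an anchor to (start, end).

    Assumed parameterization (R-C3D style):
        center = anchor_center + d_center * anchor_length
        length = anchor_length * exp(d_length)
    """
    center = anchor_center + d_center * anchor_length
    length = anchor_length * math.exp(d_length)
    return center - length / 2.0, center + length / 2.0


def encode_segment(anchor_center, anchor_length, start, end):
    """Inverse of decode_segment: regression targets for a ground-truth segment."""
    center = (start + end) / 2.0
    length = end - start
    d_center = (center - anchor_center) / anchor_length
    d_length = math.log(length / anchor_length)
    return d_center, d_length


# Zero offsets reproduce the anchor itself: centered at 10, 4 units long.
print(decode_segment(10.0, 4.0, 0.0, 0.0))  # (8.0, 12.0)
```

Predicting offsets relative to an anchor, rather than absolute times, keeps the regression targets in a small range regardless of where the segment sits in the video.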

loss function:

[image: loss function]

Example1: R-C3D

[R-C3D: Region Convolutional 3D Network for Temporal Activity Detection.
https://arxiv.org/pdf/1703.07814.pdf]

Example2:

[Jointly Localizing and Describing Events for Dense Video Captioning.
https://arxiv.org/pdf/1804.08274.pdf]

Early Fusion

word-by-word fusion: an LSTM fuses the clip feature with the query one word at a time and returns a similarity score
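
A rough sketch of the word-by-word idea, using only the standard library: at every step the current word vector is concatenated with the clip feature and fed to an LSTM cell, and the final hidden state is mapped to a score in (0, 1). Everything here is an assumption made for illustration: `TinyLSTMCell`, `early_fusion_similarity`, the single-cell design, the concatenation point, and the untrained random weights. The paper's actual architecture may inject the visual feature differently.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]


class TinyLSTMCell:
    """A single LSTM cell with random, untrained weights (illustration only)."""

    def __init__(self, input_size, hidden_size, rng):
        self.hidden_size = hidden_size
        n = input_size + hidden_size
        # One weight matrix per gate: input (i), forget (f), output (o), candidate (c).
        self.W = {
            gate: [[rng.uniform(-0.1, 0.1) for _ in range(n)]
                   for _ in range(hidden_size)]
            for gate in "ifoc"
        }

    def step(self, x, h, c):
        z = list(x) + list(h)  # the cell sees the input and the previous hidden state
        i = [sigmoid(v) for v in matvec(self.W["i"], z)]
        f = [sigmoid(v) for v in matvec(self.W["f"], z)]
        o = [sigmoid(v) for v in matvec(self.W["o"], z)]
        cand = [math.tanh(v) for v in matvec(self.W["c"], z)]
        c_new = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, cand)]
        h_new = [ov * math.tanh(cv) for ov, cv in zip(o, c_new)]
        return h_new, c_new


def early_fusion_similarity(word_vectors, clip_feature, cell, w_out):
    """Fuse the clip feature with each word in turn; return a score in (0, 1)."""
    h = [0.0] * cell.hidden_size
    c = [0.0] * cell.hidden_size
    for word in word_vectors:
        # Early fusion: vision and language are combined at every word step.
        h, c = cell.step(list(word) + list(clip_feature), h, c)
    return sigmoid(sum(w * v for w, v in zip(w_out, h)))


rng = random.Random(0)
cell = TinyLSTMCell(input_size=4 + 3, hidden_size=5, rng=rng)
w_out = [rng.uniform(-1.0, 1.0) for _ in range(5)]
query = [[0.1, 0.2, 0.0, 0.3], [0.0, 0.5, 0.1, 0.0], [0.4, 0.0, 0.2, 0.1]]
clip = [0.7, 0.1, 0.3]
print(early_fusion_similarity(query, clip, cell, w_out))
```

In contrast to late fusion, where the sentence and the clip are embedded separately and compared once at the end, here the clip feature can influence how each word is interpreted.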

Caption loss

[image: caption loss]
