Hi, thanks a lot for sharing your solid work. I have learned a lot from your paper and code. I still have a question about the temporal modeling part.
I saw that you compared the performance of Timesformer and XCLIP, and the results show that Timesformer works better. However, the XCLIP paper uses pretrained CLIP weights, and XCLIP finds a trade-off between preserving the performance of the pretrained CLIP weights and adding temporal modeling.
I would like to ask whether you have tested XCLIP with pretrained CLIP weights, and whether you have found a way to use both Timesformer's temporal modeling and the pretrained CLIP weights, which I think should beat XCLIP in theory. 😊