Evaluation Dataset

Hi! Thanks for presenting such solid work!

I am curious how your evaluation datasets are organized, e.g., how many videos were selected from the LOVEU-TGVE/UCF Sports Action datasets, and how the original and target prompts are defined. Could you share your organized evaluation dataset?