Hi,
I noticed there are known issues with the Emu Edit benchmark: some image-caption pairs appear incorrect (e.g., the malformed caption 'a train station in city'), and some examples have identical source and target captions. Given that, how should the clip_dir metric be calculated? How did you process the benchmark dataset?
Looking forward to your reply.
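For reference, here is a minimal sketch of how I currently understand clip_dir (CLIP directional similarity), assuming precomputed CLIP embeddings; the function name and inputs are my own for illustration, not taken from your code. It also shows why identical source/target captions are a problem: the caption direction vector becomes zero and the metric degenerates.

```python
import numpy as np

def clip_directional_similarity(img_src, img_tgt, txt_src, txt_tgt, eps=1e-8):
    """Cosine similarity between the image-embedding change and the
    caption-embedding change. All inputs are assumed to be precomputed
    CLIP embeddings as 1-D arrays (hypothetical interface)."""
    img_dir = np.asarray(img_tgt, dtype=float) - np.asarray(img_src, dtype=float)
    txt_dir = np.asarray(txt_tgt, dtype=float) - np.asarray(txt_src, dtype=float)
    # eps guards against a zero denominator; with identical captions,
    # txt_dir is all zeros and the result collapses to 0.
    denom = np.linalg.norm(img_dir) * np.linalg.norm(txt_dir) + eps
    return float(np.dot(img_dir, txt_dir) / denom)

# Aligned edit directions score near 1; identical captions score 0.
print(clip_directional_similarity([0.0, 0.0], [1.0, 0.0],
                                  [0.0, 0.0], [1.0, 0.0]))
print(clip_directional_similarity([0.0, 0.0], [1.0, 0.0],
                                  [0.5, 0.5], [0.5, 0.5]))
```

Is this roughly what your evaluation does, and do you filter out the degenerate pairs before computing it?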