Batch Inference (Captioning)

Thanks for your interesting work and for sharing the code.

The README only shows how to generate a caption for one image at a time (batch size = 1). @Yushi-Hu, could you explain how to generate captions in batches, passing multiple questions and their corresponding images in a single call, rather than calling the model once per image? Batching would make inference considerably more time-efficient.