About “3D dense captioning with ground truth bounding boxes”

Hi, I have a question about using maskvotenet to get the visual features of GT bboxes. In your code, you only extract the feature of one target object. Do you know how to get the features of all GT bboxes? Of course, the Scan2Cap task only needs the feature of one target object, but the visual grounding task needs the features of all GT bboxes. Thank you!