You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The code specify a cutting-edge vision-language model designed to detect objects based on textual input. We leverage 512-dimensional embeddings to replace the standard classification output of the detector. Since the available dataset lacks object-level text annotations, we introduce a novel training approach utilizing prompts with category names to effectively train our object detector. Furthermore, to optimize performance for arbitrary text inputs, we employ knowledge distillation with the state-of-the-art CLIP model to transfer object embedding knowledge. The integration of these techniques results in a highly robust and efficient vision-language object detection system.