Skip to content

KingsleyXW/Vision-Language-Model-for-Text-object-Retrieval

Repository files navigation

The code specify a cutting-edge vision-language model designed to detect objects based on textual input. We leverage 512-dimensional embeddings to replace the standard classification output of the detector. Since the available dataset lacks object-level text annotations, we introduce a novel training approach utilizing prompts with category names to effectively train our object detector. Furthermore, to optimize performance for arbitrary text inputs, we employ knowledge distillation with the state-of-the-art CLIP model to transfer object embedding knowledge. The integration of these techniques results in a highly robust and efficient vision-language object detection system.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors