🤖 AI Summary
In open-vocabulary object detection, user-specified natural language category names often degrade performance due to semantic ambiguity or incorrectness. To address this, we propose VocAda—a zero-training, plug-and-play vocabulary adapter that dynamically refines user-provided categories at inference time based on image content. First, a pre-trained vision-language model (e.g., BLIP or LLaVA) generates an image caption; second, dependency parsing extracts salient noun phrases; third, cross-modal similarity matching selects the most semantically relevant candidate categories. VocAda requires no gradients, fine-tuning, or additional trainable parameters. Evaluated on COCO and Objects365 with three state-of-the-art detectors, it consistently improves mean average precision (AP) by 1.2–2.8 points, significantly enhancing robustness to ambiguous or erroneous category inputs and generalization across diverse vocabularies. The code is publicly available.
📝 Abstract
Open-vocabulary object detection models allow users to freely specify a class vocabulary in natural language at test time, guiding the detection of desired objects. However, vocabularies can be overly broad or even mis-specified, hampering the overall performance of the detector. In this work, we propose a plug-and-play Vocabulary Adapter (VocAda) to refine the user-defined vocabulary, automatically tailoring it to categories that are relevant for a given image. VocAda does not require any training; it operates at inference time in three steps: i) it uses an image captioner to describe visible objects, ii) it parses nouns from those captions, and iii) it selects relevant classes from the user-defined vocabulary, discarding irrelevant ones. Experiments on COCO and Objects365 with three state-of-the-art detectors show that VocAda consistently improves performance, proving its versatility. The code is open source.
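The three inference-time steps can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the caption is assumed to come from a pre-trained captioner such as BLIP, the noun extraction stands in for dependency parsing with a naive token filter, and the cross-modal similarity matching is replaced by simple string overlap. All helper names here are hypothetical.

```python
import re

def extract_nouns(caption: str) -> set[str]:
    """Step ii (sketch): pull candidate nouns from a caption.
    The paper uses dependency parsing; here we naively keep alphabetic
    tokens and drop a few common stop/function words."""
    stop = {"a", "an", "the", "of", "on", "in", "with", "and",
            "is", "are", "next", "to", "sitting", "standing"}
    tokens = re.findall(r"[a-z]+", caption.lower())
    return {t for t in tokens if t not in stop}

def refine_vocabulary(caption: str, user_vocab: list[str]) -> list[str]:
    """Step iii (sketch): keep only user-defined classes that appear among
    the caption nouns. The paper instead scores cross-modal similarity
    between candidate classes and the image."""
    nouns = extract_nouns(caption)
    return [c for c in user_vocab if c.lower() in nouns]

# Step i is assumed: a captioner has already produced this description.
caption = "A dog sitting on a couch next to a laptop"
user_vocab = ["dog", "cat", "couch", "laptop", "airplane"]
print(refine_vocabulary(caption, user_vocab))  # → ['dog', 'couch', 'laptop']
```

The refined vocabulary (here `['dog', 'couch', 'laptop']`) would then be passed to the open-vocabulary detector in place of the full user-specified list, so the detector never scores classes the caption gives no evidence for.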