Test-time Vocabulary Adaptation for Language-driven Object Detection

📅 2025-05-31
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
In open-vocabulary object detection, user-specified natural language category names often degrade performance due to semantic ambiguity or incorrectness. To address this, we propose VocAda—a zero-training, plug-and-play vocabulary adapter that dynamically refines user-provided categories at inference time based on image content. First, a pre-trained vision-language model (e.g., BLIP or LLaVA) generates an image caption; second, dependency parsing extracts salient noun phrases; third, cross-modal similarity matching selects the most semantically relevant candidate categories. VocAda requires no gradients, fine-tuning, or additional trainable parameters. Evaluated on COCO and Objects365 with three state-of-the-art detectors, it consistently improves mean average precision (AP) by 1.2–2.8 points, significantly enhancing robustness to ambiguous or erroneous category inputs and generalization across diverse vocabularies. The code is publicly available.
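The three inference steps the summary describes can be sketched in plain Python. Everything here is a hypothetical stand-in for the paper's actual components: `caption_image` stands in for a pre-trained captioner (e.g. BLIP or LLaVA), `parse_nouns` for dependency parsing (e.g. with spaCy), and `embed` for a cross-modal text encoder; toy implementations are used so the control flow is self-contained and runnable.

```python
import math

def caption_image(image):
    # Stand-in for a vision-language captioner such as BLIP or LLaVA.
    return "a dog chasing a frisbee on a grassy field"

def parse_nouns(caption):
    # Stand-in for dependency parsing; a stop-word heuristic keeps
    # plausible noun tokens from the caption.
    stop = {"a", "an", "the", "on", "in", "of", "chasing", "grassy"}
    return [tok for tok in caption.split() if tok not in stop]

def embed(text):
    # Stand-in for a text encoder; a bag-of-characters vector is enough
    # to illustrate cosine-similarity matching.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def vocada_filter(image, user_vocab, threshold=0.6):
    """Return the subset of user_vocab judged relevant to the image."""
    nouns = parse_nouns(caption_image(image))   # steps i) and ii)
    noun_vecs = [embed(n) for n in nouns]
    kept = []
    for cls in user_vocab:                      # step iii)
        cls_vec = embed(cls)
        if max(cosine(cls_vec, nv) for nv in noun_vecs) >= threshold:
            kept.append(cls)
    return kept

# Classes absent from the caption are discarded before detection.
print(vocada_filter(None, ["dog", "frisbee", "airplane"]))
```

With the toy caption above, `dog` and `frisbee` pass the similarity threshold while `airplane` is filtered out; the pruned vocabulary is then handed to the downstream detector unchanged, which is why the method needs no gradients or fine-tuning. The threshold value is illustrative, not taken from the paper.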

📝 Abstract
Open-vocabulary object detection models allow users to freely specify a class vocabulary in natural language at test time, guiding the detection of desired objects. However, vocabularies can be overly broad or even mis-specified, hampering the overall performance of the detector. In this work, we propose a plug-and-play Vocabulary Adapter (VocAda) to refine the user-defined vocabulary, automatically tailoring it to the categories that are relevant for a given image. VocAda requires no training; it operates at inference time in three steps: i) it uses an image captioner to describe visible objects, ii) it parses nouns from those captions, and iii) it selects relevant classes from the user-defined vocabulary, discarding irrelevant ones. Experiments on COCO and Objects365 with three state-of-the-art detectors show that VocAda consistently improves performance, proving its versatility. The code is open source.
Problem

Research questions and friction points this paper is trying to address.

Refining overly broad user-defined vocabularies for object detection
Automatically tailoring vocabularies to relevant image categories
Improving detection performance without requiring training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Plug-and-play Vocabulary Adapter (VocAda)
Automatically refines user-defined vocabulary
Operates at inference time without training