🤖 AI Summary
This work addresses the limited general object understanding in open-vocabulary object detection (OVOD) by proposing GW-VLM, a training-free zero-shot detection method. GW-VLM leverages pre-trained vision-language models (VLMs) and large language models (LLMs) through a "guess-and-verify" mechanism: it employs Multi-Scale Visual Language Searching (MS-VLS) to generate candidate image regions and uses a Contextual Concept Prompt (CCP) to guide the LLM in semantic reasoning over them. This paradigm requires no fine-tuning or additional training, yet consistently outperforms state-of-the-art methods across multiple natural and remote sensing benchmarks, including COCO, Pascal VOC, DIOR, and NWPU-10, advancing the performance and generalizability of zero-shot open-vocabulary detection.
📝 Abstract
Open-Vocabulary Object Detection (OVOD) aims to develop the capability to detect anything. Although numerous large-scale pre-training efforts have built versatile foundation models whose impressive zero-shot capabilities facilitate OVOD, the possibility of forming a universal understanding of arbitrary objects from these already pre-trained foundation models is usually overlooked. Therefore, in this paper, a training-free Guess What Vision Language Model, called GW-VLM, is proposed to establish such a universal understanding paradigm for OVOD, based on our carefully designed Multi-Scale Visual Language Searching (MS-VLS) coupled with a Contextual Concept Prompt (CCP). This approach engages a pre-trained Vision Language Model (VLM) and a Large Language Model (LLM) in a game of "guess what": MS-VLS leverages multi-scale visual-language soft alignment in the VLM to generate snippets from the results of class-agnostic object detection, while CCP forms a concept flow from the MS-VLS results and guides the LLM to interpret these snippets for OVOD. Finally, extensive experiments are carried out on natural and remote sensing datasets, including COCO val, Pascal VOC, DIOR, and NWPU-10, and the results indicate that our proposed GW-VLM achieves superior OVOD performance compared to state-of-the-art methods without any training step.
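The "guess-and-verify" loop described above can be sketched roughly as follows. This is a minimal toy illustration, not the authors' implementation: `class_agnostic_proposals`, `vlm_soft_align`, and `llm_verify` are hypothetical stand-ins (a real system would plug in a class-agnostic detector, a CLIP-style VLM, and an LLM prompted via CCP), and the scoring here is a deterministic placeholder.

```python
# Toy sketch of the guess-and-verify pipeline described in the abstract.
# All function names and the scoring logic are illustrative assumptions.

def class_agnostic_proposals(image):
    # Stand-in: a real system would run a class-agnostic detector here.
    return [(0, 0, 32, 32), (10, 10, 64, 64)]

def vlm_soft_align(image, box, vocab, scales=(1.0, 1.5, 2.0)):
    # Stand-in for MS-VLS: score each vocabulary term against the region
    # at several crop scales and average. A real VLM would compute
    # image-text similarity; here we use a toy deterministic score.
    scores = {}
    for term in vocab:
        s = sum((hash((box, term, k)) % 100) / 100 for k in scales)
        scores[term] = s / len(scales)
    return scores

def llm_verify(scores, top_k=3):
    # Stand-in for CCP: a real system would prompt an LLM with the
    # top-k candidate concepts; here we simply take the best-scoring term.
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[0][0] if ranked[:top_k] else None

def guess_what(image, vocab):
    # Full loop: propose regions, let the VLM "guess", let the LLM "verify".
    detections = []
    for box in class_agnostic_proposals(image):
        scores = vlm_soft_align(image, box, vocab)
        detections.append((box, llm_verify(scores)))
    return detections
```

The key property this sketch mirrors is that no component is trained: proposals, VLM scoring, and LLM verification are all composed from frozen, pre-trained parts.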