A Training-Free Guess What Vision Language Model: From Snippets to Open-Vocabulary Object Detection

📅 2026-01-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited general object understanding in open-vocabulary object detection (OVOD) by proposing GW-VLM, the first training-free zero-shot detection method. GW-VLM synergistically leverages pre-trained vision-language models (VLMs) and large language models (LLMs) through a “guess-and-verify” mechanism: it employs multi-scale vision-language search (MS-VLS) to generate candidate image regions and uses contextual concept prompting (CCP) to guide the LLM in semantic reasoning. This paradigm requires no fine-tuning or additional training yet consistently outperforms state-of-the-art methods across multiple natural and remote sensing benchmarks—including COCO, Pascal VOC, DIOR, and NWPU-10—significantly advancing the performance and generalizability of zero-shot open-vocabulary detection.

📝 Abstract
Open-Vocabulary Object Detection (OVOD) aims to develop the capability to detect anything. Although myriad large-scale pre-training efforts have built versatile foundation models whose impressive zero-shot capabilities facilitate OVOD, the need to build a universal understanding of arbitrary objects on top of already pre-trained foundation models is usually overlooked. Therefore, in this paper, a training-free Guess What Vision Language Model, called GW-VLM, is proposed to form a universal understanding paradigm for OVOD based on our carefully designed Multi-Scale Visual Language Searching (MS-VLS) coupled with a Contextual Concept Prompt (CCP). This approach engages a pre-trained Vision Language Model (VLM) and a Large Language Model (LLM) in a game of "guess what": MS-VLS leverages multi-scale visual-language soft-alignment in the VLM to generate snippets from the results of class-agnostic object detection, while CCP forms a concept flow from MS-VLS so that the LLM can understand the snippets for OVOD. Finally, extensive experiments are carried out on natural and remote sensing datasets, including COCO val, Pascal VOC, DIOR, and NWPU-10, and the results indicate that the proposed GW-VLM achieves superior OVOD performance compared with state-of-the-art methods without any training step.
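The guess-and-verify pipeline described above can be sketched in a few lines. This is a minimal, illustrative toy, not the authors' implementation: the `embed_text`/`embed_region` functions below stand in for a frozen CLIP-style VLM, and the names `ms_vls` and `ccp_prompt` are hypothetical labels for the two stages, assuming MS-VLS averages region-text cosine similarities over several crop scales and CCP folds the resulting guesses into a verification prompt for the LLM.

```python
import numpy as np

VOCAB = {  # toy text-embedding table standing in for a VLM text encoder
    "cat":      [1.0, 0.1, 0.0],
    "dog":      [0.1, 1.0, 0.0],
    "airplane": [0.0, 0.1, 1.0],
}

def embed_text(label):
    v = np.asarray(VOCAB[label], dtype=float)
    return v / np.linalg.norm(v)

def embed_region(feat):
    v = np.asarray(feat, dtype=float)
    return v / np.linalg.norm(v)

def ms_vls(proposals, labels):
    """Multi-scale visual-language soft-alignment (sketch): each
    class-agnostic proposal carries one feature per crop scale; scores
    are cosine similarities averaged over scales, and the best-scoring
    label becomes the snippet's guess."""
    snippets = []
    for box, feats_per_scale in proposals:
        scores = {
            label: float(np.mean([embed_region(f) @ embed_text(label)
                                  for f in feats_per_scale]))
            for label in labels
        }
        guess = max(scores, key=scores.get)
        snippets.append({"box": box, "guess": guess, "score": scores[guess]})
    return snippets

def ccp_prompt(snippets):
    """Contextual Concept Prompt (sketch): fold the snippet guesses into
    one prompt for the LLM to verify (the 'verify' half is not run here)."""
    body = "; ".join(f"a region that looks like '{s['guess']}' "
                     f"(alignment {s['score']:.2f})" for s in snippets)
    return f"The detector found: {body}. Verify each concept in context."

# Two class-agnostic proposals, each with features at three crop scales.
proposals = [
    ((10, 10, 80, 80),  [[0.9, 0.2, 0.0], [1.0, 0.1, 0.1], [0.8, 0.3, 0.0]]),
    ((5, 120, 60, 200), [[0.0, 0.2, 0.9], [0.1, 0.0, 1.0], [0.0, 0.1, 0.8]]),
]
snippets = ms_vls(proposals, labels=list(VOCAB))
print([s["guess"] for s in snippets])  # → ['cat', 'airplane']
print(ccp_prompt(snippets))
```

No parameter of either stage is trained, which mirrors the paper's training-free claim: the only learned components are the frozen VLM (mocked here) and the LLM that consumes the prompt.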
Problem

Research questions and friction points this paper is trying to address.

Open-Vocabulary Object Detection
Universal Understanding
Vision Language Model
Training-Free
Object Cognition
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-Free
Open-Vocabulary Object Detection
Vision Language Model
Multi-Scale Visual Language Searching
Contextual Concept Prompt
Authors

Guiying Zhu, Aerospace and Informatics Domain, Beijing Institute of Technology, Zhuhai, China
Bowen Yang, National Key Laboratory of Science and Technology on Space-Born Intelligent Information Processing, School of Information & Electronics, Beijing Institute of Technology, Beijing, China
Zhuang Yin, Aerospace and Informatics Domain, Beijing Institute of Technology, Zhuhai, China
Tong Zhang, Peking University
Guanqun Wang, National Key Laboratory of Science and Technology on Space-Born Intelligent Information Processing, School of Information & Electronics, Beijing Institute of Technology, Beijing, China
Zhihao Che, Aerospace and Informatics Domain, Beijing Institute of Technology, Zhuhai, China
He Chen, Chinese University of Hong Kong
Lianlin Li, School of Electronics, Peking University, Beijing, China