GUIDED: Granular Understanding via Identification, Detection, and Discrimination for Fine-Grained Open-Vocabulary Object Detection

📅 2026-03-27
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing open-vocabulary object detection methods often suffer from localization bias and semantic drift in fine-grained scenarios due to the entanglement of subject and attribute semantics in vision-language models. To address this, this work proposes GUIDED, a novel framework that decouples the task into two distinct sub-pathways: subject localization and attribute recognition. Specifically, a language model parses category names to isolate the subject, which solely guides object localization to ensure spatial stability, while an attention mechanism selectively integrates beneficial attributes. Furthermore, a region-level visionโ€“language alignment module with a projection head is introduced to enhance fine-grained discrimination. Evaluated on the FG-OVD and 3F-OVD benchmarks, GUIDED achieves state-of-the-art performance, significantly improving both detection accuracy and robustness.
๐Ÿ“ Abstract
Fine-grained open-vocabulary object detection (FG-OVD) aims to detect novel object categories described by attribute-rich texts. While existing open-vocabulary detectors show promise at the base-category level, they underperform in fine-grained settings due to the semantic entanglement of subjects and attributes in pretrained vision-language model (VLM) embeddings -- leading to over-representation of attributes, mislocalization, and semantic drift in embedding space. We propose GUIDED, a decomposition framework specifically designed to address the semantic entanglement between subjects and attributes in fine-grained prompts. By separating object localization and fine-grained recognition into distinct pathways, GUIDED aligns each subtask with the module best suited to its role. Specifically, given a fine-grained class name, we first use a language model to extract a coarse-grained subject and its descriptive attributes. The detector is then guided solely by the subject embedding, ensuring stable localization unaffected by irrelevant or over-represented attributes. To selectively retain helpful attributes, we introduce an attribute embedding fusion module that incorporates attribute information into detection queries in an attention-based manner. This mitigates over-representation while preserving discriminative power. Finally, a region-level attribute discrimination module compares each detected region against full fine-grained class names using a refined vision-language model with a projection head for improved alignment. Extensive experiments on the FG-OVD and 3F-OVD benchmarks show that GUIDED achieves new state-of-the-art results, demonstrating the benefits of disentangled modeling and modular optimization. Our code will be released at https://github.com/lijm48/GUIDED.
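The attention-based attribute fusion described in the abstract can be sketched roughly as follows. This is a minimal, hypothetical illustration under assumed details (the function names, the single-head scaled dot-product form, and the residual connection are not taken from the paper): the subject embedding serves as the query over the attribute embeddings, so relevant attributes are folded into the detection query with soft weights instead of dominating it.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fuse_attributes(subject_emb, attr_embs):
    """Hypothetical attribute-fusion sketch: the subject embedding is the
    attention query, attribute embeddings are keys/values. Attributes with
    low relevance to the subject receive small weights, mitigating the
    over-representation problem, while the residual sum keeps the
    subject-driven localization signal intact."""
    d = len(subject_emb)
    scores = [dot(subject_emb, a) / math.sqrt(d) for a in attr_embs]
    weights = softmax(scores)
    fused = [sum(w * a[i] for w, a in zip(weights, attr_embs))
             for i in range(d)]
    # Residual connection: detection query = subject + attended attributes.
    return [s + f for s, f in zip(subject_emb, fused)]
```

In a real detector these vectors would be VLM text embeddings and the projections would be learned; the sketch only shows why attention-based fusion can retain helpful attributes without letting them override the subject.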
Problem

Research questions and friction points this paper is trying to address.

fine-grained open-vocabulary object detection
semantic entanglement
attribute over-representation
mislocalization
vision-language model
Innovation

Methods, ideas, or system contributions that make the work stand out.

fine-grained open-vocabulary object detection
semantic disentanglement
attribute embedding fusion
modular decomposition
vision-language alignment
🔎 Similar Papers
No similar papers found.
Jiaming Li
University of Chinese Academy of Sciences
NLP · Alignment · Generation · Interpretability
Zhijia Liang
School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China
Weikai Chen
Principal Research Scientist, Tencent America
3D AIGC · 3D Vision · Computer graphics · VLM
Lin Ma
Meituan
Multimodal LLM · Computer Vision
Guanbin Li
School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China; GuangDong Province Key Laboratory of Information Security Technology, China; Research Institute, Sun Yat-sen University, Shenzhen, China