🤖 AI Summary
Existing open-vocabulary object detection methods often suffer from localization bias and semantic drift in fine-grained scenarios due to the entanglement of subject and attribute semantics in vision-language models. To address this, the work proposes GUIDED, a novel framework that decouples the task into two distinct sub-pathways: subject localization and attribute recognition. Specifically, a language model parses each category name to isolate the subject, which alone guides object localization to ensure spatial stability, while an attention mechanism selectively integrates beneficial attributes. Furthermore, a region-level vision-language alignment module with a projection head is introduced to enhance fine-grained discrimination. Evaluated on the FG-OVD and 3F-OVD benchmarks, GUIDED achieves state-of-the-art performance, significantly improving both detection accuracy and robustness.
📝 Abstract
Fine-grained open-vocabulary object detection (FG-OVD) aims to detect novel object categories described by attribute-rich texts. While existing open-vocabulary detectors show promise at the base-category level, they underperform in fine-grained settings due to the semantic entanglement of subjects and attributes in pretrained vision-language model (VLM) embeddings, leading to over-representation of attributes, mislocalization, and semantic drift in embedding space. We propose GUIDED, a decomposition framework specifically designed to address the semantic entanglement between subjects and attributes in fine-grained prompts. By separating object localization and fine-grained recognition into distinct pathways, GUIDED aligns each subtask with the module best suited to its role. Specifically, given a fine-grained class name, we first use a language model to extract a coarse-grained subject and its descriptive attributes. The detector is then guided solely by the subject embedding, ensuring stable localization unaffected by irrelevant or over-represented attributes. To selectively retain helpful attributes, we introduce an attribute embedding fusion module that incorporates attribute information into detection queries in an attention-based manner. This mitigates over-representation while preserving discriminative power. Finally, a region-level attribute discrimination module compares each detected region against full fine-grained class names using a refined vision-language model with a projection head for improved alignment. Extensive experiments on the FG-OVD and 3F-OVD benchmarks show that GUIDED achieves new state-of-the-art results, demonstrating the benefits of disentangled modeling and modular optimization. Our code will be released at https://github.com/lijm48/GUIDED.
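The three stages described above can be sketched schematically. The following is a minimal, hypothetical NumPy illustration, not the authors' implementation: the embeddings are random stand-ins for what a language model and VLM would produce, the attention-based fusion is a single scaled dot-product step, and `W_proj` stands in for the projection head of the region-level alignment module.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # embedding dimension (illustrative)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# --- Stage 1: subject-guided localization ---
# GUIDED uses a language model to split a fine-grained class name
# (e.g. "red vintage sports car") into a subject ("car") and attributes
# (["red", "vintage", "sports"]); random vectors stand in here.
subject_emb = rng.normal(size=d)        # guides localization alone
attr_embs = rng.normal(size=(3, d))     # descriptive attribute embeddings

# --- Stage 2: attention-based attribute embedding fusion ---
# Attributes are folded into the detection query with attention weights,
# so only helpful attributes contribute (mitigating over-representation).
query = subject_emb.copy()
scores = attr_embs @ query / np.sqrt(d)   # scaled dot-product scores
weights = softmax(scores)                 # relevance of each attribute
fused_query = query + weights @ attr_embs # residual attention fusion

# --- Stage 3: region-level attribute discrimination ---
# A projection head maps a detected region's feature into the text space,
# where it is compared against the full fine-grained class-name embedding.
W_proj = rng.normal(size=(d, d)) / np.sqrt(d)  # hypothetical linear head
region_feat = rng.normal(size=d)               # a detected region feature
class_emb = rng.normal(size=d)                 # full class-name embedding

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

match_score = cosine(W_proj @ region_feat, class_emb)
```

The key design point is that `fused_query` (localization pathway) never depends on attributes that receive near-zero attention weight, while `match_score` (recognition pathway) still sees the full fine-grained class name.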