🤖 AI Summary
This work addresses the challenge in open-vocabulary object detection where fine-grained recognition is often hindered by dominant category signals that suppress the binding of attributes—such as color and texture—to object instances. To mitigate this, the authors propose a two-stage attribute activation framework: first, an attribute prefix adapter injects explicit attribute priors during text embedding; second, a Key/Value modulation module enhances the attention representation of attribute-related tokens within the BERT encoding stage. Additionally, an attribute-aware contrastive loss is introduced to improve discrimination among instances of the same category but differing attributes. This approach uniquely integrates prefix-guided prompting with attention modulation in open-vocabulary detection, substantially strengthening attribute semantic representation and binding accuracy. Experiments on the FG-OVD benchmark demonstrate consistent and significant improvements in fine-grained detection performance across multiple state-of-the-art models.
📝 Abstract
Open-Vocabulary Object Detection (OVD) models break the limitations of closed-set detection, enabling the identification of unseen categories through natural language prompts. However, they exhibit notable limitations in fine-grained detection tasks involving attributes like color, material, and texture. We attribute this performance bottleneck in OVD models to a core issue: when category signals dominate, OVD models tend to marginalize attribute information during inference. This leads to incorrect binding between attributes and target objects. To address this, we propose the Dual-Stage Attribute Activation (DSAA) framework, which enhances fine-grained detection capabilities by strengthening attribute semantics at two critical stages. In the text embedding stage, we employ an Attribute Prefix Adapter (APA) module to generate attribute prefixes that inject explicit attribute priors. To further amplify the influence of these attributes, our Key/Value (K/V) Modulator module then intervenes during the BERT encoding phase, selectively enhancing the Key and Value vectors of the corresponding attribute tokens. In addition, we introduce an attribute-aware contrastive loss to improve discrimination among same-category instances with different attributes during training. Experimental results on the FG-OVD benchmark demonstrate the effectiveness of our method across various mainstream open-vocabulary models.
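The K/V modulation idea from the abstract can be illustrated with a toy single-head attention layer. This is only a minimal sketch under stated assumptions, not the authors' implementation: the fixed gains `key_gain` and `value_gain`, the hand-written `attribute_mask`, the two-dimensional embeddings, and the helper names `attention` and `modulate_kv` are all hypothetical, and the abstract does not say how the real modulator computes its scaling. Note also that scaling a key raises its attention weight only when the query-key dot product is positive, which holds for the toy vectors chosen here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    dim = len(keys[0])
    all_weights, outputs = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        mixed = [
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ]
        all_weights.append(weights)
        outputs.append(mixed)
    return outputs, all_weights


def modulate_kv(keys, values, attribute_mask, key_gain=1.5, value_gain=1.5):
    """Scale the Key and Value vectors of attribute tokens only.

    The constant gains are an assumption made for illustration; other
    tokens are copied unchanged.
    """
    scaled_keys = [
        [x * key_gain for x in k] if flag else list(k)
        for k, flag in zip(keys, attribute_mask)
    ]
    scaled_values = [
        [x * value_gain for x in v] if flag else list(v)
        for v, flag in zip(values, attribute_mask)
    ]
    return scaled_keys, scaled_values


# Toy prompt "red car": "red" is the attribute token, "car" the category token.
tokens = ["red", "car"]
attribute_mask = [True, False]
embeddings = [[1.0, 0.5], [0.5, 1.0]]  # hypothetical token embeddings
queries = keys = values = embeddings   # identity projections, for simplicity

_, base_weights = attention(queries, keys, values)
mod_keys, mod_values = modulate_kv(keys, values, attribute_mask)
_, mod_weights = attention(queries, mod_keys, mod_values)

for token, before, after in zip(tokens, base_weights, mod_weights):
    print(f"{token}: attention on 'red' {before[0]:.3f} -> {after[0]:.3f}")
```

In this toy setup, every query places more attention on the attribute token after modulation, and the attribute's Value vector contributes with a larger magnitude to the mixed output. That is the qualitative effect the abstract describes as amplifying the influence of attributes.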