Enhancing Open-Vocabulary Object Detection through Multi-Level Fine-Grained Visual-Language Alignment

📅 2026-01-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
Traditional object detection is constrained by predefined categories and struggles to generalize to novel classes. Existing open-vocabulary approaches often fall short in multi-scale vision-language alignment and knowledge transfer. To address these limitations, this work proposes VLDet, a framework that reconstructs the feature pyramid to enable fine-grained, multi-level alignment between visual and textual representations. It introduces a novel VL-PUB module tailored for CLIP backbones and a SigRPN block with a sigmoid-based anchor-text contrastive loss to fuse vision-language knowledge effectively. The method achieves 58.7 AP on novel categories of COCO2017 and 24.8 AP on LVIS, surpassing the prior state of the art by 27.6% and 6.9%, respectively, while also demonstrating strong zero-shot detection performance in closed-set scenarios.

📝 Abstract
Traditional object detection systems are typically constrained to predefined categories, limiting their applicability in dynamic environments. In contrast, open-vocabulary object detection (OVD) enables the identification of objects from novel classes absent from the training set. Recent advances in visual-language modeling have led to significant progress in OVD. However, prior works struggle either to adapt the single-scale image backbone of CLIP to the detection framework or to ensure robust visual-language alignment. We propose Visual-Language Detection (VLDet), a novel framework that revamps the feature pyramid for fine-grained visual-language alignment, leading to improved OVD performance. With the VL-PUB module, VLDet effectively exploits the visual-language knowledge in CLIP and adapts the backbone for object detection through the feature pyramid. In addition, we introduce the SigRPN block, which incorporates a sigmoid-based anchor-text contrastive alignment loss to improve detection of novel categories. In extensive experiments, our approach achieves 58.7 AP for novel classes on COCO2017 and 24.8 AP on LVIS, surpassing all state-of-the-art methods with significant improvements of 27.6% and 6.9%, respectively. Furthermore, VLDet also demonstrates superior zero-shot performance on closed-set object detection.
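The abstract names a sigmoid-based anchor-text contrastive alignment loss but does not give its formulation. Below is a minimal sketch of one plausible version, following the SigLIP-style pairwise sigmoid objective: each anchor embedding is scored against each class-name text embedding, and every anchor-class pair contributes an independent binary term. The function name, the `scale`/`bias` defaults, and the label-matrix pairing scheme are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def sigmoid_anchor_text_loss(anchor_emb, text_emb, labels, scale=10.0, bias=-10.0):
    """Sketch of a sigmoid-based anchor-text contrastive loss (SigLIP-style).

    anchor_emb: (N, D) region/anchor embeddings from the detector
    text_emb:   (C, D) class-name text embeddings (e.g. from a CLIP text encoder)
    labels:     (N, C) {0,1} matrix; labels[i, j] = 1 if anchor i matches class j
    """
    # L2-normalize both sides so the dot product is a cosine similarity
    a = anchor_emb / np.linalg.norm(anchor_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = scale * a @ t.T + bias           # (N, C) pairwise similarity logits
    z = 2.0 * labels - 1.0                    # map {0,1} labels to {-1,+1}
    # numerically stable -log(sigmoid(z * logits)) = log(1 + exp(-z * logits))
    return np.mean(np.logaddexp(0.0, -z * logits))
```

Unlike a softmax contrastive loss, each anchor-text pair is an independent binary decision, so the loss needs no normalization over the class set and extends naturally to open vocabularies where the set of text prompts changes at test time.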
Problem

Research questions and friction points this paper is trying to address.

open-vocabulary object detection
visual-language alignment
novel class detection
object detection
zero-shot detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

open-vocabulary object detection
visual-language alignment
feature pyramid
CLIP adaptation
zero-shot detection