Cyclic Contrastive Knowledge Transfer for Open-Vocabulary Object Detection

📅 2025-03-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses open-vocabulary object detection (OVD), aiming for zero-shot detection of unseen categories without auxiliary supervision such as image-text pairs or pseudo-labels, while mitigating semantic misalignment between upstream vision-language model (VLM) pretraining and downstream region-level perception. To this end, we propose an unsupervised cyclic contrastive knowledge transfer mechanism: it jointly optimizes query generation and region-level contrastive learning via dynamic interaction between language queries and visual region features; incorporates semantic priors to guide novel-category awareness; and enables efficient distillation of VLM’s visual-semantic space into the detector. The method exhibits consistent performance gains with increasing VLM scale—achieving +2.9% AP50 on COCO without a strong teacher and +10.2% AP50 with one—substantially surpassing state-of-the-art approaches.

📝 Abstract
In pursuit of detecting unrestricted objects that extend beyond predefined categories, prior art in open-vocabulary object detection (OVD) typically resorts to pretrained vision-language models (VLMs) for base-to-novel category generalization. However, to mitigate the misalignment between upstream image-text pretraining and downstream region-level perception, additional supervision is indispensable, e.g., image-text pairs or pseudo annotations generated via self-training strategies. In this work, we propose CCKT-Det, trained without any extra supervision. The proposed framework constructs a cyclic and dynamic knowledge transfer between language queries and visual region features extracted from VLMs, which forces the detector to closely align with the visual-semantic space of VLMs. Specifically, 1) we prefilter and inject semantic priors to guide the learning of queries, and 2) introduce a regional contrastive loss to improve the awareness of queries on novel objects. CCKT-Det consistently improves performance as the scale of the VLM increases, while requiring only moderate computational overhead from the detector. Comprehensive experimental results demonstrate that our method achieves performance gains of +2.9% and +10.2% AP50 over previous state-of-the-art on the challenging COCO benchmark, without and with a stronger teacher model, respectively. The code is provided at https://github.com/ZCHUHan/CCKT-Det.
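The regional contrastive loss mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation; it assumes a standard symmetric InfoNCE formulation in which each detector object query is pulled toward its matched region embedding from the frozen VLM, with all other pairs in the batch serving as negatives. The function name, shapes, and temperature value are illustrative choices.

```python
import torch
import torch.nn.functional as F

def regional_contrastive_loss(query_feats, region_feats, temperature=0.07):
    """InfoNCE-style loss aligning detector queries with VLM region features.

    query_feats:  (N, D) object-query embeddings from the detector
    region_feats: (N, D) matched region embeddings from the frozen VLM
    Positives are the diagonal pairs; all other pairs act as negatives.
    Hypothetical sketch, not the authors' code.
    """
    q = F.normalize(query_feats, dim=-1)
    r = F.normalize(region_feats, dim=-1)
    logits = q @ r.t() / temperature              # (N, N) cosine similarities
    targets = torch.arange(q.size(0), device=q.device)
    # Symmetric loss: query -> region and region -> query directions.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```

With perfectly matched pairs the loss approaches zero; shuffling the region features against the queries drives it up, which is the alignment pressure the abstract describes.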
Problem

Research questions and friction points this paper is trying to address.

Detecting objects beyond predefined categories with pretrained vision-language models.
Aligning the detector with the VLM's visual-semantic space without extra supervision or annotations.
Improving detection of novel objects while keeping computational overhead moderate.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cyclic knowledge transfer without extra supervision
Prefiltered semantic priors guide query learning
Regional contrastive loss enhances novel object awareness
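The "prefiltered semantic priors" bullet can be sketched as follows. This is a hypothetical reading of the step, assuming the priors are the category text embeddings most similar to the VLM's global image embedding, which are then injected into the detector's queries; the function name, `top_k` value, and shapes are illustrative, not taken from the paper's code.

```python
import torch
import torch.nn.functional as F

def prefilter_semantic_priors(image_embed, text_embeds, top_k=5):
    """Select the top-k category text embeddings most similar to the image.

    image_embed: (D,)   global image embedding from the frozen VLM
    text_embeds: (C, D) text embeddings for all candidate category names
    Returns a (top_k, D) tensor of priors to inject into detector queries.
    Hypothetical sketch of the prefiltering step, not the authors' code.
    """
    img = F.normalize(image_embed, dim=-1)
    txt = F.normalize(text_embeds, dim=-1)
    scores = txt @ img                        # (C,) cosine similarity per class
    idx = scores.topk(top_k).indices          # indices of the best candidates
    return text_embeds[idx]
```

Restricting the query guidance to a handful of likely categories per image is what lets the priors steer the queries toward plausible novel objects without extra annotation.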