🤖 AI Summary
Traditional object detectors rely on cross-entropy classification, which is sensitive to class imbalance and label noise. To address this, we propose a detector-agnostic, end-to-end joint-training framework that maps region/grid visual features into the CLIP text-embedding space. Our method employs learnable class-specific text embeddings and a lightweight parallel head to jointly optimize a contrastive (InfoNCE) loss alongside the standard detection losses. This is the first work to unify vision-language contrastive supervision with object detection in a single end-to-end model, integrating seamlessly with both two-stage (e.g., Faster R-CNN) and one-stage (e.g., YOLOv11) architectures. Experiments on Pascal VOC and MS COCO show consistent, substantial gains in closed-set detection accuracy over baselines, confirmed by extensive cross-architecture and cross-dataset evaluations, while maintaining real-time inference speed.
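As one concrete reading of this design, here is a minimal PyTorch sketch of the parallel head: it assumes a single linear projection and a CLIP-style learnable temperature, and the names (`CLIPAlignHead`, `info_nce_loss`, `feat_dim`, `embed_dim`) are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CLIPAlignHead(nn.Module):
    """Illustrative parallel head: projects detector region/grid features
    into the CLIP embedding space and scores them against learnable
    class-specific text embeddings. Hypothetical sketch, not the paper's code."""

    def __init__(self, feat_dim, embed_dim, num_classes, text_init=None):
        super().__init__()
        self.proj = nn.Linear(feat_dim, embed_dim)  # lightweight projection
        # Learnable per-class text embeddings; plausibly initialized from a
        # frozen CLIP text encoder run on the class names (an assumption).
        init = text_init if text_init is not None else torch.randn(num_classes, embed_dim)
        self.text_embeds = nn.Parameter(init.clone())
        # CLIP-style learnable temperature, stored in log scale.
        self.logit_scale = nn.Parameter(torch.log(torch.tensor(1 / 0.07)))

    def forward(self, feats):
        v = F.normalize(self.proj(feats), dim=-1)   # (N, D) projected features
        t = F.normalize(self.text_embeds, dim=-1)   # (C, D) class embeddings
        return self.logit_scale.exp() * v @ t.t()   # (N, C) similarity logits

def info_nce_loss(logits, labels):
    # With fixed class embeddings as the contrastive candidates, InfoNCE over
    # classes reduces to softmax cross-entropy on the similarity logits.
    return F.cross_entropy(logits, labels)
```

Note that with class embeddings as the only contrastive candidates, InfoNCE coincides with a temperature-scaled cross-entropy over cosine similarities, so the head adds only a projection layer and a C×D embedding table of overhead, consistent with the "lightweight" description above.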
📝 Abstract
Conventional object detectors rely on cross-entropy classification, which can be vulnerable to class imbalance and label noise. We propose CLIP-Joint-Detect, a simple and detector-agnostic framework that integrates CLIP-style contrastive vision-language supervision through end-to-end joint training. A lightweight parallel head projects region or grid features into the CLIP embedding space and aligns them with learnable class-specific text embeddings via InfoNCE contrastive loss and an auxiliary cross-entropy term, while all standard detection losses are optimized simultaneously. The approach applies seamlessly to both two-stage and one-stage architectures. We validate it on Pascal VOC 2007+2012 using Faster R-CNN and on the large-scale MS COCO 2017 benchmark using modern YOLO detectors (YOLOv11), achieving consistent and substantial improvements while preserving real-time inference speed. Extensive experiments and ablations demonstrate that joint optimization with learnable text embeddings markedly enhances closed-set detection performance across diverse architectures and datasets.
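The joint optimization described here could look like the training step below. The detector interface, the loss weights (`lam_nce`, `lam_ce`), and the exact form of the auxiliary cross-entropy are all assumptions for illustration: the abstract states only that the contrastive, auxiliary, and standard detection losses are optimized simultaneously.

```python
import torch.nn.functional as F

def training_step(detector, align_head, images, targets,
                  lam_nce=1.0, lam_ce=0.5):
    """One hypothetical joint-training iteration; interface and weights
    are illustrative assumptions, not values from the paper."""
    # Assumed interface: the base detector returns its standard losses plus
    # pooled region (two-stage) or grid-cell (one-stage) features with their
    # matched class labels.
    det_losses, feats, labels = detector(images, targets)
    logits = align_head(feats)                     # (N, C) similarity logits
    nce = F.cross_entropy(logits, labels)          # InfoNCE over class embeddings
    # Placeholder for the auxiliary cross-entropy term, computed here on
    # temperature-free logits; the abstract does not specify its exact form.
    aux_ce = F.cross_entropy(logits / align_head.logit_scale.exp(), labels)
    # Joint objective: all standard detection losses plus the contrastive terms.
    return sum(det_losses.values()) + lam_nce * nce + lam_ce * aux_ce
```

Because the contrastive head runs in parallel to the detector's own branches and adds only a projection and an embedding lookup, this reading is compatible with the preserved real-time inference speed the abstract reports.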