🤖 AI Summary
Traditional object detectors rely on cross-entropy classification, which is sensitive to class imbalance and label noise. To address this, we propose a detector-agnostic, end-to-end joint-training framework that maps region/grid visual features into the CLIP text-embedding space. Our method employs learnable class-specific text embeddings and a lightweight parallel head to jointly optimize a contrastive (InfoNCE) loss alongside the standard detection losses. This is the first work to unify vision-language contrastive supervision with object detection in a single end-to-end model, integrating seamlessly with both two-stage (e.g., Faster R-CNN) and one-stage (e.g., YOLOv11) architectures. Experiments on Pascal VOC and MS COCO show consistent, substantial gains in closed-set detection accuracy over baselines, confirmed by extensive cross-architecture and cross-dataset evaluations, while maintaining real-time inference speed.
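As one concrete reading of this design, here is a minimal PyTorch sketch of the parallel head: it assumes a single linear projection and a CLIP-style learnable temperature, and the names (`CLIPAlignHead`, `info_nce_loss`, `feat_dim`, `embed_dim`) are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CLIPAlignHead(nn.Module):
    """Illustrative parallel head: projects detector region/grid features
    into the CLIP embedding space and scores them against learnable
    class-specific text embeddings. Hypothetical sketch, not the paper's code."""

    def __init__(self, feat_dim, embed_dim, num_classes, text_init=None):
        super().__init__()
        self.proj = nn.Linear(feat_dim, embed_dim)  # lightweight projection
        # Learnable per-class text embeddings; plausibly initialized from a
        # frozen CLIP text encoder run on the class names (an assumption).
        init = text_init if text_init is not None else torch.randn(num_classes, embed_dim)
        self.text_embeds = nn.Parameter(init.clone())
        # CLIP-style learnable temperature, stored in log scale.
        self.logit_scale = nn.Parameter(torch.log(torch.tensor(1 / 0.07)))

    def forward(self, feats):
        v = F.normalize(self.proj(feats), dim=-1)   # (N, D) projected features
        t = F.normalize(self.text_embeds, dim=-1)   # (C, D) class embeddings
        return self.logit_scale.exp() * v @ t.t()   # (N, C) similarity logits

def info_nce_loss(logits, labels):
    # With fixed class embeddings as the contrastive candidates, InfoNCE over
    # classes reduces to softmax cross-entropy on the similarity logits.
    return F.cross_entropy(logits, labels)
```

Note that with class embeddings as the only contrastive candidates, InfoNCE coincides with a temperature-scaled cross-entropy over cosine similarities, so the head adds only a projection layer and a C×D embedding table of overhead, consistent with the "lightweight" description above.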
📝 Abstract
Conventional object detectors rely on cross-entropy classification, which can be vulnerable to class imbalance and label noise. We propose CLIP-Joint-Detect, a simple and detector-agnostic framework that integrates CLIP-style contrastive vision-language supervision through end-to-end joint training. A lightweight parallel head projects region or grid features into the CLIP embedding space and aligns them with learnable class-specific text embeddings via InfoNCE contrastive loss and an auxiliary cross-entropy term, while all standard detection losses are optimized simultaneously. The approach applies seamlessly to both two-stage and one-stage architectures. We validate it on Pascal VOC 2007+2012 using Faster R-CNN and on the large-scale MS COCO 2017 benchmark using modern YOLO detectors (YOLOv11), achieving consistent and substantial improvements while preserving real-time inference speed. Extensive experiments and ablations demonstrate that joint optimization with learnable text embeddings markedly enhances closed-set detection performance across diverse architectures and datasets.
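The joint optimization described here could look like the training step below. The detector interface, the loss weights (`lam_nce`, `lam_ce`), and the exact form of the auxiliary cross-entropy are all assumptions for illustration: the abstract states only that the contrastive, auxiliary, and standard detection losses are optimized simultaneously.

```python
import torch.nn.functional as F

def training_step(detector, align_head, images, targets,
                  lam_nce=1.0, lam_ce=0.5):
    """One hypothetical joint-training iteration; interface and weights
    are illustrative assumptions, not values from the paper."""
    # Assumed interface: the base detector returns its standard losses plus
    # pooled region (two-stage) or grid-cell (one-stage) features with their
    # matched class labels.
    det_losses, feats, labels = detector(images, targets)
    logits = align_head(feats)                     # (N, C) similarity logits
    nce = F.cross_entropy(logits, labels)          # InfoNCE over class embeddings
    # Placeholder for the auxiliary cross-entropy term, computed here on
    # temperature-free logits; the abstract does not specify its exact form.
    aux_ce = F.cross_entropy(logits / align_head.logit_scale.exp(), labels)
    # Joint objective: all standard detection losses plus the contrastive terms.
    return sum(det_losses.values()) + lam_nce * nce + lam_ce * aux_ce
```

Because the contrastive head runs in parallel to the detector's own branches and adds only a projection and an embedding lookup, this reading is compatible with the preserved real-time inference speed the abstract reports.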