CQ-DINO: Mitigating Gradient Dilution via Category Queries for Vast Vocabulary Object Detection

📅 2025-03-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address gradient dilution of rare positive classes and suppression of discriminative signals from hard negative samples by easy negatives in large-vocabulary object detection, this paper proposes a novel contrastive learning-based category query paradigm. Methodologically: (1) it reformulates category classification as a contrastive matching task between object queries and learnable category queries; (2) it introduces an image-guided dynamic Top-K category retrieval mechanism to rebalance gradients and implicitly mine hard examples; (3) it integrates cross-attention-driven image-guided query selection, self-attention modeling of hierarchical and semantic category relationships, and a contrastive query matching loss. The approach achieves a +2.1% AP improvement on V3Det, substantially outperforming prior methods, while maintaining competitive performance on COCO—demonstrating strong scalability and generalization for large-scale category detection.
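The Top-K retrieval step described above can be sketched in plain numpy. This is an illustrative reading of the mechanism, not the paper's implementation: the function name, the attention-pooled scoring, and all shapes are assumptions; only the idea that category queries cross-attend to image features and the Top-K highest-scoring categories form the per-image label space comes from the summary.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def topk_category_retrieval(img_tokens, cat_queries, k):
    """Hypothetical sketch of image-guided Top-K category retrieval.

    Each learnable category query cross-attends to the image tokens;
    a per-category relevance score ranks the vocabulary, and only the
    Top-K categories survive as that image's (much smaller) label space,
    shrinking the pool of easy negatives.
    """
    d = img_tokens.shape[-1]
    # Cross-attention weights: each category query attends over image tokens.
    attn = softmax(cat_queries @ img_tokens.T / np.sqrt(d), axis=-1)  # (V, N)
    # Attention-pooled image context per category, then a dot-product
    # relevance score between each query and its pooled context.
    context = attn @ img_tokens                      # (V, d)
    scores = (cat_queries * context).sum(axis=-1)    # (V,)
    topk = np.argsort(-scores)[:k]
    return topk, scores

rng = np.random.default_rng(0)
tokens = rng.normal(size=(196, 64))    # flattened image feature tokens
queries = rng.normal(size=(1000, 64))  # a vast vocabulary of category queries
idx, scores = topk_category_retrieval(tokens, queries, k=50)
```

Because only the K retrieved categories enter the loss, gradients concentrate on the retrieved (hence harder) negatives instead of being spread across the full vocabulary.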


📝 Abstract
With the exponential growth of data, traditional object detection methods are increasingly struggling to handle vast vocabulary object detection tasks effectively. We analyze two key limitations of classification-based detectors: positive gradient dilution, where rare positive categories receive insufficient learning signals, and hard negative gradient dilution, where discriminative gradients are overwhelmed by numerous easy negatives. To address these challenges, we propose CQ-DINO, a category query-based object detection framework that reformulates classification as a contrastive task between object queries and learnable category queries. Our method introduces image-guided query selection, which reduces the negative space by adaptively retrieving top-K relevant categories per image via cross-attention, thereby rebalancing gradient distributions and facilitating implicit hard example mining. Furthermore, CQ-DINO flexibly integrates explicit hierarchical category relationships in structured datasets (e.g., V3Det) or learns implicit category correlations via self-attention in generic datasets (e.g., COCO). Experiments demonstrate that CQ-DINO achieves superior performance on the challenging V3Det benchmark (surpassing previous methods by 2.1% AP) while maintaining competitiveness on COCO. Our work provides a scalable solution for real-world detection systems requiring wide category coverage. The dataset and code will be publicly available at https://github.com/RedAIGC/CQ-DINO.
Problem

Research questions and friction points this paper is trying to address.

Addresses gradient dilution in vast vocabulary object detection
Reformulates classification as contrastive task using category queries
Improves performance on large-scale datasets like V3Det and COCO
Innovation

Methods, ideas, or system contributions that make the work stand out.

Category query-based contrastive object detection framework
Image-guided top-K category selection via cross-attention
Flexible hierarchical or self-learned category relationships integration
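The contrastive reformulation listed above, classifying by matching object queries against category queries, can be illustrated with an InfoNCE-style loss. This is a minimal sketch under assumptions: the function name, the cosine-similarity formulation, and the temperature value are hypothetical; the paper's exact contrastive query matching loss may differ.

```python
import numpy as np

def contrastive_matching_loss(obj_queries, cat_queries, labels, tau=0.07):
    """Illustrative InfoNCE-style matching between object queries and the
    (retrieved) category queries. Each object query is pulled toward its
    ground-truth category query and pushed away from the other categories,
    replacing a fixed classifier head over the full vocabulary.
    """
    # Cosine similarity between every object query and every category query.
    o = obj_queries / np.linalg.norm(obj_queries, axis=-1, keepdims=True)
    c = cat_queries / np.linalg.norm(cat_queries, axis=-1, keepdims=True)
    logits = (o @ c.T) / tau                                        # (N, K)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    # Negative log-likelihood of each object's ground-truth category.
    return -log_probs[np.arange(len(labels)), labels].mean()

rng = np.random.default_rng(1)
cats = rng.normal(size=(8, 16))       # category queries (after Top-K retrieval)
labels = np.array([0, 1, 2, 3])       # ground-truth category per object query
obj = rng.normal(size=(4, 16))        # unaligned object queries
loss_random = contrastive_matching_loss(obj, cats, labels)
loss_aligned = contrastive_matching_loss(cats[labels], cats, labels)
```

Because the category side is a set of learnable embeddings rather than a fixed weight matrix, the vocabulary can grow (or be restricted per image) without changing the head's architecture.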
Zhichao Sun
School of Computer Science, Wuhan University; Xiaohongshu Inc.
Huazhang Hu
ShanghaiTech University
Yidong Ma
Xiaohongshu Inc.
Gang Liu
Xiaohongshu Inc.
Nemo Chen
Xiaohongshu Inc.
Xu Tang
Xiaohongshu Inc.
Yongchao Xu
School of Computer Science, Wuhan University