Open-Det: An Efficient Learning Framework for Open-Ended Detection

📅 2025-05-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenges of existing open-ended object detection (OED) models—heavy reliance on large-scale annotated datasets, slow convergence, and limited performance—this paper proposes Open-Det, an efficient end-to-end framework that detects objects and generates their category names without requiring a vocabulary at inference. Methodologically: (1) a four-module collaborative architecture built around a reconstructed Object Detector and Object Name Generator; (2) a Vision-Language Aligner providing bidirectional V-to-L and L-to-V alignment, combined with a Prompts Distiller that transfers knowledge from the VLM into VL-prompts for the LLM; (3) a Masked Alignment Loss and a Joint Loss that eliminate contradictory supervision and strengthen classification. Experiments show Open-Det surpasses GenerateU by 1.0% APr while using only 1.5% of the training data, 20.8% of the training epochs, and significantly fewer GPU resources—substantially improving both training efficiency and detection accuracy.

📝 Abstract
Open-Ended object Detection (OED) is a novel and challenging task that detects objects and generates their category names in a free-form manner, without requiring additional vocabularies during inference. However, existing OED models, such as GenerateU, require large-scale datasets for training, suffer from slow convergence, and exhibit limited performance. To address these issues, we present a novel and efficient Open-Det framework, consisting of four collaborative parts. Specifically, Open-Det accelerates model training in both the bounding box and object name generation processes by reconstructing the Object Detector and the Object Name Generator. To bridge the semantic gap between the Vision and Language modalities, we propose a Vision-Language Aligner with V-to-L and L-to-V alignment mechanisms, together with a Prompts Distiller that transfers knowledge from the VLM into VL-prompts, enabling accurate object name generation by the LLM. In addition, we design a Masked Alignment Loss to eliminate contradictory supervision and introduce a Joint Loss to enhance classification, resulting in more efficient training. Compared to GenerateU, Open-Det, using only 1.5% of the training data (0.077M vs. 5.077M), 20.8% of the training epochs (31 vs. 149), and fewer GPU resources (4 V100 vs. 16 A100), achieves even higher performance (+1.0% in APr). The source codes are available at: https://github.com/Med-Process/Open-Det.
Problem

Research questions and friction points this paper is trying to address.

Existing OED models demand large-scale training data and heavy GPU resources
Slow convergence and limited detection performance in prior work
Semantic gap between vision and language modalities hinders accurate object name generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reconstructs the Object Detector and Object Name Generator for faster training
Proposes a Vision-Language Aligner with bidirectional V-to-L and L-to-V alignment, plus a Prompts Distiller
Introduces a Masked Alignment Loss and a Joint Loss to remove contradictory supervision and strengthen classification
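The Masked Alignment Loss above can be illustrated with a minimal sketch: a bidirectional (V-to-L and L-to-V) contrastive objective in which off-diagonal pairs sharing the same class label are masked out of the negatives, so that same-class region–text pairs do not act as contradictory supervision. Note this is a hypothetical illustration under that assumption, not the paper's exact formulation; the function name and arguments are invented for this sketch.

```python
import numpy as np

def masked_alignment_loss(v_emb, l_emb, labels, tau=0.07):
    """Hedged sketch of a bidirectional masked alignment loss.

    v_emb, l_emb: (N, D) vision and language embeddings, row i paired.
    labels: class label per pair; same-class off-diagonal entries are
    masked so they are not treated as negatives (contradictory supervision).
    """
    # L2-normalize both modalities so the dot product is cosine similarity.
    v = v_emb / np.linalg.norm(v_emb, axis=1, keepdims=True)
    l = l_emb / np.linalg.norm(l_emb, axis=1, keepdims=True)
    sim = v @ l.T / tau  # (N, N) similarity logits

    labels = np.asarray(labels)
    same_class = labels[:, None] == labels[None, :]
    # Mask: same-class pairs that are NOT the matched diagonal pair.
    mask = same_class & ~np.eye(len(labels), dtype=bool)
    sim = np.where(mask, -np.inf, sim)  # drop contradictory negatives

    def ce(logits):
        # Softmax cross-entropy with the diagonal as the target match.
        logits = logits - logits.max(axis=1, keepdims=True)
        p = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
        idx = np.arange(len(p))
        return -np.log(p[idx, idx]).mean()

    # Average V-to-L (rows) and L-to-V (columns) directions.
    return 0.5 * (ce(sim) + ce(sim.T))
```

Masking (rather than re-weighting) is the simplest way to express "eliminate contradictory supervision": a masked entry contributes exactly zero probability mass as a negative, instead of a down-weighted penalty.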
Guiping Cao
PCL; SUSTech; CVTE Research; XJTU
Deep Learning · Computer Vision · Medical Image Processing
Wenjian Huang
Peking University
BioMedical Image & Signal Processing · Machine Learning · Artificial Intelligence · Statistical Learning · Computer Vision
Xiangyuan Lan
Pengcheng Laboratory
Multimodal LLM · Place Recognition · Visual Tracking · Person Re-identification · Object Detection
Jianguo Zhang
Research Institute of Trustworthy Autonomous Systems and Department of Computer Science and Engineering, Southern University of Science and Technology, Shenzhen 518055, China; Guangdong Provincial Key Laboratory of Brain-inspired Intelligent Computation, Department of Computer Science and Engineering, Southern University of Science and Technology, Shenzhen 518055, China; Pengcheng Laboratory, Shenzhen, China
Dongmei Jiang
Northwestern Polytechnical University; Peng Cheng Laboratory
Affective Computing · Multimodal Emotion Recognition · Multimodal Mental State Evaluation