A Lightweight Modular Framework for Low-Cost Open-Vocabulary Object Detection Training

📅 2024-08-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing open-vocabulary detection (OVD) methods rely on full-parameter fine-tuning of vision-language backbones, incurring high computational cost; parameter-efficient alternatives struggle with adapter-layer selection and accuracy-efficiency trade-offs. This paper proposes LightMDETR, which freezes the MDETR backbone and trains only a lightweight Universal Projection (UP) module together with a learnable modality token, enabling region-level joint vision-language adaptation. The newly introduced UP module achieves cross-modal alignment without adding any parameters to the backbone. Evaluated on phrase grounding, referring expression comprehension, and segmentation, LightMDETR surpasses multiple state-of-the-art methods while reducing trainable parameters by 60.3%, accelerating inference by 2.3×, and improving mean Average Precision (mAP) by 1.7%.

📝 Abstract
Object detection is a fundamental challenge in computer vision, centered on recognizing objects within images, with diverse applications in areas like image analysis, robotics, and autonomous vehicles. Although existing methods have achieved great success, they are often constrained by a fixed vocabulary of objects. To overcome this limitation, approaches like MDETR have redefined object detection by incorporating region-level vision-language pre-training, enabling open-vocabulary object detectors. However, these methods are computationally heavy due to the simultaneous training of large models for both vision and language representations. To address this, we introduce a lightweight framework that significantly reduces the number of parameters while preserving, or even improving, performance. Our solution is applied to MDETR, resulting in the development of Lightweight MDETR (LightMDETR), an optimized version of MDETR designed to enhance computational efficiency without sacrificing accuracy. The core of our approach involves freezing the MDETR backbone and training only the Universal Projection module (UP), which bridges vision and language representations. A learnable modality token parameter allows the UP to seamlessly switch between modalities. Evaluations on tasks like phrase grounding, referring expression comprehension, and segmentation show that LightMDETR not only reduces computational costs but also outperforms several state-of-the-art methods in terms of accuracy.
Problem

Research questions and friction points this paper is trying to address.

Reducing high training costs of open-vocabulary detection models
Selecting optimal layers for adaptation while balancing efficiency
Achieving parameter-efficient tuning without sacrificing detection accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Freezes pretrained backbones to reduce trainable parameters
Introduces Universal Projection module with learnable modality token
Enables unified vision-language adaptation at minimal cost
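The idea behind the Universal Projection module can be sketched as a single shared projection whose behavior is switched by a learnable per-modality token added to the input features. The sketch below is a minimal numpy illustration under assumed names and dimensions; the paper's actual architecture (layer count, activation, token placement) may differ.

```python
import numpy as np

class UniversalProjection:
    """Sketch: one shared projection serving both vision and language.

    All names and dimensions here are assumptions for illustration,
    not the paper's exact design.
    """

    def __init__(self, dim_in, dim_out, seed=0):
        rng = np.random.default_rng(seed)
        # Single weight matrix shared across modalities (the only
        # "trainable" part; the backbone stays frozen).
        self.W = rng.standard_normal((dim_in, dim_out)) * 0.02
        # One learnable token per modality; adding it tells the shared
        # projection which modality it is currently processing.
        self.modality_tokens = {
            "vision": rng.standard_normal(dim_in) * 0.02,
            "language": rng.standard_normal(dim_in) * 0.02,
        }

    def __call__(self, features, modality):
        # Shift the features by the modality token, then project.
        return (features + self.modality_tokens[modality]) @ self.W

up = UniversalProjection(dim_in=256, dim_out=256)
img_feats = np.zeros((10, 256))  # placeholder frozen vision-backbone outputs
txt_feats = np.zeros((5, 256))   # placeholder frozen text-encoder outputs
joint = np.concatenate([up(img_feats, "vision"), up(txt_feats, "language")])
print(joint.shape)  # (15, 256)
```

Because only `W` and the two modality tokens would receive gradients, the trainable parameter count is decoupled from the size of the frozen backbone, which is the source of the reported parameter savings.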
Bilal Faye
Binta Sow
Hanane Azzag
M. Lebbah