Rethinking Multi-modal Object Detection from the Perspective of Mono-Modality Feature Learning

📅 2025-03-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address “fusion degradation”—a critical issue in multimodal object detection (MMOD) arising from weakened unimodal representation capacity—this paper proposes the M²D-LIF framework, which rethinks MMOD from a unimodal learning perspective. We introduce the first linear-probe-based quantitative evaluation method to assess unimodal representation capability. To strengthen unimodal feature learning, we design Mono-Modality Distillation (M²D), a novel distillation mechanism that enhances modality-specific feature discriminability. Furthermore, we propose Local Illumination-aware Fusion (LIF), a lightweight and robust RGB–IR feature fusion strategy that adaptively integrates complementary cues under varying illumination conditions. Evaluated on three mainstream MMOD benchmarks, M²D-LIF significantly mitigates fusion degradation and achieves comprehensive performance gains over existing state-of-the-art methods, setting new benchmarks in both accuracy and efficiency.
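The exact probing protocol is not described above, but a linear-probe evaluation of mono-modality representation capability typically follows a standard recipe: take one modality's backbone from the trained multi-modal detector, freeze it, and train only a linear classifier on its pooled features. The sketch below (PyTorch) illustrates that recipe; `rgb_backbone`, `feat_dim`, `probe_classes`, and the data loader are placeholders, not names from the paper.

```python
# Minimal linear-probe sketch, assuming a frozen mono-modal backbone that returns
# a (B, C, H, W) feature map and a loader yielding (images, labels).
import torch
import torch.nn as nn

def linear_probe_accuracy(rgb_backbone, loader, feat_dim, probe_classes,
                          epochs=5, lr=1e-3, device="cuda"):
    """Train a linear probe on frozen mono-modal features and report accuracy."""
    rgb_backbone.eval().to(device)
    for p in rgb_backbone.parameters():
        p.requires_grad_(False)                      # freeze the backbone

    probe = nn.Linear(feat_dim, probe_classes).to(device)
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    ce = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            with torch.no_grad():
                feats = rgb_backbone(images)          # (B, C, H, W) feature map
                feats = feats.mean(dim=(2, 3))        # global average pooling -> (B, C)
            loss = ce(probe(feats), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()

    # Probe accuracy serves as a proxy for mono-modality representation capability.
    correct = total = 0
    with torch.no_grad():
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            logits = probe(rgb_backbone(images).mean(dim=(2, 3)))
            correct += (logits.argmax(1) == labels).sum().item()
            total += labels.numel()
    return correct / total
```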

📝 Abstract
Multi-Modal Object Detection (MMOD), due to its stronger adaptability to complex environments, has been widely applied in many applications. Extensive research is dedicated to RGB-IR object detection, primarily focusing on how to integrate complementary features from the RGB and IR modalities. However, these works neglect the mono-modality insufficient-learning problem, i.e., the decreased feature extraction capability of each modality during multi-modal joint learning. This leads to an unreasonable but prevalent phenomenon, Fusion Degradation, which hinders the performance improvement of MMOD models. Motivated by this, in this paper we introduce linear probing evaluation for multi-modal detectors and rethink the multi-modal object detection task from the mono-modality learning perspective. We therefore construct a novel framework called M$^2$D-LIF, which consists of the Mono-Modality Distillation (M$^2$D) method and the Local Illumination-aware Fusion (LIF) module. The M$^2$D-LIF framework facilitates sufficient mono-modality learning during multi-modal joint training and explores a lightweight yet effective feature fusion manner to achieve superior object detection performance. Extensive experiments conducted on three MMOD datasets demonstrate that our M$^2$D-LIF effectively mitigates the Fusion Degradation phenomenon and outperforms the previous SOTA detectors.
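The abstract does not spell out the distillation objective, but a mono-modality distillation term is commonly implemented as feature mimicking: the mono-modal branches of the multi-modal student are pulled toward frozen single-modality teachers during joint training. The snippet below is a hedged sketch of that form; the module and variable names are illustrative, not the paper's.

```python
# Hedged sketch of a mono-modality feature-distillation term (PyTorch).
import torch
import torch.nn.functional as F

def mono_modality_distill_loss(student_rgb_feat, student_ir_feat,
                               teacher_rgb_feat, teacher_ir_feat):
    """L2 feature-mimicking loss between student branches and frozen teachers."""
    loss_rgb = F.mse_loss(student_rgb_feat, teacher_rgb_feat.detach())
    loss_ir = F.mse_loss(student_ir_feat, teacher_ir_feat.detach())
    return loss_rgb + loss_ir

# Joint-training objective (lambda_d is a hypothetical weighting hyper-parameter):
# total_loss = detection_loss + lambda_d * mono_modality_distill_loss(...)
```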
Problem

Research questions and friction points this paper is trying to address.

Addresses insufficient learning in mono-modality feature extraction
Mitigates Fusion Degradation in multi-modal object detection
Proposes M$^2$D-LIF framework for improved RGB-IR object detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Linear probing evaluation for multi-modal detectors
Mono-Modality Distillation (M$^2$D) method
Local Illumination-aware Fusion (LIF) module (see the fusion sketch after this list)
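The LIF module's exact design is not described on this page; the sketch below assumes a lightweight variant in which a local illumination map derived from the RGB input gates the two modalities per spatial location, so well-lit regions lean on RGB features and dark regions lean on IR features. Class name, pooling size, and gating layer are assumptions for illustration.

```python
# Hedged sketch of illumination-aware RGB-IR feature fusion (PyTorch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class IlluminationAwareFusion(nn.Module):
    def __init__(self, pool_size=8):
        super().__init__()
        self.pool_size = pool_size
        # 1x1 conv turns the local brightness estimate into a per-location gate.
        self.gate = nn.Conv2d(1, 1, kernel_size=1)

    def forward(self, rgb_feat, ir_feat, rgb_image):
        # Local illumination estimate: grayscale brightness averaged over patches.
        gray = rgb_image.mean(dim=1, keepdim=True)                    # (B, 1, H, W)
        local_illum = F.avg_pool2d(gray, self.pool_size)              # coarse map
        local_illum = F.interpolate(local_illum, size=rgb_feat.shape[-2:],
                                    mode="bilinear", align_corners=False)
        w = torch.sigmoid(self.gate(local_illum))                     # RGB weight in [0, 1]
        # Blend modality features location by location.
        return w * rgb_feat + (1.0 - w) * ir_feat
```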
Authors

Tianyi Zhao
University of Virginia
Boyang Liu
Beihang University
Yanglei Gao
Beihang University
Yiming Sun
Southeast University
Maoxun Yuan
Beihang University
Xingxing Wei
Professor of Artificial Intelligence, Beihang University
Computer vision, Adversarial machine learning