🤖 AI Summary
To address the lack of a unified single-stage framework, insufficient cross-modal fusion, imbalanced modality weighting, and poor low-light robustness in RGB-T multispectral object detection, this paper proposes the first unified single-stage detection framework tailored for RGB-T imagery. It introduces a P3 mid-level feature-fusion strategy and Multispectral Controllable Fine-tuning (MCF), yielding six transferable fusion patterns that dynamically reassess modality importance and optimize modality weights. Built on the YOLOv11 backbone, the framework supports feature-level multimodal fusion, controllable parameter adaptation, and cross-model generalization, remaining compatible with YOLOv3–v12 and RT-DETR. Extensive experiments on LLVIP and FLIR demonstrate state-of-the-art performance: on FLIR, the method improves YOLOv11 models' mAP by 3.41%–5.65%, reaching a maximum of 47.61%. The source code is publicly available.
📝 Abstract
Multispectral object detection, which integrates information from multiple spectral bands, can enhance detection accuracy and environmental adaptability, holding great application potential across various fields. Although existing methods have made progress on cross-modal interaction, low-light conditions, and lightweight model design, challenges remain, such as the lack of a unified single-stage framework, difficulty in balancing performance against fusion-strategy complexity, and unreasonable modality weight allocation. To address these issues, we present YOLOv11-RGBT, a comprehensive multimodal object detection framework built on YOLOv11. We design six multispectral fusion modes and successfully apply them to models from YOLOv3 to YOLOv12 as well as RT-DETR. After reevaluating the relative importance of the two modalities, we propose a P3 mid-level fusion strategy and a multispectral controllable fine-tuning (MCF) strategy for multispectral models. These improvements optimize feature fusion, reduce redundancy and modality mismatch, and boost overall model performance. Experiments show that our framework excels on three major open-source multispectral object detection datasets, including LLVIP and FLIR. In particular, the multispectral controllable fine-tuning strategy markedly enhances model adaptability and robustness: on the FLIR dataset, it consistently improves YOLOv11 models' mAP by 3.41%–5.65%, reaching a maximum of 47.61%, verifying the effectiveness of the framework and its strategies. The code is available at: https://github.com/wandahangFY/YOLOv11-RGBT.
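To make the P3 mid-fusion idea concrete, the sketch below shows one plausible reading of it: each modality's backbone produces a P3 feature map (stride 8, e.g. 80×80 for a 640×640 input), and the two maps are reweighted per modality and stacked channel-wise before the shared detection head. This is an illustrative NumPy sketch, not the paper's implementation; the function name `p3_mid_fusion`, the fixed weights `w_rgb`/`w_ir` (the paper learns or tunes modality weights), and the channel count are all assumptions.

```python
import numpy as np

def p3_mid_fusion(rgb_feat, ir_feat, w_rgb=0.5, w_ir=0.5):
    """Illustrative mid-level fusion of RGB and thermal P3 feature maps.

    rgb_feat, ir_feat: (C, H, W) arrays from each modality's backbone.
    w_rgb, w_ir: scalar modality weights (hypothetical stand-ins for the
    controllable weighting the framework optimizes).
    Returns a (2C, H, W) map: weighted features concatenated channel-wise.
    """
    assert rgb_feat.shape == ir_feat.shape, "modality maps must align"
    return np.concatenate([w_rgb * rgb_feat, w_ir * ir_feat], axis=0)

# Toy P3-sized features: stride-8 map of a 640x640 input -> 80x80.
rng = np.random.default_rng(0)
rgb = rng.random((256, 80, 80), dtype=np.float32)
ir = rng.random((256, 80, 80), dtype=np.float32)

# Emphasize thermal less than RGB, e.g. for a well-lit daytime scene.
fused = p3_mid_fusion(rgb, ir, w_rgb=0.7, w_ir=0.3)
print(fused.shape)  # (512, 80, 80)
```

In a real detector the concatenated map would typically pass through a 1×1 convolution to restore the original channel count before the neck; here the point is only that fusion happens once, at the mid-level P3 scale, with explicit per-modality weights.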