YOLOv11-RGBT: Towards a Comprehensive Single-Stage Multispectral Object Detection Framework

📅 2025-06-17
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the lack of a unified single-stage framework, insufficient cross-modal fusion, imbalanced modality weighting, and poor low-light robustness in RGB-T multispectral object detection, this paper proposes the first unified single-stage detection framework tailored for RGB-T imagery. We introduce a novel P3 mid-level feature fusion strategy and Multispectral Controllable Fine-tuning (MCF), enabling six transferable fusion modes that dynamically reassess modality importance and optimize weights. Built upon the YOLOv11 backbone, our framework supports feature-level multimodal fusion, controllable parameter adaptation, and cross-model generalization, remaining compatible with YOLOv3–v12 and RT-DETR. Extensive experiments on LLVIP and FLIR demonstrate state-of-the-art performance: on FLIR, our method reaches a maximum mAP of 47.61%, an improvement of 3.41%–5.65% over the baseline YOLOv11 models. The source code is publicly available.

📝 Abstract
Multispectral object detection, which integrates information from multiple bands, can enhance detection accuracy and environmental adaptability, holding great application potential across various fields. Although existing methods have made progress in cross-modal interaction, low-light conditions, and model lightweighting, challenges remain, such as the lack of a unified single-stage framework, difficulty in balancing performance and fusion strategy, and unreasonable modality weight allocation. To address these, based on the YOLOv11 framework, we present YOLOv11-RGBT, a new comprehensive multimodal object detection framework. We designed six multispectral fusion modes and successfully applied them to models from YOLOv3 to YOLOv12 and to RT-DETR. After reevaluating the importance of the two modalities, we proposed a P3 mid-fusion strategy and a multispectral controllable fine-tuning (MCF) strategy for multispectral models. These improvements optimize feature fusion, reduce redundancy and mismatches, and boost overall model performance. Experiments show our framework excels on three major open-source multispectral object detection datasets, such as LLVIP and FLIR. In particular, the multispectral controllable fine-tuning strategy significantly enhanced model adaptability and robustness. On the FLIR dataset, it consistently improved the YOLOv11 models' mAP by 3.41%–5.65%, reaching a maximum of 47.61%, verifying the effectiveness of the framework and strategies. The code is available at: https://github.com/wandahangFY/YOLOv11-RGBT.
Problem

Research questions and friction points this paper is trying to address.

Lack of unified single-stage multispectral object detection framework
Difficulty balancing performance and fusion strategy in multispectral detection
Unreasonable modality weight allocation in existing multispectral models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Six multispectral fusion modes designed
P3 mid-fusion strategy proposed
Multispectral controllable fine-tuning strategy introduced
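To make the fusion idea concrete, here is a minimal NumPy sketch of one plausible mid-level (P3) fusion pattern: the RGB and thermal feature maps are scaled by softmax-normalized modality weights and then channel-concatenated before entering a shared neck. The function name, the weighting scheme, and the tensor shapes are illustrative assumptions; the paper's actual fusion modules and the other five modes differ in detail.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D weight vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def p3_weighted_fusion(feat_rgb, feat_ir, logits=(0.0, 0.0)):
    """Fuse two P3-level feature maps of shape (C, H, W).

    Hypothetical sketch: each modality is scaled by a learnable
    softmax-normalized weight (here, fixed logits stand in for
    learned parameters), then the maps are channel-concatenated.
    """
    w_rgb, w_ir = softmax(np.asarray(logits, dtype=np.float64))
    return np.concatenate([w_rgb * feat_rgb, w_ir * feat_ir], axis=0)

# Toy feature maps: 256 channels at an 80x80 P3 resolution.
rgb = np.ones((256, 80, 80))
ir = 2.0 * np.ones((256, 80, 80))
out = p3_weighted_fusion(rgb, ir, logits=(1.0, 0.0))
print(out.shape)  # (512, 80, 80)
```

Reweighting before concatenation lets the network down-weight a degraded modality (e.g., RGB at night) while the doubled channel count preserves both streams for the detection head, which is one way a framework could realize the "reassessed modality importance" described above.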
Dahang Wan
School of Instrument Science and Opto-electronics Engineering, Anhui Province Key Laboratory of Measuring Theory and Precision Instrument, Hefei University of Technology, Hefei 230009, China
Rongsheng Lu
School of Instrument Science and Opto-electronics Engineering, Anhui Province Key Laboratory of Measuring Theory and Precision Instrument, Hefei University of Technology, Hefei 230009, China
Yang Fang
School of Instrument Science and Opto-electronics Engineering, Anhui Province Key Laboratory of Measuring Theory and Precision Instrument, Hefei University of Technology, Hefei 230009, China
Xianli Lang
School of Instrument Science and Opto-electronics Engineering, Anhui Province Key Laboratory of Measuring Theory and Precision Instrument, Hefei University of Technology, Hefei 230009, China
Shuangbao Shu
School of Instrument Science and Opto-electronics Engineering, Anhui Province Key Laboratory of Measuring Theory and Precision Instrument, Hefei University of Technology, Hefei 230009, China
Jingjing Chen
Fudan University
Multimedia · Computer Vision · Machine Learning · Pattern Recognition
Siyuan Shen
School of Information Science and Technology, ShanghaiTech University
Computer Vision · Computational Photography
Ting Xu
School of Instrument Science and Opto-electronics Engineering, Anhui Province Key Laboratory of Measuring Theory and Precision Instrument, Hefei University of Technology, Hefei 230009, China
Zecong Ye
School of Information Engineering, Engineering University of PAP, Xi’an 710086, China