Bridging the RGB-IR Gap: Consensus and Discrepancy Modeling for Text-Guided Multispectral Detection

📅 2026-04-13
📈 Citations: 0 (influential: 0)

🤖 AI Summary
Existing text-guided methods use text mainly as an auxiliary enhancement signal rather than as a means of bridging the granularity asymmetry between the RGB and infrared (IR) modalities, and attention-based fusion tends to overlook discriminative cross-modal discrepancies. To address these limitations, this work proposes a semantic bridge fusion framework that uses text as a shared semantic anchor to align the responses of the two modalities under a unified category condition. The method introduces bi-support modeling, which separates RGB-IR interaction evidence into consensus support and discrepancy support and feeds both into fusion through dynamic recalibration as a structured inductive bias. A bidirectional semantic alignment module adds closed-loop vision-text guidance. On multispectral object detection benchmarks, the authors report superior detection performance.
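
To make the "shared semantic anchor" idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes that RGB and IR features have already been projected into the same embedding space as a category text embedding, and it scores every spatial location of each modality against that one embedding using cosine similarity. The function names `cosine` and `text_anchored_responses`, the toy vectors, and the choice of cosine similarity are all illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def text_anchored_responses(features, text_embedding):
    """Score each location's feature vector against one category text embedding."""
    return [cosine(f, text_embedding) for f in features]


# Hypothetical toy data: three spatial locations, 2-dimensional features,
# and a single text embedding standing in for one category prompt.
text_embedding = [1.0, 0.0]
rgb_features = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]
ir_features = [[0.5, 0.0], [4.0, 0.0], [0.0, 0.0]]

# Both modalities are scored against the same anchor, so the two response
# lists live on a common scale and can be compared location by location.
rgb_response = text_anchored_responses(rgb_features, text_embedding)
ir_response = text_anchored_responses(ir_features, text_embedding)
```

Because both response lists are measured against the same text embedding, a location where the two modalities disagree (the second location in this toy data) becomes directly visible, which is the kind of evidence the summary describes as consensus and divergence.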

📝 Abstract
Text-guided multispectral object detection uses text semantics to guide semantic-aware cross-modal interaction between RGB and IR for more robust perception. However, notable limitations remain: (1) existing methods often use text only as an auxiliary semantic enhancement signal, without exploiting its guiding role to bridge the inherent granularity asymmetry between RGB and IR; and (2) conventional data-driven attention-based fusion tends to emphasize stable consensus while overlooking potentially valuable cross-modal discrepancies. To address these issues, we propose a semantic bridge fusion framework with bi-support modeling for multispectral object detection. Specifically, text is used as a shared semantic bridge to align RGB and IR responses under a unified category condition, while the recalibrated thermal semantic prior is projected onto the RGB branch for semantic-level mapping fusion. We further formulate RGB-IR interaction evidence into the regular consensus support and the complementary discrepancy support that contains potentially discriminative cues, and introduce them into fusion via dynamic recalibration as a structured inductive bias. In addition, we design a bidirectional semantic alignment module for closed-loop vision-text guidance enhancement. Extensive experiments demonstrate the effectiveness of the proposed fusion framework and its superior detection performance on multispectral benchmarks. Code is available at https://github.com/zhenwang5372/Bridging-RGB-IR-Gap.
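
The abstract does not give formulas for the two supports or for the recalibration step, so the following is only a sketch under stated assumptions, not the paper's method. It takes per-location RGB and IR responses in [0, 1], uses the element-wise minimum as a stand-in for consensus support and the absolute difference as a stand-in for discrepancy support, and rescales the averaged response by a gate built from both. The names `bi_support` and `recalibrate`, the weights `alpha` and `beta`, and the gate form are hypothetical.

```python
def bi_support(rgb_response, ir_response):
    """Split paired responses into consensus and discrepancy supports.

    Assumption: consensus is what both modalities agree on (element-wise
    minimum); discrepancy is how far they disagree (absolute difference).
    """
    consensus = [min(r, t) for r, t in zip(rgb_response, ir_response)]
    discrepancy = [abs(r - t) for r, t in zip(rgb_response, ir_response)]
    return consensus, discrepancy


def recalibrate(rgb_response, ir_response, alpha=1.0, beta=0.5):
    """Rescale the averaged response with a gate built from both supports.

    Assumption: gate = 1 + alpha * consensus + beta * discrepancy, so
    locations with agreement or with a strong one-sided cue are amplified.
    """
    consensus, discrepancy = bi_support(rgb_response, ir_response)
    fused = []
    for r, t, c, d in zip(rgb_response, ir_response, consensus, discrepancy):
        base = 0.5 * (r + t)
        gate = 1.0 + alpha * c + beta * d
        fused.append(base * gate)
    return fused


# Hypothetical responses at three locations:
# both modalities confident, only IR confident, neither confident.
rgb_response = [0.8, 0.2, 0.1]
ir_response = [0.6, 0.9, 0.1]

consensus, discrepancy = bi_support(rgb_response, ir_response)
fused = recalibrate(rgb_response, ir_response)
```

In this toy setup the second location has low consensus but high discrepancy, so a fusion rule driven by consensus alone would suppress it, while the discrepancy term keeps the IR-only cue in play. That is the behavior the abstract attributes to the complementary discrepancy support.
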
Problem

Research questions and friction points this paper is trying to address.

multispectral object detection
RGB-IR gap
text-guided fusion
cross-modal discrepancy
semantic alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

text-guided fusion
multispectral object detection
consensus and discrepancy modeling
semantic bridge
cross-modal alignment
Authors

Jiaqi Wu (Department of Automation, Tsinghua University)
Zhen Wang (School of Artificial Intelligence, China University of Mining and Technology - Beijing)
Enhao Huang (State Key Laboratory of Blockchain and Data Security, Zhejiang University)
Kangqing Shen (Department of Automation, Tsinghua University)
Yulin Wang (Shanghai Jiao Tong University)
Yang Yue (Department of Automation, Tsinghua University)
Yifan Pu (Tsinghua University)
Gao Huang (Department of Automation, Tsinghua University)