SLGNet: Synergizing Structural Priors and Language-Guided Modulation for Multimodal Object Detection

📅 2026-01-05
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
This work addresses the performance limitations of existing RGB-infrared multimodal object detection methods in complex scenarios—such as high-contrast or nighttime conditions—stemming from cross-modal structural misalignment and static fusion mechanisms lacking environmental awareness. To overcome these challenges, we propose SLGNet, a parameter-efficient framework built upon a frozen Vision Transformer that jointly incorporates hierarchical structural priors and language-guided modulation driven by a vision-language model. Through a structure-aware adapter and dynamic feature recalibration, SLGNet enables environment-adaptive fusion without extensive retraining. Our method achieves state-of-the-art results on LLVIP, FLIR, KAIST, and DroneVehicle benchmarks, attaining an mAP of 66.1% on LLVIP while reducing trainable parameters by 87% compared to full fine-tuning.

📝 Abstract
Multimodal object detection leveraging RGB and Infrared (IR) images is pivotal for robust perception in all-weather scenarios. While recent adapter-based approaches efficiently transfer RGB-pretrained foundation models to this task, they often prioritize model efficiency at the expense of cross-modal structural consistency. Consequently, critical structural cues are frequently lost when significant domain gaps arise, such as in high-contrast or nighttime environments. Moreover, conventional static multimodal fusion mechanisms typically lack environmental awareness, resulting in suboptimal adaptation and constrained detection performance under complex, dynamic scene variations. To address these limitations, we propose SLGNet, a parameter-efficient framework that synergizes hierarchical structural priors and language-guided modulation within a frozen Vision Transformer (ViT)-based foundation model. Specifically, we design a Structure-Aware Adapter to extract hierarchical structural representations from both modalities and dynamically inject them into the ViT to compensate for structural degradation inherent in ViT-based backbones. Furthermore, we propose a Language-Guided Modulation module that exploits VLM-driven structured captions to dynamically recalibrate visual features, thereby endowing the model with robust environmental awareness. Extensive experiments on the LLVIP, FLIR, KAIST, and DroneVehicle datasets demonstrate that SLGNet establishes new state-of-the-art performance. Notably, on the LLVIP benchmark, our method achieves an mAP of 66.1, while reducing trainable parameters by approximately 87% compared to traditional full fine-tuning. This confirms SLGNet as a robust and efficient solution for multimodal perception.
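The two modules named in the abstract follow well-known parameter-efficient patterns: a bottleneck adapter that injects auxiliary features into a frozen backbone, and FiLM-style scale-and-shift modulation conditioned on a text embedding. The sketch below illustrates those generic patterns only; the module names, dimensions, and wiring are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class StructureAwareAdapter(nn.Module):
    """Hypothetical bottleneck adapter: residually injects structural
    priors into the token stream of a frozen backbone."""
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, tokens, structural_prior):
        # Fuse tokens with the structural prior, then project through
        # the bottleneck and add back as a residual.
        return tokens + self.up(self.act(self.down(tokens + structural_prior)))

class LanguageGuidedModulation(nn.Module):
    """FiLM-style recalibration: a caption embedding predicts
    per-channel scale (gamma) and shift (beta) for visual features."""
    def __init__(self, dim, text_dim):
        super().__init__()
        self.to_scale = nn.Linear(text_dim, dim)
        self.to_shift = nn.Linear(text_dim, dim)

    def forward(self, feats, text_emb):
        gamma = self.to_scale(text_emb).unsqueeze(1)  # (B, 1, C)
        beta = self.to_shift(text_emb).unsqueeze(1)   # (B, 1, C)
        return feats * (1 + gamma) + beta

# A stand-in for one frozen ViT block; only the adapter and the
# modulation module would receive gradients during adaptation.
backbone = nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)
for p in backbone.parameters():
    p.requires_grad = False

adapter = StructureAwareAdapter(dim=256)
lgm = LanguageGuidedModulation(dim=256, text_dim=512)

tokens = torch.randn(2, 196, 256)   # fused RGB-IR patch tokens (toy data)
prior = torch.randn(2, 196, 256)    # hierarchical structural prior (toy data)
caption = torch.randn(2, 512)       # VLM caption embedding (toy data)

x = backbone(tokens)
x = adapter(x, prior)
x = lgm(x, caption)
print(x.shape)  # torch.Size([2, 196, 256])
```

Since the backbone is frozen, the trainable parameter count is just the adapter and modulation layers, which is how this family of methods achieves large reductions relative to full fine-tuning.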
Problem

Research questions and friction points this paper is trying to address.

multimodal object detection
cross-modal structural consistency
environmental awareness
domain gap
static fusion mechanism
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal object detection
structure-aware adapter
language-guided modulation
Vision Transformer
parameter-efficient adaptation
Xiantai Xiang
Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China, also with the Key Laboratory of Target Cognition and Application Technology, Chinese Academy of Sciences, Beijing 100190, China, and also with the School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 101408, China
Guangyao Zhou
Senior Research Scientist, Google DeepMind
Zixiao Wen
Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China, also with the Key Laboratory of Target Cognition and Application Technology, Chinese Academy of Sciences, Beijing 100190, China, and also with the School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 101408, China
Wenshuai Li
Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China, also with the Key Laboratory of Target Cognition and Application Technology, Chinese Academy of Sciences, Beijing 100190, China, and also with the School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 101408, China
Ben Niu
Dalian University of Technology; Shandong Normal University; Bohai University
Switched systems; stochastic systems; adaptive control; neural networks; fuzzy control
Feng Wang
Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China, also with the Key Laboratory of Target Cognition and Application Technology, Chinese Academy of Sciences, Beijing 100190, China, and also with the School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 101408, China
Lijia Huang
Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China, also with the Key Laboratory of Target Cognition and Application Technology, Chinese Academy of Sciences, Beijing 100190, China, and also with the School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 101408, China
Qiantong Wang
Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China, also with the Key Laboratory of Target Cognition and Application Technology, Chinese Academy of Sciences, Beijing 100190, China, and also with the School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 101408, China
Yuhan Liu
Aerospace Information Research Institute (AIR), Chinese Academy of Sciences
Remote sensing; Image processing
Zongxu Pan
Xi'an Jiaotong University
Target detection and recognition in remote sensing images
Yuxin Hu
Stanford University
Medical imaging; MRI; Machine learning