🤖 AI Summary
Problem: Existing multimodal image fusion methods under adverse weather conditions suffer from visual information loss and fail to effectively leverage textual cues for enhanced semantic perception. Method: This paper proposes a text-aware vision-language collaborative fusion framework. It introduces hierarchical text supervision by employing BLIP to generate global scene captions and ChatGPT to produce local fine-grained descriptions. A unified, weight-sharing network architecture is designed, embedding text constraints into both the feature extraction and reconstruction stages to enable semantic-guided degradation modeling and detail restoration. Contribution/Results: The method significantly improves fusion quality across diverse weather degradations (e.g., fog, rain, snow) and achieves state-of-the-art performance on downstream tasks such as object detection. Experimental results validate that cross-modal semantic alignment plays a critical role in visual restoration under challenging environmental conditions.
📝 Abstract
Multi-modality image fusion (MMIF) in adverse weather aims to address the loss of visual information caused by weather-related degradations, providing clearer scene representations. Although some studies have attempted to incorporate textual information to improve semantic perception, they often lack effective categorization and thorough analysis of textual content. In response, we propose AWM-Fuse, a novel fusion method for adverse weather conditions, designed to handle multiple degradations through global and local text perception within a unified, shared-weight architecture. In particular, a global feature perception module leverages BLIP-produced captions to extract overall scene features and identify primary degradation types, thus promoting generalization across various adverse weather conditions. Complementing this, the local module employs detailed scene descriptions produced by ChatGPT to concentrate on specific degradation effects through concrete textual cues, thereby capturing finer details. Furthermore, textual descriptions are used to constrain the generation of fusion images, effectively steering the network learning process toward better alignment with real semantic labels and thereby promoting the learning of more meaningful visual features. Extensive experiments demonstrate that AWM-Fuse outperforms current state-of-the-art methods in complex weather conditions and downstream tasks. Our code is available at https://github.com/Feecuin/AWM-Fuse.
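To make the described pipeline concrete, the following is a minimal NumPy sketch of the general idea, not the authors' implementation: a shared-weight encoder extracts features from both modalities, the features are fused, and embeddings of a global caption and a local description modulate the fused features (FiLM-style scale-and-shift) before reconstruction. All function names, dimensions, and the modulation scheme are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(img, W):
    # Shared-weight feature extractor: the SAME projection W is applied
    # to both the visible and infrared inputs (the "unified" architecture).
    return np.tanh(img @ W)

def text_modulate(feat, text_emb, Wg, Wb):
    # Hypothetical text conditioning: derive a per-channel scale (gamma)
    # and shift (beta) from a text embedding and apply them to the features.
    gamma = 1.0 + text_emb @ Wg
    beta = text_emb @ Wb
    return gamma * feat + beta

C, D, T = 8, 16, 4                                  # input dim, feature dim, text-embedding dim
W_shared = rng.normal(size=(C, D)) * 0.1            # shared encoder weights
Wg = rng.normal(size=(T, D)) * 0.1                  # text-to-scale projection
Wb = rng.normal(size=(T, D)) * 0.1                  # text-to-shift projection

vis, ir = rng.normal(size=C), rng.normal(size=C)    # stand-ins for visible / infrared inputs
global_txt = rng.normal(size=T)                     # stand-in for a BLIP caption embedding
local_txt = rng.normal(size=T)                      # stand-in for a ChatGPT description embedding

f_vis, f_ir = encode(vis, W_shared), encode(ir, W_shared)
fused = 0.5 * (f_vis + f_ir)                        # naive fusion, for illustration only
fused = text_modulate(fused, global_txt, Wg, Wb)    # global scene / degradation-type guidance
fused = text_modulate(fused, local_txt, Wg, Wb)     # local fine-grained degradation cues

print(fused.shape)  # (16,)
```

The two-stage modulation mirrors the hierarchy described in the abstract: the global caption steers coarse degradation handling, while the local description refines details, with the same fusion backbone serving all weather types.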