🤖 AI Summary
To address the limited robustness of RGB-only semantic segmentation under low-light and occlusion conditions, this paper proposes SGFNet, a spectral-aware global fusion network for multimodal segmentation. The method explicitly decouples RGB and thermal infrared features from a spectral perspective, separating high-frequency components (e.g., edges and textures) from low-frequency contextual information, and models their cross-modal interactions. A high-frequency-guided global attention mechanism is further introduced to jointly enhance structural details and semantic context. SGFNet is end-to-end trainable and achieves state-of-the-art performance on the MFNet and PST900 benchmarks, with notable gains in segmentation accuracy and robustness under challenging environmental conditions. Key contributions include: (i) a spectral decomposition framework for multimodal feature disentanglement in semantic segmentation; (ii) explicit high-frequency interaction modeling across modalities; and (iii) a global attention fusion strategy that jointly optimizes contextual coherence and fine-grained detail preservation.
📝 Abstract
Semantic segmentation relying solely on RGB data often struggles in challenging conditions such as low illumination and obscured views, limiting its reliability in critical applications like autonomous driving. To address this, integrating thermal radiation data with RGB images has been shown to enhance performance and robustness. However, effectively reconciling the modality discrepancies and fusing the RGB and thermal features remains a well-known challenge. In this work, we address this challenge from a novel spectral perspective. We observe that the multi-modal features can be categorized into two spectral components: low-frequency features that provide broad scene context, including color variations and smooth areas, and high-frequency features that capture modality-specific details such as edges and textures. Inspired by this, we propose the Spectral-aware Global Fusion Network (SGFNet) to effectively enhance and fuse the multi-modal features by explicitly modeling the interactions between the high-frequency, modality-specific features. Our experimental results demonstrate that SGFNet outperforms the state-of-the-art methods on the MFNet and PST900 datasets.
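The low-/high-frequency split the abstract describes can be illustrated with a minimal sketch: a low-pass filter (here a simple box blur, an illustrative assumption, not SGFNet's actual operator) extracts the smooth contextual component, and the residual is the high-frequency detail component. All function names below are hypothetical.

```python
# Hypothetical sketch of splitting a feature map into a low-frequency
# (context) and a high-frequency (edge/texture) component.
# The k x k box blur stands in for whatever low-pass operation the paper uses.

def low_pass(feat, k=3):
    """Box-blur each position over a k x k neighborhood (clipped at borders)."""
    h, w = len(feat), len(feat[0])
    r = k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            vals = [feat[ii][jj]
                    for ii in range(max(0, i - r), min(h, i + r + 1))
                    for jj in range(max(0, j - r), min(w, j + r + 1))]
            out[i][j] = sum(vals) / len(vals)
    return out

def spectral_split(feat):
    """Return (low-frequency context, high-frequency residual) components."""
    low = low_pass(feat)
    high = [[feat[i][j] - low[i][j] for j in range(len(feat[0]))]
            for i in range(len(feat))]
    return low, high

# A vertical step edge: the high-frequency residual is zero in smooth
# regions and nonzero only near the discontinuity.
rgb_feat = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
low, high = spectral_split(rgb_feat)
```

In a multimodal setting, one such split per modality would let a fusion module treat the two components differently, e.g., exchanging high-frequency detail across modalities while merging low-frequency context globally, which is the intuition behind SGFNet's design.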