🤖 AI Summary
Existing hyperspectral image (HSI) semantic segmentation methods, built upon RGB-optimized architectures, struggle to model joint spectral-spatial features, leading to suboptimal performance in complex scenes. To address this, we propose a lightweight adapter framework that leverages a frozen pre-trained vision transformer as the backbone. Our approach introduces a spectral transformer to capture long-range inter-channel spectral dependencies, a spectrum-aware spatial prior module to enhance local structural modeling, and a modality-aware interaction block for cross-modal feature alignment and injection. Crucially, the vision backbone remains frozen (no fine-tuning is required), yet HSI representation capability improves significantly. Evaluated on three autonomous-driving HSI benchmarks, our method achieves state-of-the-art performance, substantially outperforming both RGB-based baselines and existing HSI-specific approaches. Results demonstrate strong robustness and generalization in real-world driving scenarios.
📝 Abstract
Hyperspectral imaging (HSI) captures spatial information along with dense spectral measurements across numerous narrow wavelength bands. This rich spectral content has the potential to facilitate robust robotic perception, particularly in environments with complex material compositions, varying illumination, or other visually challenging conditions. However, current HSI semantic segmentation methods underperform due to their reliance on architectures and learning frameworks optimized for RGB inputs. In this work, we propose a novel hyperspectral adapter that leverages pretrained vision foundation models to effectively learn from hyperspectral data. Our architecture incorporates a spectral transformer and a spectrum-aware spatial prior module to extract rich spatial-spectral features. Additionally, we introduce a modality-aware interaction block that facilitates effective integration of hyperspectral representations and frozen vision transformer features through dedicated extraction and injection mechanisms. Extensive evaluations on three benchmark autonomous driving datasets demonstrate that our architecture achieves state-of-the-art semantic segmentation performance while directly using HSI inputs, outperforming both vision-based and hyperspectral segmentation methods. We make the code available at https://hyperspectraladapter.cs.uni-freiburg.de.
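To make the overall idea concrete, the sketch below illustrates the adapter pattern described above in PyTorch: a spectral transformer attends over the band dimension, and an interaction block injects the resulting spectral features into tokens from a frozen backbone via cross-attention. This is a minimal, illustrative sketch, not the authors' implementation: all class names, dimensions, and the stand-in backbone are placeholders, and the spectrum-aware spatial prior module is omitted for brevity.

```python
import torch
import torch.nn as nn

class SpectralTransformer(nn.Module):
    """Treats each spectral band as a token and applies self-attention,
    modelling long-range inter-band dependencies (simplified stand-in)."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.embed = nn.Linear(1, dim)   # embed each band's pooled response
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, hsi):                                   # (B, bands, H, W)
        tokens = self.embed(hsi.mean(dim=(2, 3)).unsqueeze(-1))  # (B, bands, dim)
        out, _ = self.attn(tokens, tokens, tokens)
        return self.norm(tokens + out)                        # (B, bands, dim)

class InteractionBlock(nn.Module):
    """Cross-attention from backbone tokens to spectral tokens,
    followed by a residual injection (extraction/injection, simplified)."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, vit_tokens, spectral_tokens):
        injected, _ = self.cross(vit_tokens, spectral_tokens, spectral_tokens)
        return vit_tokens + injected                          # residual update

class HSIAdapter(nn.Module):
    """Frozen backbone plus trainable spectral branch and interaction block."""
    def __init__(self, backbone, dim=64):
        super().__init__()
        self.backbone = backbone.requires_grad_(False)        # kept frozen
        self.spectral = SpectralTransformer(dim)
        self.interact = InteractionBlock(dim)

    def forward(self, hsi, vit_tokens):
        return self.interact(self.backbone(vit_tokens), self.spectral(hsi))

# Stand-in "frozen backbone": a single transformer encoder layer.
backbone = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
model = HSIAdapter(backbone)
hsi = torch.randn(2, 30, 16, 16)      # 30 spectral bands
vit_tokens = torch.randn(2, 196, 64)  # 14x14 patch tokens
out = model(hsi, vit_tokens)          # (2, 196, 64)
```

Only the spectral branch and the interaction block carry gradients, which is the key property the paper highlights: the pretrained vision backbone is reused as-is.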