🤖 AI Summary
Multi-spectral remote sensing imagery poses significant challenges for fine-grained scene understanding in complex terrains (e.g., coastal zones, snow/cloud-covered regions), where conventional RGB-based methods suffer from severe information degradation. To address this, we propose Spectral LLaVA—a novel language-guided framework for multi-spectral scene understanding. It freezes the pre-trained SpectralGPT vision backbone and introduces only a lightweight linear projection layer for efficient vision–language alignment. The framework jointly performs multi-spectral encoding, scene classification, and natural language description generation, and is fine-tuned on BigEarthNet v2. Experimental results demonstrate substantial improvements over RGB-only baselines: Spectral LLaVA achieves superior performance in both fine-grained land-cover classification and descriptive text generation, particularly excelling in RGB-degraded scenarios by producing more accurate, semantically richer, and physically grounded captions. This work establishes a language-guided paradigm for multi-spectral remote sensing interpretation, enabling robust, modality-aware semantic reasoning beyond RGB limitations.
📝 Abstract
Scene understanding in remote sensing often faces challenges in generating accurate representations for complex environments such as diverse land-use areas or coastal regions, which may also be obscured by snow, clouds, or haze. To address this, we present a vision-language framework named Spectral LLaVA, which integrates multispectral data with vision-language alignment techniques to enhance scene representation and description. Using the Sentinel-2-based BigEarthNet v2 dataset, we establish a baseline with RGB-only scene descriptions and then demonstrate substantial improvements from incorporating multispectral information. Our framework optimizes a lightweight linear projection layer for alignment while keeping the vision backbone of SpectralGPT frozen. Our experiments encompass scene classification via linear probing and language modeling that jointly performs scene classification and description generation. Our results highlight Spectral LLaVA's ability to produce detailed and accurate descriptions, particularly where RGB data alone proves inadequate, while also improving classification performance by refining SpectralGPT features into semantically meaningful representations.
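The alignment recipe described above (frozen multispectral backbone, trainable linear projection into the language model's token space) can be sketched in PyTorch. This is a minimal illustration, not the paper's implementation: `SpectralBackbone` is a hypothetical stand-in for SpectralGPT, and the band count (12), feature width (768), and LLM hidden size (4096) are assumed values for illustration.

```python
import torch
import torch.nn as nn

class SpectralBackbone(nn.Module):
    """Hypothetical stand-in for the SpectralGPT vision encoder:
    12 spectral bands in, 768-dim patch tokens out (assumed shapes)."""
    def __init__(self, out_dim=768):
        super().__init__()
        # Patchify with a strided conv, as ViT-style encoders do.
        self.patch_embed = nn.Conv2d(12, out_dim, kernel_size=16, stride=16)

    def forward(self, x):
        feats = self.patch_embed(x)          # (B, out_dim, H/16, W/16)
        return feats.flatten(2).transpose(1, 2)  # (B, num_patches, out_dim)

backbone = SpectralBackbone()
for p in backbone.parameters():              # vision backbone stays frozen
    p.requires_grad = False
backbone.eval()

# The only trainable alignment component: a linear projection mapping
# patch features into the language model's embedding space (4096 assumed).
projector = nn.Linear(768, 4096)

x = torch.randn(2, 12, 128, 128)             # batch of 12-band image patches
with torch.no_grad():
    feats = backbone(x)                      # (2, 64, 768)
tokens = projector(feats)                    # visual tokens for the LLM
print(tokens.shape)                          # torch.Size([2, 64, 4096])
```

Only `projector` receives gradients during fine-tuning, which keeps the alignment stage cheap; the same frozen features can also be fed to a linear classifier head for the linear-probing classification experiments.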