AI Summary
To address the challenge of unified modeling for multi-source heterogeneous remote sensing data, this paper introduces the first vision-language foundation model that employs natural language as a unifying modality for cross-modal semantic alignment and joint understanding. We propose a Modality-aware Knowledge Agglomeration (MaKA) module and a progressive multimodal weight merging strategy, enabling flexible support for remote sensing inputs with arbitrary numbers of spectral bands, zero-shot cross-modal understanding, and fine-grained visual parsing. Leveraging our newly constructed six-modality remote sensing image-text dataset, GeoLangBind-2M (2 million samples), the model combines contrastive learning, cross-modal alignment, and parameter-efficient knowledge agglomeration. It achieves state-of-the-art performance across 23 remote sensing benchmarks, significantly advancing zero-shot classification, cross-modal retrieval, and land-cover parsing, thereby establishing a highly generalizable foundation model for environmental monitoring and related applications.
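To make the language-anchored alignment concrete, the sketch below illustrates one plausible reading of the approach: an image encoder whose input stem is sliced to accept an arbitrary number of spectral bands, trained with a CLIP-style symmetric contrastive loss against frozen text embeddings. The class name `SpectralAdaptiveEncoder`, the band-slicing trick, and all hyperparameters are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of language-anchored contrastive alignment with
# band-flexible image inputs; not the published GeoLangBind code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpectralAdaptiveEncoder(nn.Module):
    """Image encoder whose first convolution accepts a variable number of bands."""

    def __init__(self, embed_dim: int = 512, max_bands: int = 12):
        super().__init__()
        # One shared stem; its per-band weights are sliced to match the input.
        self.stem = nn.Conv2d(max_bands, 64, kernel_size=7, stride=4, padding=3)
        self.backbone = nn.Sequential(
            nn.GELU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(64, embed_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        bands = x.shape[1]
        # Keep only the stem weights for the bands actually present.
        w = self.stem.weight[:, :bands]
        feats = F.conv2d(x, w, self.stem.bias, stride=4, padding=3)
        return F.normalize(self.backbone(feats), dim=-1)


def contrastive_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss pulling each image toward its paired caption."""
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(len(logits), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


# Toy usage: a batch of 10-band images paired with frozen language embeddings.
encoder = SpectralAdaptiveEncoder()
images = torch.randn(8, 10, 224, 224)
text_embeddings = F.normalize(torch.randn(8, 512), dim=-1)  # from a frozen text encoder
loss = contrastive_loss(encoder(images), text_embeddings)
```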
Abstract
Earth observation (EO) data, collected from diverse sensors with varying imaging principles, present significant challenges in creating unified analytical frameworks. We present GeoLangBind, a novel agglomerative vision-language foundation model that bridges the gap between heterogeneous EO data modalities using language as a unifying medium. Our approach aligns different EO data types into a shared language embedding space, enabling seamless integration and complementary feature learning from diverse sensor data. To achieve this, we construct a large-scale multimodal image-text dataset, GeoLangBind-2M, encompassing six data modalities. GeoLangBind leverages this dataset to develop a zero-shot foundation model capable of processing arbitrary numbers of EO data channels as input. Through our designed Modality-aware Knowledge Agglomeration (MaKA) module and progressive multimodal weight merging strategy, we create a powerful agglomerative foundation model that excels in both zero-shot vision-language comprehension and fine-grained visual understanding. Extensive evaluation across 23 datasets covering multiple tasks demonstrates GeoLangBind's superior performance and versatility in EO applications, offering a robust framework for various environmental monitoring and analysis tasks. The dataset and pretrained models will be publicly available.
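The abstract does not spell out how the progressive multimodal weight merging works; the sketch below shows one simple interpretation, in which modality-specific checkpoints sharing a common architecture are folded into a single model one at a time by linear interpolation. The merge order, coefficients, and function names are assumptions for illustration only, not the paper's published recipe.

```python
# Minimal sketch of progressive weight merging, assuming linear interpolation
# of per-modality checkpoints; illustrative only, not the GeoLangBind recipe.
from collections import OrderedDict
import torch


def merge_state_dicts(base, incoming, alpha: float):
    """Interpolate two checkpoints: (1 - alpha) * base + alpha * incoming."""
    return OrderedDict(
        (k, (1.0 - alpha) * base[k] + alpha * incoming[k]) for k in base
    )


def progressive_merge(checkpoints):
    """Fold modality-specific checkpoints into one model, one modality at a time.

    After the i-th step each of the first i checkpoints contributes 1/i of the
    weights, so every modality ends up with equal influence in the final model.
    """
    merged = checkpoints[0]
    for i, ckpt in enumerate(checkpoints[1:], start=2):
        merged = merge_state_dicts(merged, ckpt, alpha=1.0 / i)
    return merged


# Toy usage: three "modality experts" (e.g. optical, SAR, DEM) sharing one architecture.
experts = [OrderedDict(weight=torch.randn(4, 4)) for _ in range(3)]
fused = progressive_merge(experts)
```

With this scheme, merging happens entirely in weight space, so the fused encoder keeps the original architecture and inference cost while inheriting knowledge from each modality-specific model.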