🤖 AI Summary
Existing urban forecasting models are task-specific, while prevailing geospatial foundation models support only a limited set of modalities and fuse them ineffectively. Method: We propose the first multimodal geospatial foundation model for urban representation learning, unifying heterogeneous geodata, including street-view images, remote sensing imagery, vector maps, and points of interest (POIs). We design a stochastic multimodal fusion mechanism within a joint architecture of modality-specific encoders and a Transformer-based fusion module, trained with contrastive learning for robust representations. The model accepts arbitrary subsets of input modalities, ensuring strong generalization under data scarcity and in cross-regional deployment. Contribution/Results: Evaluated on 41 housing-price and public-health prediction tasks across 56 cities, the model consistently outperforms state-of-the-art GeoAI baselines, improving both location encoding and cross-domain transfer, thereby addressing the dual bottlenecks of modality coverage and fusion paradigms in geospatial foundation models.
📝 Abstract
Forecasting urban phenomena such as housing prices and public health indicators requires the effective integration of diverse geospatial data. Current methods primarily rely on task-specific models, while recent foundation models for spatial representations often support only limited modalities and lack multimodal fusion capabilities. To overcome these challenges, we present UrbanFusion, a Geo-Foundation Model (GeoFM) that features Stochastic Multimodal Fusion (SMF). The framework employs modality-specific encoders to process different types of inputs, including street-view imagery, remote sensing data, cartographic maps, and point-of-interest (POI) data. These multimodal inputs are integrated via a Transformer-based fusion module that learns unified representations. An extensive evaluation across 41 tasks in 56 cities worldwide demonstrates UrbanFusion's strong generalization and predictive performance compared to state-of-the-art GeoAI models. Specifically, it 1) outperforms prior foundation models on location encoding, 2) accepts multimodal input during inference, and 3) generalizes well to regions unseen during training. UrbanFusion can flexibly utilize any subset of available modalities for a given location during both pretraining and inference, enabling broad applicability across diverse data availability scenarios. All source code is available at https://github.com/DominikM198/UrbanFusion.
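The core idea behind Stochastic Multimodal Fusion can be illustrated with a minimal sketch: during pretraining, each location's fusion step sees a random non-empty subset of the available modality embeddings, so the model learns to cope with missing modalities at inference. Everything below is a hypothetical simplification, not the paper's implementation: the function names are invented, and mean pooling stands in for the Transformer-based fusion module.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8  # embedding dimension (illustrative)

def stochastic_fuse(embeddings, training=True):
    """Fuse per-modality embeddings into one location representation.

    During training, each modality is dropped independently with
    probability 0.5, but at least one is always kept, so the fusion
    step is exposed to arbitrary modality subsets. At inference, all
    available modalities are used.
    """
    names = list(embeddings)
    if training:
        keep = [n for n in names if rng.random() < 0.5]
        if not keep:                      # guarantee a non-empty subset
            keep = [names[rng.integers(len(names))]]
    else:
        keep = names
    # Mean pooling as a stand-in for the Transformer fusion module.
    return np.mean([embeddings[n] for n in keep], axis=0)

# Example: a location observed through three modalities (POIs missing).
emb = {
    "street_view":    np.full(DIM, 1.0),
    "remote_sensing": np.full(DIM, 2.0),
    "map":            np.full(DIM, 3.0),
}
z_train = stochastic_fuse(emb, training=True)   # random subset fused
z_infer = stochastic_fuse(emb, training=False)  # all three fused
```

In the actual model, the surviving modality embeddings would be passed as tokens to a Transformer and trained with a contrastive objective; this sketch only captures the subset-sampling mechanism that enables inference with any subset of modalities.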