UrbanFusion: Stochastic Multimodal Fusion for Contrastive Learning of Robust Spatial Representations

📅 2025-10-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing urban forecasting models are task-specific, while prevailing geospatial foundation models suffer from limited modality support and ineffective multimodal fusion. Method: We propose UrbanFusion, a multimodal geospatial foundation model for urban representation that unifies heterogeneous geodata, including street-view images, remote sensing imagery, vector maps, and points of interest (POIs). A stochastic multimodal fusion mechanism couples modality-specific encoders with a Transformer-based fusion module, trained with contrastive learning for robust representations. The model accepts arbitrary subsets of input modalities, which supports generalization under data scarcity and cross-regional deployment. Contribution/Results: Evaluated on 41 housing-price and public-health prediction tasks across 56 cities, UrbanFusion consistently outperforms state-of-the-art GeoAI baselines, improving location-encoding accuracy and cross-region transfer and thereby easing the dual bottlenecks of modality coverage and fusion in geospatial foundation models.
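The fusion pattern described above, per-modality encoders feeding a shared Transformer that tolerates missing inputs, can be sketched in a few lines of PyTorch. The following is a minimal illustration under assumptions, not the paper's implementation: the class name `StochasticFusion`, the linear stand-ins for the real encoders, and the per-modality drop probability `p_drop` are all hypothetical.

```python
import torch
import torch.nn as nn

class StochasticFusion(nn.Module):
    """Sketch: modality-specific encoders + Transformer fusion that
    tolerates arbitrary subsets of input modalities."""

    def __init__(self, modality_dims: dict, d_model: int = 256, n_layers: int = 4):
        super().__init__()
        # Linear projections stand in for the real street-view /
        # remote-sensing / map / POI encoders.
        self.proj = nn.ModuleDict(
            {name: nn.Linear(dim, d_model) for name, dim in modality_dims.items()}
        )
        # Learned embedding marking which modality each token came from.
        self.modality_emb = nn.ParameterDict(
            {name: nn.Parameter(torch.zeros(d_model)) for name in modality_dims}
        )
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, feats: dict, p_drop: float = 0.5) -> torch.Tensor:
        tokens = []
        for name, x in feats.items():
            # Assumed reading of "stochastic fusion": during training, drop
            # whole modalities at random so the model learns from subsets.
            if self.training and torch.rand(()).item() < p_drop:
                continue
            tokens.append(self.proj[name](x) + self.modality_emb[name])
        if not tokens:  # guarantee at least one surviving modality
            name, x = next(iter(feats.items()))
            tokens.append(self.proj[name](x) + self.modality_emb[name])
        seq = self.fusion(torch.stack(tokens, dim=1))  # (batch, n_present, d_model)
        return seq.mean(dim=1)                         # pooled location embedding
```

Because the fusion module only ever sees the tokens that are present, the same weights can serve full-modality training cities and sparse-data deployment regions alike.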

📝 Abstract
Forecasting urban phenomena such as housing prices and public health indicators requires the effective integration of various geospatial data. Current methods primarily utilize task-specific models, while recent foundation models for spatial representations often support only limited modalities and lack multimodal fusion capabilities. To overcome these challenges, we present UrbanFusion, a Geo-Foundation Model (GeoFM) that features Stochastic Multimodal Fusion (SMF). The framework employs modality-specific encoders to process different types of inputs, including street view imagery, remote sensing data, cartographic maps, and points-of-interest (POI) data. These multimodal inputs are integrated via a Transformer-based fusion module that learns unified representations. An extensive evaluation across 41 tasks in 56 cities worldwide demonstrates UrbanFusion's strong generalization and predictive performance compared to state-of-the-art GeoAI models. Specifically, it 1) outperforms prior foundation models on location-encoding, 2) allows multimodal input during inference, and 3) generalizes well to regions unseen during training. UrbanFusion can flexibly utilize any subset of available modalities for a given location during both pretraining and inference, enabling broad applicability across diverse data availability scenarios. All source code is available at https://github.com/DominikM198/UrbanFusion.
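The abstract does not spell out the contrastive objective. One plausible reading, consistent with the title and the stochastic fusion mechanism, is an InfoNCE-style loss in which two embeddings of the same location, each computed from an independently sampled modality subset, form a positive pair and all other locations in the batch serve as negatives. The sketch below illustrates that assumed setup; the function name and temperature value are placeholders.

```python
import torch
import torch.nn.functional as F

def info_nce(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Contrastive loss between two 'views' of the same batch of locations,
    e.g. embeddings computed from two randomly sampled modality subsets."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.T / tau                  # (batch, batch) cosine similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    # Diagonal entries are positives; all other locations are negatives.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.T, targets))
```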
Problem

Research questions and friction points this paper is trying to address.

Urban forecasting requires integrating diverse geospatial data sources
Existing spatial foundation models support only a limited set of modalities
Data availability varies across locations, so fusion must cope with missing modalities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Stochastic Multimodal Fusion (SMF) integrates heterogeneous geospatial data
Transformer-based fusion module learns unified spatial representations
Any subset of modalities can be used during pretraining and inference (see the usage sketch below)
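To make the flexible-subset claim concrete, the snippet below reuses the hypothetical `StochasticFusion` sketch from the summary section and queries it once with all four modalities and once with only two, as might happen in a region without street-view coverage. All modality names and feature dimensions are illustrative, not the paper's configuration.

```python
import torch  # StochasticFusion is the sketch defined earlier

model = StochasticFusion({"street_view": 512, "remote_sensing": 768,
                          "map": 256, "poi": 128})
model.eval()  # inference: no stochastic modality dropping

batch = 4
full = {"street_view": torch.randn(batch, 512),
        "remote_sensing": torch.randn(batch, 768),
        "map": torch.randn(batch, 256),
        "poi": torch.randn(batch, 128)}
partial = {"remote_sensing": full["remote_sensing"], "poi": full["poi"]}

emb_full = model(full)        # all modalities available
emb_partial = model(partial)  # sparse-data region
print(emb_full.shape, emb_partial.shape)  # torch.Size([4, 256]) twice
```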