🤖 AI Summary
Existing geospatial representation learning methods lack generality, limiting their applicability across diverse domains and tasks. To address this, we propose the first nationwide general-purpose location encoder, integrating points of interest (POI), remote sensing imagery, demographic statistics, and billion-scale mobile trajectories. Leveraging a Vision Transformer-inspired spatial gridding strategy, it unifies human activity and natural geographic features into a coherent representation. We further introduce a novel multimodal geospatial CLIP alignment framework and establish GeoBench, a comprehensive benchmark of 11 cross-domain evaluation tasks spanning social, economic, and environmental domains, while empirically uncovering, for the first time, a scaling law of geospatial representations. Our method combines spatial tokenization, multimodal contrastive learning, graph neural networks, and remote sensing encoding. It achieves an average 35% improvement across all 11 tasks, with particularly notable gains in energy consumption prediction (+260%), retail consumption forecasting (+98%), and crime prediction (+95%).
📝 Abstract
Representation learning of geospatial locations remains a core challenge in achieving general geospatial intelligence. Current embedding methods often lack versatility, limiting their utility across diverse tasks in both human and natural domains. We present MobCLIP, the first nationwide general-purpose location encoder, integrating an unprecedented diversity of data modalities through effective and scalable multimodal fusion. Adopting a novel CLIP-based architecture, our framework aligns 100M+ POIs, nationwide remote sensing imagery, and structured demographic statistics with a billion-edge mobility graph. By tokenizing spatial locations into grid cells inspired by Vision Transformers, we establish a unified representation space bridging mobility patterns and multimodal features. To rigorously evaluate the general-purpose effectiveness of MobCLIP, we construct a benchmark dataset of 11 downstream prediction tasks across social, economic, and natural domains. Experiments show that MobCLIP, with four input modalities and a compact 128-dimensional representation space, outperforms state-of-the-art models by an average of 35% in general-purpose prediction. Thanks to the effective integration of human-centric modalities, the gains are particularly pronounced in human-centric tasks such as energy consumption (+260%), offline retail consumption (+98%), and crime case (+95%) prediction. Echoing LLM scaling laws, we further demonstrate scaling behavior in geospatial representation learning. We open-source code and pretrained models at: github.com.
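The abstract names two core mechanisms: ViT-style tokenization of locations into grid cells, and CLIP-style contrastive alignment of per-cell embeddings across modalities. A minimal sketch of both ideas follows; the cell size, function names, and the symmetric InfoNCE formulation are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def latlon_to_cell(lat, lon, cell_deg=0.01):
    """ViT-style spatial tokenization: map a coordinate to a grid-cell id.
    cell_deg (cell size in degrees) is a hypothetical choice."""
    row = int((lat + 90.0) // cell_deg)
    col = int((lon + 180.0) // cell_deg)
    n_cols = int(round(360.0 / cell_deg))
    return row * n_cols + col

def _log_softmax(x):
    # numerically stable row-wise log-softmax
    x = x - x.max(axis=1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=1, keepdims=True))

def clip_alignment_loss(z_a, z_b, temperature=0.07):
    """Symmetric InfoNCE (CLIP-style) loss: embeddings of the same grid
    cell from two modalities (e.g. mobility graph vs. POI/imagery
    features) should match; matching pairs sit on the diagonal of the
    similarity matrix."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature
    loss_ab = -np.diag(_log_softmax(logits)).mean()    # a -> b direction
    loss_ba = -np.diag(_log_softmax(logits.T)).mean()  # b -> a direction
    return (loss_ab + loss_ba) / 2.0
```

Minimizing this loss pulls the two modality encoders toward a shared representation space per grid cell: perfectly aligned embeddings give a near-zero loss, while mismatched cell pairs give a large one.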