🤖 AI Summary
Traditional spatial embedding methods over-rely on structural proximity while neglecting fine-grained semantic cues—particularly POI names—leading to inadequate characterization of urban structure and semantics. To address this, we propose CaLLiPer+, the first model to jointly encode raw POI names and categorical labels via multimodal contrastive learning, integrating pre-trained textual representations (BERT) with learned spatial embeddings. This explicitly captures the complementary semantics of names and categories. Our key contributions are: (i) the systematic introduction of POI names as a primary semantic signal for spatial representation; and (ii) empirical validation that pre-trained language models yield substantial gains in spatial representation quality. Evaluated on two downstream tasks—land-use classification and socioeconomic distribution mapping—CaLLiPer+ achieves 4–11% performance improvements. It also significantly improves location retrieval accuracy and the model's ability to capture complex urban concepts.
📝 Abstract
Spatial representations that capture both structural and semantic characteristics of urban environments are essential for urban modeling. Traditional spatial embeddings often prioritize spatial proximity while underutilizing fine-grained contextual information from places. To address this limitation, we introduce CaLLiPer+, an extension of the CaLLiPer model that systematically integrates Point-of-Interest (POI) names alongside categorical labels within a multimodal contrastive learning framework. We evaluate its effectiveness on two downstream tasks, land use classification and socioeconomic status distribution mapping, demonstrating consistent performance gains of 4% to 11% over baseline methods. Additionally, we show that incorporating POI names enhances location retrieval, enabling models to capture complex urban concepts with greater precision. Ablation studies further reveal the complementary role of POI names and the advantages of leveraging pretrained text encoders for spatial representations. Overall, our findings highlight the potential of integrating fine-grained semantic attributes and multimodal learning techniques to advance the development of urban foundation models.
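The multimodal contrastive framework described above can be illustrated with a minimal NumPy sketch of a CLIP-style symmetric InfoNCE objective, where matched (location, POI text) pairs are positives and all other in-batch pairs are negatives. The random toy embeddings, dimensions, and temperature below are illustrative assumptions, not the paper's actual implementation; in CaLLiPer+, the location side would come from a coordinate encoder and the text side from a pretrained text encoder (e.g. BERT) applied to "name + category" strings.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x, axis=-1):
    # Project embeddings onto the unit sphere so dot products are cosines.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Toy stand-ins for the two modalities (8 POIs, 64-dim shared space).
loc_emb = l2_normalize(rng.normal(size=(8, 64)))  # location-encoder outputs
txt_emb = l2_normalize(rng.normal(size=(8, 64)))  # text-encoder outputs

def symmetric_info_nce(loc, txt, temperature=0.07):
    """Symmetric InfoNCE: the i-th location and i-th text are a positive
    pair; every other pairing in the batch serves as a negative."""
    logits = loc @ txt.T / temperature  # (batch, batch) similarity matrix
    n = len(loc)

    def ce(lg):
        # Row-wise cross-entropy with the positive on the diagonal.
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[np.arange(n), np.arange(n)].mean()

    # Average the location-to-text and text-to-location directions.
    return 0.5 * (ce(logits) + ce(logits.T))

loss = symmetric_info_nce(loc_emb, txt_emb)
```

A perfectly aligned batch (identical location and text embeddings) drives the loss toward zero, while unrelated pairs keep it near `log(batch_size)`, which is what pushes the two encoders into a shared semantic space during training.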