TrajGANR: Trajectory-Centric Urban Multimodal Learning via Geospatially Aligned Neural Representations

📅 2026-05-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of existing geospatial multimodal self-supervised methods: they focus on aligning static observations at the same or nearby locations, and therefore struggle to capture the dynamic urban activity embedded in continuous human mobility trajectories. The authors propose TrajGANR, a trajectory-centric multimodal self-supervised learning framework that learns continuous neural trajectory representations and uses them for fine-grained alignment among trajectories, street-view images, and geographic coordinates. Because the representation can be queried at arbitrary points along a path, alignment is not restricted to discrete waypoints. Across four urban mobility and road understanding tasks, the model consistently outperforms existing geospatial multimodal frameworks and a trajectory-specific foundation model, and ablations attribute the gains to the alignment objective and the multimodal framework.
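To make "correspondence at arbitrary points along a path" concrete, here is a minimal geometric sketch. It is not the paper's learned representation: it only shows how a street-view image location that matches no waypoint can still be mapped to a position on the path, which a continuous representation could then be queried at. The function name `project_onto_path` and the use of planar (projected) coordinates are assumptions made for illustration.

```python
import numpy as np

def project_onto_path(waypoints, query):
    """Project `query` onto the polyline through `waypoints`.

    Assumes planar coordinates (e.g. metres in a projected CRS), not raw lat/lon.
    Returns the closest point on the path and its normalized arc-length position
    in [0, 1], where 0 is the first waypoint and 1 is the last.
    """
    waypoints = np.asarray(waypoints, dtype=float)
    query = np.asarray(query, dtype=float)

    starts, ends = waypoints[:-1], waypoints[1:]
    seg = ends - starts
    seg_len_sq = np.einsum("ij,ij->i", seg, seg)

    # Parameter t of the closest point on each segment, clipped to the segment.
    safe_len_sq = np.where(seg_len_sq > 0, seg_len_sq, 1.0)
    t = np.einsum("ij,ij->i", query - starts, seg) / safe_len_sq
    t = np.clip(t, 0.0, 1.0)

    candidates = starts + t[:, None] * seg
    dists = np.linalg.norm(candidates - query, axis=1)
    k = int(np.argmin(dists))  # index of the nearest segment

    seg_len = np.sqrt(seg_len_sq)
    total = seg_len.sum()
    arc = seg_len[:k].sum() + t[k] * seg_len[k]
    fraction = arc / total if total > 0 else 0.0
    return candidates[k], float(fraction)

# A path with three waypoints and an image taken beside the first segment.
path = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0)]
point, fraction = project_onto_path(path, (1.0, 0.5))
print(point, fraction)
```

In this example the image location projects onto the middle of the first segment, a quarter of the way along the whole path, even though no waypoint lies there.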

📝 Abstract

Multimodal self-supervised learning (MSSL) has emerged as a key paradigm for pretraining geospatial foundation models. However, existing geospatial MSSL methods are mainly designed for static pairs of modalities, such as satellite imagery, street-view imagery, and text, where learning is driven by aligning observations from the same or nearby locations. This assumption breaks down for human mobility trajectories, which represent continuous movement along paths rather than discrete observations at individual locations. Although trajectories are important for urban understanding through their ability to capture human activity across roads, neighborhoods, and places over time, they remain largely underexplored in current geospatial MSSL frameworks. We present TrajGANR, a novel trajectory-centric geospatial MSSL framework that aligns continuous movement patterns with static, location-based observations. TrajGANR learns a continuous neural representation of trajectories at arbitrary points along each path, which enables fine-grained alignment with nearby street-view images, even when they are not co-located with any trajectory waypoints. We leverage this capability to introduce an MSSL objective that jointly aligns three modalities: trajectories, street-view images, and their geographic locations. We evaluate TrajGANR on four urban mobility and road understanding tasks. Across these tasks, TrajGANR consistently outperforms existing geospatial MSSL frameworks and a trajectory-specific foundation model. Ablation studies further demonstrate that our proposed MSSL objective and the multimodal learning framework are the primary drivers of these improvements, highlighting the importance of fine-grained geospatial alignment over coarser aggregation, as well as geospatial multimodal learning.
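The abstract does not spell out the form of the three-modality objective. One common way to align several modalities is to sum pairwise symmetric InfoNCE (contrastive) losses, and the sketch below illustrates that generic pattern only. The function names, the temperature value, and the equal weighting of the three pairs are all assumptions, not details taken from the paper.

```python
import numpy as np

def _log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE loss; row i of `a` and row i of `b` form a positive pair."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature  # cosine similarities scaled by temperature
    idx = np.arange(len(a))
    loss_ab = -_log_softmax(logits, axis=1)[idx, idx].mean()  # a -> b retrieval
    loss_ba = -_log_softmax(logits, axis=0)[idx, idx].mean()  # b -> a retrieval
    return 0.5 * (loss_ab + loss_ba)

def three_way_loss(traj, image, loc, temperature=0.07):
    """Average of the three pairwise losses (equal weights are an assumption)."""
    return (
        info_nce(traj, image, temperature)
        + info_nce(traj, loc, temperature)
        + info_nce(image, loc, temperature)
    ) / 3.0

# Toy check with random embeddings standing in for encoder outputs.
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))
aligned = three_way_loss(x, x, x)
shuffled = three_way_loss(x, x[::-1], np.roll(x, 1, axis=0))
print(aligned < shuffled)
```

In a trajectory-centric setting, the `traj` rows would be trajectory embeddings queried at the path positions nearest each image, so that each row triple describes the same place.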

Problem

Research questions and friction points this paper is trying to address.

trajectory

multimodal self-supervised learning

geospatial alignment

urban mobility

continuous movement

Innovation

Methods, ideas, or system contributions that make the work stand out.

trajectory-centric learning

geospatially aligned neural representations

multimodal self-supervised learning