Spatiotemporal Contrastive Learning for Cross-View Video Localization in Unstructured Off-road Terrains

📅 2025-06-05
🤖 AI Summary
This work addresses 3-DoF visual localization across heterogeneous viewpoints (vehicle-mounted video ↔ satellite imagery) in unstructured off-road environments under GPS-denied conditions, where repetitive vegetation, irregular terrain morphology, and significant seasonal appearance variations cause perceptual ambiguity and make cross-view alignment difficult. We propose MoViX, a self-supervised visual localization framework featuring: (i) pose-aware positive sample selection and temporally aligned hard negative mining; (ii) motion-guided frame sampling and a lightweight temporal aggregator; and (iii) entropy-guided temperature scaling for multi-hypothesis tracking within Monte Carlo Localization. Trained on under 30 minutes of TartanDrive 2.0 data and tested over a 12.29 km trajectory, the method outperforms state-of-the-art baselines: localization error is within 25 m 93% of the time and within 50 m 100% of the time in unseen regions. Generalization is further demonstrated on a real-world off-road dataset from a geographically distinct site with a different robot platform.
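
As a rough illustration of what pose-aware positive selection could look like, the sketch below labels satellite-crop candidates as positives only when they are both close to the query position and similar in heading. The function names, the `(x, y, yaw)` pose layout, and the 10 m / 30° thresholds are assumptions made for this example and are not taken from the paper; the temporally aligned hard negative mining step is not modelled here.

```python
import math


def yaw_difference(a, b):
    """Smallest absolute angle between two headings, in radians."""
    d = (a - b + math.pi) % (2.0 * math.pi) - math.pi
    return abs(d)


def label_candidates(query_pose, candidate_poses,
                     max_dist_m=10.0, max_yaw_rad=math.radians(30)):
    """Split candidate indices into positives and negatives for one query.

    Hypothetical rule: a candidate is positive only if it is within
    max_dist_m of the query AND its heading differs by at most max_yaw_rad.
    Poses are (x, y, yaw) tuples in metres and radians.
    """
    qx, qy, qyaw = query_pose
    positives, negatives = [], []
    for i, (cx, cy, cyaw) in enumerate(candidate_poses):
        near = math.hypot(cx - qx, cy - qy) <= max_dist_m
        aligned = yaw_difference(cyaw, qyaw) <= max_yaw_rad
        if near and aligned:
            positives.append(i)
        else:
            negatives.append(i)
    return positives, negatives
```

Under this rule, a candidate at the right location but facing the opposite direction ends up among the negatives, which is one plausible way to encourage the directional discrimination the summary mentions.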

📝 Abstract
Robust cross-view 3-DoF localization in GPS-denied, off-road environments remains challenging due to (1) perceptual ambiguities from repetitive vegetation and unstructured terrain, and (2) seasonal shifts that significantly alter scene appearance, hindering alignment with outdated satellite imagery. To address this, we introduce MoViX, a self-supervised cross-view video localization framework that learns viewpoint- and season-invariant representations while preserving directional awareness essential for accurate localization. MoViX employs a pose-dependent positive sampling strategy to enhance directional discrimination and temporally aligned hard negative mining to discourage shortcut learning from seasonal cues. A motion-informed frame sampler selects spatially diverse frames, and a lightweight temporal aggregator emphasizes geometrically aligned observations while downweighting ambiguous ones. At inference, MoViX runs within a Monte Carlo Localization framework, using a learned cross-view matching module in place of handcrafted models. Entropy-guided temperature scaling enables robust multi-hypothesis tracking and confident convergence under visual ambiguity. We evaluate MoViX on the TartanDrive 2.0 dataset, training on under 30 minutes of data and testing over 12.29 km. Despite outdated satellite imagery, MoViX localizes within 25 meters of ground truth 93% of the time, and within 50 meters 100% of the time in unseen regions, outperforming state-of-the-art baselines without environment-specific tuning. We further demonstrate generalization on a real-world off-road dataset from a geographically distinct site with a different robot platform.
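
The abstract does not spell out how entropy guides the temperature, so the following is only a minimal sketch of one possible scheme. It assumes each particle has a cross-view similarity score, measures the normalized entropy of a probe softmax over those scores, and interpolates the temperature so that ambiguous score distributions yield flatter particle weights (keeping several hypotheses alive) while peaked ones yield sharper weights. The names `entropy_guided_weights`, `t_min`, `t_max`, and `t_probe`, as well as the linear interpolation rule, are hypothetical.

```python
import math


def softmax(scores, temperature):
    """Numerically stable softmax of scores / temperature."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_entropy(probs):
    """Shannon entropy divided by log(N), so the result lies in [0, 1]."""
    n = len(probs)
    if n < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(n)


def entropy_guided_weights(scores, t_min=0.05, t_max=1.0, t_probe=0.1):
    """Return (particle weights, temperature used).

    Assumed rule, not from the paper: the temperature is interpolated
    linearly between t_min and t_max by the normalized entropy of a
    probe distribution computed at the fixed temperature t_probe.
    """
    probe = softmax(scores, t_probe)
    h = normalized_entropy(probe)
    temperature = t_min + (t_max - t_min) * h
    return softmax(scores, temperature), temperature
```

With equal scores for all particles the probe entropy is maximal, the temperature goes to `t_max`, and the weights stay uniform; with one clearly dominant score the temperature drops toward `t_min` and nearly all weight concentrates on that particle.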
Problem

Research questions and friction points this paper is trying to address.

Robust 3-DoF localization in GPS-denied off-road terrains
Overcoming perceptual ambiguities from repetitive vegetation and terrain
Addressing seasonal appearance shifts for satellite alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-supervised cross-view video localization framework
Pose-dependent positive sampling strategy
Motion-informed frame sampler and temporal aggregator
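
To make the idea of a motion-informed frame sampler concrete, here is a small sketch that keeps a frame only after the vehicle has travelled a minimum distance since the last kept frame, so that stationary or slow segments do not flood the clip with near-duplicate views. The function name, the 2-D `(x, y)` odometry input, and the `min_spacing_m` threshold are assumptions for illustration; the paper's actual sampler may use different motion cues.

```python
import math


def sample_frames_by_motion(positions, min_spacing_m=2.0):
    """Return indices of frames to keep.

    positions: list of (x, y) odometry positions in metres, one per frame.
    A frame is kept once the path length accumulated since the last kept
    frame reaches min_spacing_m (hypothetical threshold).
    """
    if not positions:
        return []
    kept = [0]          # always keep the first frame
    travelled = 0.0     # path length since the last kept frame
    for i in range(1, len(positions)):
        (x0, y0), (x1, y1) = positions[i - 1], positions[i]
        travelled += math.hypot(x1 - x0, y1 - y0)
        if travelled >= min_spacing_m:
            kept.append(i)
            travelled = 0.0
    return kept
```

For example, with positions `[(0, 0), (0, 0), (1, 0), (2, 0), (2, 0), (5, 0)]` and a 2 m spacing, the frames captured while the vehicle is stopped are skipped and only indices 0, 3, and 5 are kept.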