VIRD: View-Invariant Representation through Dual-Axis Transformation for Cross-View Pose Estimation

📅 2026-03-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of sparse spatial correspondences and inaccurate pose estimation arising from the substantial viewpoint disparity between ground-level and satellite imagery. To bridge this cross-view gap, the authors propose a dual-axis transformation strategy: a polar coordinate transformation establishes horizontal correspondences, while a context-enhanced positional attention mechanism aligns vertical structures. This explicit alignment is further reinforced by a novel view-reconstruction loss that promotes both viewpoint invariance and cross-view consistency, enabling high-precision pose estimation without requiring prior orientation information. Evaluated on the KITTI and VIGOR benchmarks, the method reduces median position errors by 50.7% on KITTI and 18.0% on VIGOR, and median orientation errors by 76.5% and 46.8% respectively, significantly outperforming existing approaches.
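
The polar transformation referred to above is, in essence, a resampling of the aerial image around its centre so that azimuth maps to the horizontal axis, matching the layout of a ground panorama. Below is a minimal NumPy sketch; the output size, nearest-neighbour sampling, and orientation conventions are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def polar_transform(sat: np.ndarray, out_h: int = 128, out_w: int = 512) -> np.ndarray:
    """Resample a square satellite image (S, S, C) onto a polar grid.

    Rows index radial distance from the image centre (top = far, bottom = near);
    columns index azimuth, so each output column corresponds to one viewing
    direction of a ground panorama.
    """
    s = sat.shape[0]                                  # assumes a square input
    cx = cy = (s - 1) / 2.0                           # centre of the aerial image
    r_max = s / 2.0                                   # largest radius kept in bounds

    rows = np.arange(out_h).reshape(-1, 1)            # radial index, shape (out_h, 1)
    cols = np.arange(out_w).reshape(1, -1)            # azimuth index, shape (1, out_w)
    radius = r_max * (1.0 - rows / out_h)             # top row = farthest from centre
    theta = 2.0 * np.pi * cols / out_w                # full 360-degree sweep

    # Source pixel for every polar grid cell (nearest-neighbour sampling).
    x = np.clip(np.round(cx + radius * np.sin(theta)).astype(int), 0, s - 1)
    y = np.clip(np.round(cy - radius * np.cos(theta)).astype(int), 0, s - 1)
    return sat[y, x]
```

With this layout, a ground panorama of width `out_w` shares its azimuth axis with the transformed satellite image, so a horizontal shift between the two corresponds to a change in camera heading.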

📝 Abstract
Accurate global localization is crucial for autonomous driving and robotics, but GNSS-based approaches often degrade due to occlusion and multipath effects. As an emerging alternative, cross-view pose estimation predicts the 3-DoF camera pose of a ground-view image with respect to a geo-referenced satellite image. However, existing methods struggle to bridge the significant viewpoint gap between the ground and satellite views, mainly due to limited spatial correspondences. We propose a novel cross-view pose estimation method that constructs view-invariant representations through dual-axis transformation (VIRD). VIRD first applies a polar transformation to the satellite view to establish horizontal correspondence, then uses context-enhanced positional attention on the ground and polar-transformed satellite features to resolve vertical misalignment, explicitly mitigating the viewpoint gap. A view-reconstruction loss is introduced to further strengthen view invariance, encouraging the derived representations to reconstruct both the original and the cross-view images. Experiments on the KITTI and VIGOR datasets demonstrate that VIRD outperforms state-of-the-art methods without requiring orientation priors, reducing median position and orientation errors by 50.7% and 76.5% on KITTI, and by 18.0% and 46.8% on VIGOR, respectively.
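
To make the two remaining components concrete, here are two hedged PyTorch sketches. First, one plausible reading of the vertical-alignment step: attention computed independently per image column, with queries from the ground features and keys/values from the polar-transformed satellite features. The module name and the per-column formulation are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class VerticalPositionalAttention(nn.Module):
    """Per-column attention: each ground-view column attends over the heights
    of the corresponding polar-satellite column (hypothetical module)."""

    def __init__(self, channels: int):
        super().__init__()
        self.q = nn.Conv2d(channels, channels, 1)  # queries from ground features
        self.k = nn.Conv2d(channels, channels, 1)  # keys from polar satellite features
        self.v = nn.Conv2d(channels, channels, 1)  # values from polar satellite features

    def forward(self, f_ground: torch.Tensor, f_polar: torch.Tensor) -> torch.Tensor:
        # Both inputs: (B, C, H, W), with widths aligned by the polar transform.
        B, C, H, W = f_ground.shape
        # Fold width into the batch so attention runs along the vertical axis only.
        q = self.q(f_ground).permute(0, 3, 2, 1).reshape(B * W, H, C)
        k = self.k(f_polar).permute(0, 3, 2, 1).reshape(B * W, H, C)
        v = self.v(f_polar).permute(0, 3, 2, 1).reshape(B * W, H, C)
        attn = torch.softmax(q @ k.transpose(1, 2) / C ** 0.5, dim=-1)  # (B*W, H, H)
        out = (attn @ v).reshape(B, W, H, C).permute(0, 3, 2, 1)        # (B, C, H, W)
        return out
```

Second, a minimal sketch of a view-reconstruction loss in the spirit the abstract describes: the shared, view-invariant features must reconstruct both the same-view image and the opposite view. The decoder interface and the L1 objective are assumed for illustration, not taken from the paper.

```python
import torch.nn.functional as F

def view_reconstruction_loss(f_ground, f_sat, dec_ground, dec_sat,
                             img_ground, img_sat):
    """Sum of same-view and cross-view L1 reconstruction terms.

    f_ground / f_sat : view-invariant feature maps from the two branches
    dec_ground       : decoder producing a ground-view image from features
    dec_sat          : decoder producing a (polar) satellite image
    """
    return (F.l1_loss(dec_ground(f_ground), img_ground)   # ground -> ground
            + F.l1_loss(dec_sat(f_sat), img_sat)          # sat    -> sat
            + F.l1_loss(dec_ground(f_sat), img_ground)    # sat    -> ground (cross)
            + F.l1_loss(dec_sat(f_ground), img_sat))      # ground -> sat    (cross)
```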
Problem

Research questions and friction points this paper is trying to address.

cross-view pose estimation
viewpoint gap
spatial correspondences
view-invariant representation
camera pose
Innovation

Methods, ideas, or system contributions that make the work stand out.

dual-axis transformation
view-invariant representation
cross-view pose estimation
polar transformation
positional attention