VIRD: View-Invariant Representation through Dual-Axis Transformation for Cross-View Pose Estimation

📅 2026-03-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of sparse spatial correspondences and inaccurate pose estimation arising from the substantial viewpoint disparity between ground-level and satellite imagery. To bridge this cross-view gap, the authors propose a dual-axis transformation strategy: a polar coordinate transformation establishes horizontal correspondences, while a context-enhanced positional attention mechanism aligns vertical structures. This explicit alignment is further reinforced by a novel view-reconstruction loss that promotes both viewpoint invariance and cross-view consistency, enabling high-precision pose estimation without requiring prior orientation information. Evaluated on the KITTI and VIGOR benchmarks, the method reduces median position errors by 50.7% on KITTI and 18.0% on VIGOR, and median orientation errors by 76.5% and 46.8% respectively, significantly outperforming existing approaches.
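
The polar transformation referred to above is, in essence, a resampling of the aerial image around its centre so that azimuth maps to the horizontal axis, matching the layout of a ground panorama. Below is a minimal NumPy sketch; the output size, nearest-neighbour sampling, and orientation conventions are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def polar_transform(sat: np.ndarray, out_h: int = 128, out_w: int = 512) -> np.ndarray:
    """Resample a square satellite image (S, S, C) onto a polar grid.

    Rows index radial distance from the image centre (top = far, bottom = near);
    columns index azimuth, so each output column corresponds to one viewing
    direction of a ground panorama.
    """
    s = sat.shape[0]                                  # assumes a square input
    cx = cy = (s - 1) / 2.0                           # centre of the aerial image
    r_max = s / 2.0                                   # largest radius kept in bounds

    rows = np.arange(out_h).reshape(-1, 1)            # radial index, shape (out_h, 1)
    cols = np.arange(out_w).reshape(1, -1)            # azimuth index, shape (1, out_w)
    radius = r_max * (1.0 - rows / out_h)             # top row = farthest from centre
    theta = 2.0 * np.pi * cols / out_w                # full 360-degree sweep

    # Source pixel for every polar grid cell (nearest-neighbour sampling).
    x = np.clip(np.round(cx + radius * np.sin(theta)).astype(int), 0, s - 1)
    y = np.clip(np.round(cy - radius * np.cos(theta)).astype(int), 0, s - 1)
    return sat[y, x]
```

With this layout, a ground panorama of width `out_w` shares its azimuth axis with the transformed satellite image, so a horizontal shift between the two corresponds to a change in camera heading.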

📝 Abstract
Accurate global localization is crucial for autonomous driving and robotics, but GNSS-based approaches often degrade due to occlusion and multipath effects. As an emerging alternative, cross-view pose estimation predicts the 3-DoF camera pose of a ground-view image with respect to a geo-referenced satellite image. However, existing methods struggle to bridge the significant viewpoint gap between the ground and satellite views, mainly due to limited spatial correspondences. We propose a novel cross-view pose estimation method that constructs view-invariant representations through dual-axis transformation (VIRD). VIRD first applies a polar transformation to the satellite view to establish horizontal correspondence, then uses context-enhanced positional attention on the ground and polar-transformed satellite features to resolve vertical misalignment, explicitly mitigating the viewpoint gap. A view-reconstruction loss is introduced to further strengthen view invariance, encouraging the derived representations to reconstruct both the original and the cross-view images. Experiments on the KITTI and VIGOR datasets demonstrate that VIRD outperforms state-of-the-art methods without requiring orientation priors, reducing median position and orientation errors by 50.7% and 76.5% on KITTI, and by 18.0% and 46.8% on VIGOR, respectively.
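
To make the two remaining components concrete, here are two hedged PyTorch sketches. First, one plausible reading of the vertical-alignment step: attention computed independently per image column, with queries from the ground features and keys/values from the polar-transformed satellite features. The module name and the per-column formulation are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class VerticalPositionalAttention(nn.Module):
    """Per-column attention: each ground-view column attends over the heights
    of the corresponding polar-satellite column (hypothetical module)."""

    def __init__(self, channels: int):
        super().__init__()
        self.q = nn.Conv2d(channels, channels, 1)  # queries from ground features
        self.k = nn.Conv2d(channels, channels, 1)  # keys from polar satellite features
        self.v = nn.Conv2d(channels, channels, 1)  # values from polar satellite features

    def forward(self, f_ground: torch.Tensor, f_polar: torch.Tensor) -> torch.Tensor:
        # Both inputs: (B, C, H, W), with widths aligned by the polar transform.
        B, C, H, W = f_ground.shape
        # Fold width into the batch so attention runs along the vertical axis only.
        q = self.q(f_ground).permute(0, 3, 2, 1).reshape(B * W, H, C)
        k = self.k(f_polar).permute(0, 3, 2, 1).reshape(B * W, H, C)
        v = self.v(f_polar).permute(0, 3, 2, 1).reshape(B * W, H, C)
        attn = torch.softmax(q @ k.transpose(1, 2) / C ** 0.5, dim=-1)  # (B*W, H, H)
        out = (attn @ v).reshape(B, W, H, C).permute(0, 3, 2, 1)        # (B, C, H, W)
        return out
```

Second, a minimal sketch of a view-reconstruction loss in the spirit the abstract describes: the shared, view-invariant features must reconstruct both the same-view image and the opposite view. The decoder interface and the L1 objective are assumed for illustration, not taken from the paper.

```python
import torch.nn.functional as F

def view_reconstruction_loss(f_ground, f_sat, dec_ground, dec_sat,
                             img_ground, img_sat):
    """Sum of same-view and cross-view L1 reconstruction terms.

    f_ground / f_sat : view-invariant feature maps from the two branches
    dec_ground       : decoder producing a ground-view image from features
    dec_sat          : decoder producing a (polar) satellite image
    """
    return (F.l1_loss(dec_ground(f_ground), img_ground)   # ground -> ground
            + F.l1_loss(dec_sat(f_sat), img_sat)          # sat    -> sat
            + F.l1_loss(dec_ground(f_sat), img_ground)    # sat    -> ground (cross)
            + F.l1_loss(dec_sat(f_ground), img_sat))      # ground -> sat    (cross)
```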
Problem

Research questions and friction points this paper is trying to address.

cross-view pose estimation
viewpoint gap
spatial correspondences
view-invariant representation
camera pose
Innovation

Methods, ideas, or system contributions that make the work stand out.

dual-axis transformation
view-invariant representation
cross-view pose estimation
polar transformation
positional attention