BGG: Bridging the Geometric Gap between Cross-View images by Vision Foundation Model Adaptation for Geo-Localization

📅 2026-05-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses cross-view geo-localization between aerial platforms such as drones and satellites, where large geometric discrepancies between views make matching difficult. To bridge this geometric gap while preserving the generalization of the underlying model, the authors propose BGG, a parameter-efficient adaptation framework built on a vision foundation model (e.g., DINOv3). BGG combines a multi-granularity feature enhancement adapter, which uses multi-scale dilated convolutions, with a frequency-aware structural aggregation module, which modulates patch tokens in the frequency domain, and then fuses the resulting local features with the [CLS] token. On the University-1652 and SUES-200 benchmarks, the method is reported to achieve state-of-the-art localization accuracy at low training cost.
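To make the idea of a parameter-efficient adapter with multi-scale dilated convolutions more concrete, the following is a minimal NumPy sketch. It is not the authors' implementation: the bottleneck layout, the depthwise 3x3 kernels, the dilation rates (1, 2, 3), branch averaging, and the zero-initialized up-projection are all assumptions chosen for illustration, and names such as `multi_scale_adapter` are hypothetical.

```python
import numpy as np

def dilated_depthwise_conv3x3(x, kernel, dilation):
    """Depthwise 3x3 convolution with dilation and 'same' zero padding.

    x: (H, W, C) feature map; kernel: (3, 3, C) per-channel weights.
    """
    h, w, _ = x.shape
    pad = dilation  # a 3x3 kernel with dilation d reaches d pixels each way
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    out = np.zeros(x.shape, dtype=float)
    for i in range(3):
        for j in range(3):
            dy, dx = i * dilation, j * dilation
            out += xp[dy:dy + h, dx:dx + w, :] * kernel[i, j, :]
    return out

def multi_scale_adapter(tokens, grid_hw, w_down, w_up, kernels, dilations=(1, 2, 3)):
    """Hypothetical bottleneck adapter applied to frozen backbone patch tokens.

    tokens: (H*W, D) patch tokens; w_down: (D, r); w_up: (r, D).
    """
    h, w = grid_hw
    z = (tokens @ w_down).reshape(h, w, -1)           # project down to r channels
    branches = [dilated_depthwise_conv3x3(z, k, d)    # one branch per dilation rate
                for k, d in zip(kernels, dilations)]
    z = np.maximum(np.mean(branches, axis=0), 0.0)    # average branches, then ReLU
    return tokens + z.reshape(h * w, -1) @ w_up       # residual connection

rng = np.random.default_rng(0)
h, w, dim, r = 8, 8, 32, 4
tokens = rng.standard_normal((h * w, dim))
w_down = rng.standard_normal((dim, r)) * 0.02
w_up = np.zeros((r, dim))  # assumed zero init: the adapter starts as an identity map
kernels = [rng.standard_normal((3, 3, r)) * 0.1 for _ in range(3)]
out = multi_scale_adapter(tokens, (h, w), w_down, w_up, kernels)
print(out.shape)
```

In this sketch only `w_down`, `w_up`, and the small kernels would be trained, which is where the parameter efficiency of an adapter-style design comes from; the backbone that produced `tokens` stays frozen.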

📝 Abstract

Geometric differences between cross-view images, such as drone and satellite views, make Cross-View Geo-Localization (CVGL), which determines the geolocation of an image through image retrieval, considerably harder. To improve CVGL performance, this paper proposes BGG, a parameter-efficient adaptation framework that bridges the geometric gap across views on top of a vision foundation model (VFM), e.g., DINOv3. BGG leverages the general visual representations and generalization capability of the VFM while capturing robust, consistent features from cross-view images. It consists mainly of a Multi-granularity Feature Enhancement Adapter (MFEA) and a Frequency-Aware Structural Aggregation (FASA) module. MFEA improves the scale adaptability and viewpoint robustness of features through multi-level dilated convolutions, bridging the cross-view geometric gap at a small training cost. Because the [CLS] token lacks the spatial detail needed for precise image retrieval and localization, the FASA module modulates patch tokens in the frequency domain and aggregates them adaptively to enhance local structural features. Finally, BGG fuses the enhanced local features with the [CLS] token for more accurate CVGL. Extensive experiments on the University-1652 and SUES-200 datasets show that BGG has significant advantages over other methods and achieves state-of-the-art localization performance with low training costs.
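The frequency-domain modulation, adaptive aggregation, [CLS] fusion, and retrieval steps described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the FASA module itself: the abstract does not specify the filter, the pooling rule, or the fusion operator, so the hand-picked low-pass gain (standing in for a learned parameter), softmax pooling, concatenation, and all function names here are hypothetical.

```python
import numpy as np

def frequency_modulate(patch_tokens, grid_hw, freq_gain):
    """Scale each 2D spatial frequency of the patch grid by a per-frequency gain."""
    h, w = grid_hw
    grid = patch_tokens.reshape(h, w, -1)
    spec = np.fft.fft2(grid, axes=(0, 1)) * freq_gain[:, :, None]
    return np.fft.ifft2(spec, axes=(0, 1)).real.reshape(h * w, -1)

def adaptive_aggregate(patch_tokens, score_vec):
    """Softmax-weighted pooling of patch tokens into a single local descriptor."""
    logits = patch_tokens @ score_vec
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ patch_tokens

def describe(cls_token, patch_tokens, grid_hw, freq_gain, score_vec):
    """Fuse the global [CLS] token with the aggregated local feature (assumed: concat)."""
    local = adaptive_aggregate(
        frequency_modulate(patch_tokens, grid_hw, freq_gain), score_vec)
    v = np.concatenate([cls_token, local])
    return v / np.linalg.norm(v)  # unit length, so dot product = cosine similarity

rng = np.random.default_rng(0)
h, w, dim = 4, 4, 16
fy, fx = np.fft.fftfreq(h), np.fft.fftfreq(w)
# Assumed stand-in for a learned filter: a smooth low-pass gain.
freq_gain = 1.0 / (1.0 + np.hypot(fy[:, None], fx[None, :]))
score_vec = rng.standard_normal(dim) * 0.1

# Toy gallery of five "satellite" images given as (cls_token, patch_tokens) pairs.
gallery_raw = [(rng.standard_normal(dim), rng.standard_normal((h * w, dim)))
               for _ in range(5)]
gallery = np.stack([describe(c, p, (h, w), freq_gain, score_vec)
                    for c, p in gallery_raw])

# Toy "drone" query: a lightly perturbed copy of gallery item 3.
c3, p3 = gallery_raw[3]
query = describe(c3 + 0.05 * rng.standard_normal(dim),
                 p3 + 0.05 * rng.standard_normal((h * w, dim)),
                 (h, w), freq_gain, score_vec)

best = int(np.argmax(gallery @ query))  # retrieval: nearest gallery descriptor
print("retrieved gallery index:", best)
```

Geo-localization then amounts to returning the known location attached to the retrieved gallery image. Real drone and satellite views differ far more than this toy perturbation, which is the gap the adapter and aggregation modules are meant to close.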

Problem

Research questions and friction points this paper is trying to address.

Cross-View Geo-Localization

Geometric Gap

Vision Foundation Model

Image Retrieval

Geolocation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-View Geo-Localization

Vision Foundation Model Adaptation

Geometric Gap Bridging