Recurrent Cross-View Object Geo-Localization

📅 2025-09-16
🤖 AI Summary
Cross-view object geo-localization (CVOGL) aims to localize a target object in high-resolution satellite imagery given a query image with a point prompt. Existing approaches formulate CVOGL as a one-shot detection task, which leaves them vulnerable to cross-view feature noise and without a mechanism for error correction. This paper proposes ReCOT, a Recurrent Cross-view Object geo-localization Transformer that recasts CVOGL as a recurrent localization task. ReCOT uses learnable tokens that encode task intent from the query image and prompt embeddings and iteratively attend to the reference features to refine the predicted location. Two modules support this process: a knowledge distillation strategy that transfers segmentation priors from the Segment Anything Model (SAM) without adding inference cost, and a Reference Feature Enhancement Module (RFEM) whose hierarchical attention emphasizes object-relevant regions in the reference features. On standard CVOGL benchmarks, ReCOT achieves state-of-the-art performance with 60% fewer parameters than previous state-of-the-art approaches.

📝 Abstract
Cross-view object geo-localization (CVOGL) aims to determine the location of a specific object in high-resolution satellite imagery given a query image with a point prompt. Existing approaches treat CVOGL as a one-shot detection task, directly regressing object locations from cross-view information aggregation, but they are vulnerable to feature noise and lack mechanisms for error correction. In this paper, we propose ReCOT, a Recurrent Cross-view Object geo-localization Transformer, which reformulates CVOGL as a recurrent localization task. ReCOT introduces a set of learnable tokens that encode task-specific intent from the query image and prompt embeddings, and iteratively attend to the reference features to refine the predicted location. To enhance this recurrent process, we incorporate two complementary modules: (1) a SAM-based knowledge distillation strategy that transfers segmentation priors from the Segment Anything Model (SAM) to provide clearer semantic guidance without additional inference cost, and (2) a Reference Feature Enhancement Module (RFEM) that introduces a hierarchical attention to emphasize object-relevant regions in the reference features. Extensive experiments on standard CVOGL benchmarks demonstrate that ReCOT achieves state-of-the-art (SOTA) performance while reducing parameters by 60% compared to previous SOTA approaches.
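
The recurrent refinement described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it uses a single token, single-head dot-product attention, and an attention-weighted mean of cell positions as a stand-in for a learned regression head. The names `attend`, `recurrent_localize`, `features`, and `positions` are hypothetical and chosen only for illustration.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(token, features):
    """Single-head scaled dot-product attention of one token over reference features."""
    scale = math.sqrt(len(token))
    scores = [sum(t * f for t, f in zip(token, feat)) / scale for feat in features]
    weights = softmax(scores)
    context = [sum(w * feat[d] for w, feat in zip(weights, features))
               for d in range(len(token))]
    return context, weights

def recurrent_localize(token, features, positions, steps=3):
    """Refine a location estimate over several attention rounds (toy version)."""
    estimates = []
    for _ in range(steps):
        context, weights = attend(token, features)
        # Residual update: the token absorbs what it attended to, so the
        # next round attends more sharply to matching reference cells.
        token = [t + c for t, c in zip(token, context)]
        # Assumption: location = attention-weighted mean of cell positions,
        # used here in place of a learned regression head.
        x = sum(w * p[0] for w, p in zip(weights, positions))
        y = sum(w * p[1] for w, p in zip(weights, positions))
        estimates.append((x, y))
    return estimates

# Three reference cells; the first one matches the query token.
features = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
positions = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
for step, (x, y) in enumerate(recurrent_localize([1.0, 0.0], features, positions), 1):
    print(f"step {step}: estimate = ({x:.2f}, {y:.2f})")
```

In this toy setting each round moves the estimate closer to the matching cell, which is the kind of step-by-step correction a one-shot regressor cannot perform.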
Problem

Research questions and friction points this paper is trying to address.

Recurrent localization for cross-view object geo-localization
Addressing feature noise and error correction mechanisms
Reducing parameters while achieving state-of-the-art performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Recurrent Transformer for iterative location refinement
SAM-based knowledge distillation for semantic guidance
Hierarchical attention module for feature enhancement
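
The distillation idea above can be sketched as an auxiliary training loss. This is a hedged illustration only: the source does not state the loss ReCOT uses, so the binary cross-entropy form, the `weight` value, and the function names `mask_distillation_loss` and `training_loss` are assumptions. The teacher mask stands in for a SAM-derived segmentation prior.

```python
import math

def sigmoid(z):
    """Logistic function; fine for moderate logits as used in this sketch."""
    return 1.0 / (1.0 + math.exp(-z))

def mask_distillation_loss(student_logits, teacher_mask, eps=1e-7):
    """Mean binary cross-entropy between student mask logits and a teacher mask.

    Assumption: the teacher mask holds per-cell values in [0, 1] produced
    offline by a segmentation teacher; the paper's actual loss may differ.
    """
    total = 0.0
    for z, t in zip(student_logits, teacher_mask):
        p = min(max(sigmoid(z), eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(student_logits)

def training_loss(loc_loss, student_logits, teacher_mask, weight=0.5):
    """Localization loss plus a weighted distillation term (training only)."""
    return loc_loss + weight * mask_distillation_loss(student_logits, teacher_mask)

# A student that agrees with the teacher is penalized less than one that disagrees.
teacher = [1.0, 1.0, 0.0, 0.0]
print(mask_distillation_loss([4.0, 4.0, -4.0, -4.0], teacher))
print(mask_distillation_loss([-4.0, -4.0, 4.0, 4.0], teacher))
```

Because the teacher mask enters only through the training loss, nothing from the teacher is needed at inference, which matches the "without additional inference cost" claim in the abstract.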
Xiaohan Zhang
College of Information Science and Electronic Engineering, Zhejiang University
Si-Yuan Cao
Zhejiang University
image alignment, homography estimation, image fusion, place recognition
Xiaokai Bai
Ph.D. student, Zhejiang University
Multimodal Fusion, 3D object detection, 4D Radar Perception, autonomous driving
Yiming Li
College of Information Science and Electronic Engineering, Zhejiang University
Zhangkai Shen
College of Information Science and Electronic Engineering, Zhejiang University
Zhe Wu
College of Information Science and Electronic Engineering, Zhejiang University
Xiaoxi Hu
State Key Laboratory of Intelligent Green Vehicle and Mobility, Tsinghua University
Hui-liang Shen
College of Information Science and Electronic Engineering, Zhejiang University