🤖 AI Summary
Cross-view object geo-localization (CVOGL) aims to precisely localize a target object in high-resolution satellite imagery given a ground-level query image with point-based spatial cues. Existing approaches formulate CVOGL as a one-shot detection task, rendering them vulnerable to cross-view feature noise and leaving no mechanism for iterative refinement. This paper proposes ReCOT, the first method to recast CVOGL as a recurrent localization problem. ReCOT introduces learnable tokens that iteratively attend to reference features to refine the predicted location; distills segmentation priors from SAM to sharpen semantic guidance at no extra inference cost; and designs a Reference Feature Enhancement Module (RFEM) with hierarchical attention to emphasize object-relevant regions and improve matching robustness. Evaluated on standard benchmarks, ReCOT achieves state-of-the-art performance with 60% fewer parameters than the previous state of the art, demonstrating significant gains in both localization accuracy and efficiency.
📝 Abstract
Cross-view object geo-localization (CVOGL) aims to determine the location of a specific object in high-resolution satellite imagery given a query image with a point prompt. Existing approaches treat CVOGL as a one-shot detection task, directly regressing object locations from aggregated cross-view information, but they are vulnerable to feature noise and lack mechanisms for error correction. In this paper, we propose ReCOT, a Recurrent Cross-view Object geo-localization Transformer, which reformulates CVOGL as a recurrent localization task. ReCOT introduces a set of learnable tokens that encode task-specific intent from the query image and prompt embeddings, and iteratively attend to the reference features to refine the predicted location. To enhance this recurrent process, we incorporate two complementary modules: (1) a knowledge distillation strategy that transfers segmentation priors from the Segment Anything Model (SAM) to provide clearer semantic guidance without additional inference cost, and (2) a Reference Feature Enhancement Module (RFEM) that applies hierarchical attention to emphasize object-relevant regions in the reference features. Extensive experiments on standard CVOGL benchmarks demonstrate that ReCOT achieves state-of-the-art (SOTA) performance while reducing parameters by 60% compared to previous SOTA approaches.
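The core recurrence described above — learnable tokens repeatedly attending to reference (satellite) features and regressing an updated location — can be sketched in miniature. This is an illustrative NumPy toy, not the paper's implementation: `recurrent_localize`, `W_head`, and the single-head attention and mean-pooled regression head are all assumptions for clarity, standing in for ReCOT's transformer layers.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def recurrent_localize(tokens, ref_feats, W_head, n_iters=3):
    """Toy recurrent refinement loop (hypothetical, simplified).

    tokens    : (n_tokens, d) learnable tokens encoding query/prompt intent
    ref_feats : (h*w, d) flattened reference (satellite) feature map
    W_head    : (d, 2) toy regression head mapping pooled tokens to (x, y)
    Returns the list of per-iteration location predictions.
    """
    d = tokens.shape[-1]
    preds = []
    for _ in range(n_iters):
        # Cross-attention: tokens query the reference features.
        attn = softmax(tokens @ ref_feats.T / np.sqrt(d), axis=-1)
        # Residual update of the tokens with attended reference content.
        tokens = tokens + attn @ ref_feats
        # Pool tokens and regress a 2-D location at this iteration.
        preds.append(tokens.mean(axis=0) @ W_head)
    return preds

# Usage: each iteration emits a refined (x, y) estimate.
rng = np.random.default_rng(0)
preds = recurrent_localize(
    rng.standard_normal((4, 8)),    # 4 tokens, dim 8
    rng.standard_normal((16, 8)),   # 4x4 reference grid, flattened
    rng.standard_normal((8, 2)),    # toy head
)
print(len(preds), preds[-1].shape)  # 3 predictions, each shape (2,)
```

The design point this illustrates is that, unlike one-shot regression, each pass conditions the tokens on what they attended to previously, so later predictions can correct earlier errors.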