🤖 AI Summary
Cross-view object geo-localization (CVOGL) aims to precisely localize a target object in high-resolution satellite imagery given a ground-level query image with point-based spatial cues. Existing approaches formulate CVOGL as a one-shot detection task, rendering them vulnerable to cross-view feature noise and leaving no mechanism for iterative refinement. This paper proposes ReCOT, the first method to recast CVOGL as a recurrent localization problem. ReCOT introduces learnable tokens that iteratively attend to reference features to refine the predicted location; distills segmentation priors from SAM to sharpen semantic guidance at no extra inference cost; and designs a Reference Feature Enhancement Module (RFEM) with hierarchical attention to emphasize object-relevant regions and improve matching robustness. Evaluated on standard benchmarks, ReCOT achieves state-of-the-art performance with 60% fewer parameters than the previous state of the art, demonstrating significant gains in both localization accuracy and efficiency.
📝 Abstract
Cross-view object geo-localization (CVOGL) aims to determine the location of a specific object in high-resolution satellite imagery given a query image with a point prompt. Existing approaches treat CVOGL as a one-shot detection task, directly regressing object locations from aggregated cross-view information, but they are vulnerable to feature noise and lack mechanisms for error correction. In this paper, we propose ReCOT, a Recurrent Cross-view Object geo-localization Transformer, which reformulates CVOGL as a recurrent localization task. ReCOT introduces a set of learnable tokens that encode task-specific intent from the query image and prompt embeddings, and iteratively attend to the reference features to refine the predicted location. To enhance this recurrent process, we incorporate two complementary modules: (1) a knowledge distillation strategy that transfers segmentation priors from the Segment Anything Model (SAM) to provide clearer semantic guidance without additional inference cost, and (2) a Reference Feature Enhancement Module (RFEM) that applies hierarchical attention to emphasize object-relevant regions in the reference features. Extensive experiments on standard CVOGL benchmarks demonstrate that ReCOT achieves state-of-the-art (SOTA) performance while reducing parameters by 60% compared to previous SOTA approaches.
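The core recurrence described above — learnable tokens repeatedly attending to reference (satellite) features and regressing an updated location — can be sketched in miniature. This is an illustrative NumPy toy, not the paper's implementation: `recurrent_localize`, `W_head`, and the single-head attention and mean-pooled regression head are all assumptions for clarity, standing in for ReCOT's transformer layers.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def recurrent_localize(tokens, ref_feats, W_head, n_iters=3):
    """Toy recurrent refinement loop (hypothetical, simplified).

    tokens    : (n_tokens, d) learnable tokens encoding query/prompt intent
    ref_feats : (h*w, d) flattened reference (satellite) feature map
    W_head    : (d, 2) toy regression head mapping pooled tokens to (x, y)
    Returns the list of per-iteration location predictions.
    """
    d = tokens.shape[-1]
    preds = []
    for _ in range(n_iters):
        # Cross-attention: tokens query the reference features.
        attn = softmax(tokens @ ref_feats.T / np.sqrt(d), axis=-1)
        # Residual update of the tokens with attended reference content.
        tokens = tokens + attn @ ref_feats
        # Pool tokens and regress a 2-D location at this iteration.
        preds.append(tokens.mean(axis=0) @ W_head)
    return preds

# Usage: each iteration emits a refined (x, y) estimate.
rng = np.random.default_rng(0)
preds = recurrent_localize(
    rng.standard_normal((4, 8)),    # 4 tokens, dim 8
    rng.standard_normal((16, 8)),   # 4x4 reference grid, flattened
    rng.standard_normal((8, 2)),    # toy head
)
print(len(preds), preds[-1].shape)  # 3 predictions, each shape (2,)
```

The design point this illustrates is that, unlike one-shot regression, each pass conditions the tokens on what they attended to previously, so later predictions can correct earlier errors.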