🤖 AI Summary
This work addresses a central difficulty in remote sensing image-text retrieval: achieving fine-grained cross-modal alignment and efficient search at the same time. The authors propose a “fast-then-fine” two-stage framework. In the first stage, text-agnostic coarse representations enable efficient candidate recall; in the second stage, a parameter-free text-guided interaction module performs fine-grained re-ranking. By combining multi-granularity representation learning with an alignment loss that jointly optimizes intra- and inter-modal relationships, the method significantly improves retrieval efficiency while attaining competitive accuracy, and the re-ranking module adds no learnable parameters. Extensive experiments on public remote sensing benchmarks validate the effectiveness of the proposed approach.
📝 Abstract
Remote sensing (RS) image-text retrieval plays a critical role in understanding massive RS imagery. However, the dense multi-object distribution and complex backgrounds in RS imagery make it difficult to simultaneously achieve fine-grained cross-modal alignment and efficient retrieval. Existing methods either rely on complex cross-modal interactions that lead to low retrieval efficiency, or depend on large-scale vision-language model pre-training, which requires massive data and computational resources. To address these issues, we propose a fast-then-fine (FTF) two-stage retrieval framework that decomposes retrieval into a text-agnostic recall stage for efficient candidate selection and a text-guided rerank stage for fine-grained alignment. Specifically, in the recall stage, text-agnostic coarse-grained representations are employed for efficient candidate selection; in the rerank stage, a parameter-free balanced text-guided interaction block enhances fine-grained alignment without introducing additional learnable parameters. Furthermore, an inter- and intra-modal loss is designed to jointly optimize cross-modal alignment across multi-granular representations. Extensive experiments on public benchmarks demonstrate that FTF achieves competitive retrieval accuracy while significantly improving retrieval efficiency compared with existing methods.
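The recall-then-rerank decomposition described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the form of the balanced text-guided interaction block, so the sketch assumes a simple softmax attention over image region features, which is parameter-free in the sense that it has no learnable weights. All function names, the `temperature` value, and the toy vectors are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na > 0 and nb > 0 else 0.0


def recall(text_vec, image_globals, k):
    """Stage 1 (fast): score every image by its precomputed, text-agnostic
    global embedding and keep the indices of the top-k candidates."""
    scores = [cosine(text_vec, g) for g in image_globals]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def text_guided_score(text_vec, regions, temperature=0.1):
    """Stage 2 (fine): ASSUMED interaction, not the paper's block.
    Region features are pooled with softmax weights derived from their
    similarity to the text, then the pooled vector is compared to the text."""
    sims = [cosine(text_vec, r) for r in regions]
    top = max(sims)  # subtract the max for numerical stability
    weights = [math.exp((s - top) / temperature) for s in sims]
    total = sum(weights)
    pooled = [
        sum(w / total * r[d] for w, r in zip(weights, regions))
        for d in range(len(text_vec))
    ]
    return cosine(text_vec, pooled)


def fast_then_fine(text_vec, image_globals, image_regions, k=2):
    """Recall k candidates cheaply, then rerank only those candidates."""
    candidates = recall(text_vec, image_globals, k)
    rescored = [(text_guided_score(text_vec, image_regions[i]), i) for i in candidates]
    rescored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in rescored]


# Toy data (hypothetical): 2-D embeddings, three images, two regions each.
text = [1.0, 0.0]
image_regions = [
    [[0.8, 0.6], [0.8, 0.6]],  # image 0: uniformly moderate match
    [[1.0, 0.0], [0.0, 1.0]],  # image 1: one region matches the text exactly
    [[0.0, 1.0], [0.0, 1.0]],  # image 2: no match
]
# Global embeddings are the mean of each image's region features.
image_globals = [[0.8, 0.6], [0.5, 0.5], [0.0, 1.0]]

ranking = fast_then_fine(text, image_globals, image_regions, k=2)
print(ranking)
```

On this toy data the coarse stage ranks image 0 ahead of image 1, and the text-guided stage reverses that order because image 1 contains a region that matches the query exactly. The expensive interaction runs only on the k recalled candidates, which is where the efficiency gain of a two-stage design comes from.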