Fast-then-Fine: A Two-Stage Framework with Multi-Granular Representation for Cross-Modal Retrieval in Remote Sensing

📅 2026-04-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a core difficulty in remote sensing image-text retrieval: achieving fine-grained cross-modal alignment and efficient search at the same time. The authors propose a “fast-then-fine” (FTF) two-stage framework. In the first stage, text-agnostic coarse-grained representations enable efficient candidate recall; in the second stage, a parameter-free text-guided interaction block re-ranks those candidates at a fine-grained level. Combined with multi-granular representation learning and a loss that jointly optimizes inter- and intra-modal alignment, the method improves retrieval efficiency over existing methods while keeping accuracy competitive, and the interaction block adds no learnable parameters. Experiments on public remote sensing benchmarks support these claims.

📝 Abstract

Remote sensing (RS) image-text retrieval plays a critical role in understanding massive RS imagery. However, the dense multi-object distribution and complex backgrounds in RS imagery make it difficult to simultaneously achieve fine-grained cross-modal alignment and efficient retrieval. Existing methods either rely on complex cross-modal interactions that lead to low retrieval efficiency, or depend on large-scale vision-language model pre-training, which requires massive data and computational resources. To address these issues, we propose a fast-then-fine (FTF) two-stage retrieval framework that decomposes retrieval into a text-agnostic recall stage for efficient candidate selection and a text-guided rerank stage for fine-grained alignment. Specifically, in the recall stage, text-agnostic coarse-grained representations are employed for efficient candidate selection; in the rerank stage, a parameter-free balanced text-guided interaction block enhances fine-grained alignment without introducing additional learnable parameters. Furthermore, an inter- and intra-modal loss is designed to jointly optimize cross-modal alignment across multi-granular representations. Extensive experiments on public benchmarks demonstrate that FTF achieves competitive retrieval accuracy while significantly improving retrieval efficiency compared with existing methods.
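
The recall-then-rerank flow described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the feature shapes, the softmax temperature, and the specific attention form used for the text-guided step are all assumptions made for illustration. The sketch only shows the general pattern of a cheap similarity search over text-agnostic global image vectors, followed by a parameter-free, text-conditioned re-scoring of the top-k candidates.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def recall_stage(query_vec, gallery_global, k):
    """Stage 1 (fast): rank images by cosine similarity between the text query
    vector and global image vectors that do not depend on the query text."""
    sims = l2_normalize(gallery_global) @ l2_normalize(query_vec)
    top = np.argsort(-sims)[:k]
    return top, sims[top]

def text_guided_score(word_feats, region_feats, temperature=0.1):
    """Stage 2 (fine), hypothetical form: each word attends over image regions
    with a softmax; no learnable weights are involved."""
    w = l2_normalize(word_feats)      # (n_words, d)
    r = l2_normalize(region_feats)    # (n_regions, d)
    sim = w @ r.T                     # (n_words, n_regions)
    # Numerically stable softmax over regions for each word.
    attn = np.exp((sim - sim.max(axis=1, keepdims=True)) / temperature)
    attn /= attn.sum(axis=1, keepdims=True)
    attended = l2_normalize(attn @ r)  # text-guided image vector per word
    # Average word-to-attended-region cosine similarity as the fine score.
    return float(np.mean(np.sum(w * attended, axis=1)))

def fast_then_fine(query_vec, word_feats, gallery_global, gallery_regions, k=5):
    """Recall k candidates with global vectors, then re-rank them by fine score."""
    cand, _ = recall_stage(query_vec, gallery_global, k)
    fine = np.array([text_guided_score(word_feats, gallery_regions[i]) for i in cand])
    order = np.argsort(-fine)
    return cand[order], fine[order]
```

Under these assumptions, the global image vectors can be computed once offline because they do not depend on the query, so the more expensive fine-grained scoring runs only on the k recalled candidates rather than on the whole gallery.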

Problem

Research questions and friction points this paper is trying to address.

cross-modal retrieval

remote sensing

fine-grained alignment

retrieval efficiency

multi-granular representation

Innovation

Methods, ideas, or system contributions that make the work stand out.

two-stage retrieval

multi-granular representation

parameter-free interaction