Fast-then-Fine: A Two-Stage Framework with Multi-Granular Representation for Cross-Modal Retrieval in Remote Sensing

📅 2026-04-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the difficulty of achieving both fine-grained cross-modal alignment and efficient search in remote sensing image-text retrieval. The authors propose a “fast-then-fine” (FTF) two-stage framework: in the first stage, text-agnostic coarse-grained representations enable efficient candidate recall; in the second stage, a parameter-free text-guided interaction module performs fine-grained re-ranking. A loss that jointly optimizes inter- and intra-modal relationships aligns the multi-granular representations across modalities. The reported result is competitive retrieval accuracy with significantly better retrieval efficiency than existing methods, and the interaction module adds no learnable parameters. Experiments on public remote sensing benchmarks support these claims.

📝 Abstract
Remote sensing (RS) image-text retrieval plays a critical role in understanding massive RS imagery. However, the dense multi-object distribution and complex backgrounds in RS imagery make it difficult to simultaneously achieve fine-grained cross-modal alignment and efficient retrieval. Existing methods either rely on complex cross-modal interactions that lead to low retrieval efficiency, or depend on large-scale vision-language model pre-training, which requires massive data and computational resources. To address these issues, we propose a fast-then-fine (FTF) two-stage retrieval framework that decomposes retrieval into a text-agnostic recall stage for efficient candidate selection and a text-guided rerank stage for fine-grained alignment. Specifically, in the recall stage, text-agnostic coarse-grained representations are employed for efficient candidate selection; in the rerank stage, a parameter-free balanced text-guided interaction block enhances fine-grained alignment without introducing additional learnable parameters. Furthermore, an inter- and intra-modal loss is designed to jointly optimize cross-modal alignment across multi-granular representations. Extensive experiments on public benchmarks demonstrate that FTF achieves competitive retrieval accuracy while significantly improving retrieval efficiency compared with existing methods.
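The recall-then-rerank decomposition described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify the balanced text-guided interaction block, so the rerank step below substitutes a generic parameter-free stand-in (softmax pooling of patch features weighted by their similarity to the text). All names (`recall_stage`, `text_guided_score`, `fast_then_fine`), the `temperature` value, and the array shapes are assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def recall_stage(text_vec, image_global, k):
    """Stage 1 (fast): rank images by cosine similarity between the text
    embedding and precomputed, text-agnostic global image embeddings.

    text_vec: (dim,), image_global: (num_images, dim). Returns k indices,
    best first.
    """
    scores = l2_normalize(image_global) @ l2_normalize(text_vec)
    k = min(k, scores.shape[0])
    top = np.argpartition(-scores, k - 1)[:k]   # unordered top-k
    return top[np.argsort(-scores[top])]        # order the k survivors


def text_guided_score(text_vec, patches, temperature=0.1):
    """Stage 2 (fine) scoring for one image, with no learnable weights.

    Assumed stand-in for the paper's interaction block: patches that are
    more similar to the text get larger softmax weights, and the pooled
    feature is compared with the text by cosine similarity.
    patches: (num_patches, dim).
    """
    t = l2_normalize(text_vec)
    logits = (l2_normalize(patches) @ t) / temperature
    weights = np.exp(logits - logits.max())     # numerically stable softmax
    weights /= weights.sum()
    pooled = weights @ patches                  # (dim,)
    return float(l2_normalize(pooled) @ t)


def fast_then_fine(text_vec, image_global, image_patches, k=10):
    """Recall k candidates cheaply, then rerank only those k with the
    text-guided score. image_patches: (num_images, num_patches, dim)."""
    candidates = recall_stage(text_vec, image_global, k)
    fine = np.array([text_guided_score(text_vec, image_patches[i])
                     for i in candidates])
    return candidates[np.argsort(-fine)]
```

The design point the sketch tries to capture is cost: the global image embeddings do not depend on the query, so they can be computed once and searched with a single matrix-vector product, while the more expensive text-conditioned scoring runs on only `k` candidates rather than the whole gallery.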
Problem

Research questions and friction points this paper is trying to address.

cross-modal retrieval
remote sensing
fine-grained alignment
retrieval efficiency
multi-granular representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

two-stage retrieval
multi-granular representation
parameter-free interaction
cross-modal alignment
remote sensing retrieval