RSVG-ZeroOV: Exploring a Training-Free Framework for Zero-Shot Open-Vocabulary Visual Grounding in Remote Sensing Images

📅 2025-09-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Remote sensing visual grounding (RSVG) faces a fundamental bottleneck in open-vocabulary scenarios: existing methods either rely on closed vocabularies or require costly fine-tuning. This paper proposes a zero-shot, training-free open-vocabulary RSVG framework that combines a frozen general-purpose vision-language model with diffusion model priors. Specifically, cross-modal attention generates image-text alignment heatmaps, shape priors from a diffusion model recover object structure, and an attention evolution module suppresses background interference. Evaluated on remote sensing benchmarks, the method consistently outperforms existing weakly supervised and zero-shot approaches. It achieves open-vocabulary visual grounding without any parameter updates, which the summary describes as the first such zero-shot solution, while remaining efficient and scalable.

📝 Abstract
Remote sensing visual grounding (RSVG) aims to localize objects in remote sensing images based on free-form natural language expressions. Existing approaches are typically constrained to closed-set vocabularies, limiting their applicability in open-world scenarios. While recent attempts leverage generic foundation models for open-vocabulary RSVG, they rely heavily on expensive high-quality datasets and time-consuming fine-tuning. To address these limitations, we propose RSVG-ZeroOV, a training-free framework that explores the potential of frozen generic foundation models for zero-shot open-vocabulary RSVG. Specifically, RSVG-ZeroOV comprises three key stages: (i) Overview: We utilize a vision-language model (VLM) to obtain cross-attention maps that capture semantic correlations between text queries and visual regions. (Footnote 1: In this paper, although decoder-only VLMs use self-attention over all tokens, we refer to the image-text interaction part as cross-attention to distinguish it from pure visual self-attention.) (ii) Focus: By leveraging the fine-grained modeling priors of a diffusion model (DM), we fill in gaps in the structural and shape information of objects, which is often overlooked by the VLM. (iii) Evolve: A simple yet effective attention evolution module is introduced to suppress irrelevant activations, yielding purified segmentation masks over the referred objects. Without cumbersome task-specific training, RSVG-ZeroOV offers an efficient and scalable solution. Extensive experiments demonstrate that the proposed framework consistently outperforms existing weakly-supervised and zero-shot methods.
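The three stages in the abstract can be pictured with a toy NumPy sketch. This is not the authors' implementation: the attention tensor is synthetic, all function names are hypothetical, propagation through a patch-affinity matrix is a stand-in for the diffusion-model prior, and min-max normalization with a threshold is a stand-in for the attention evolution module, whose details the abstract does not give.

```python
import numpy as np

def text_to_image_heatmap(attn, text_idx, image_idx, grid_hw):
    """Overview (assumed form): average the attention that text tokens pay
    to image tokens. attn has shape (heads, seq, seq); this layout is an
    assumption about how a decoder-only VLM exposes its attention."""
    sub = attn[:, text_idx][:, :, image_idx]      # (heads, n_text, n_image)
    return sub.mean(axis=(0, 1)).reshape(grid_hw)

def propagate(heat, affinity, steps=1):
    """Focus (stand-in): spread activation along a row-stochastic
    patch-to-patch affinity, e.g. one derived from diffusion features."""
    v = heat.reshape(-1)
    for _ in range(steps):
        v = affinity @ v
    return v.reshape(heat.shape)

def suppress_and_threshold(heat, tau=0.5, eps=1e-8):
    """Evolve (stand-in): min-max normalize, then keep strong activations."""
    norm = (heat - heat.min()) / (heat.max() - heat.min() + eps)
    return norm >= tau

# Toy input: a 4x4 patch grid (16 image tokens) followed by 2 text tokens.
attn = np.full((2, 18, 18), 0.01)
attn[:, 16:, 5] = 0.9                             # text attends to patch 5 only

# Toy affinity: patches 5 and 6 belong to the same object.
affinity = np.eye(16)
affinity[5, [5, 6]] = 0.5
affinity[6, [5, 6]] = 0.5

heat = text_to_image_heatmap(attn, [16, 17], np.arange(16), (4, 4))
refined = propagate(heat, affinity)
mask = suppress_and_threshold(refined)
print(mask.astype(int))
```

In this toy case the raw heatmap fires on a single patch, and the affinity step extends the activation to the neighboring patch that shares the same structure, which is the kind of shape completion the Focus stage is described as providing.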
Problem

Research questions and friction points this paper is trying to address.

Localizing objects in remote sensing images using natural language queries
Overcoming closed-set vocabulary limitations in open-world scenarios
Eliminating expensive dataset requirements and time-consuming fine-tuning processes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free framework with frozen foundation models
Uses vision-language model cross-attention maps to relate text queries to visual regions
Uses diffusion model priors for structural refinement
Authors
Ke Li
School of Computer Science and Technology, Xidian University, 710126, China.
Di Wang
School of Computer Science and Technology, Xidian University, 710126, China.
Ting Wang
School of Computer Science and Technology, Xidian University, 710126, China.
Fuyu Dong
School of Computer Science and Technology, Xidian University, 710126, China.
Yiming Zhang
University of California San Diego, USA.
Luyao Zhang
Duke Kunshan University
algorithmic game theory, mechanism design, machine learning, blockchain, explainable AI
Xiangyu Wang
Professor, Curtin University
Civil Engineering, Building Information Modeling, Smart City, Automation and Robotics, Smart
Shaofeng Li
Southeast University
AI Security, Backdoor Attacks
Quan Wang
School of Computer Science and Technology, Xidian University, 710126, China.