RSVG-ZeroOV: Exploring a Training-Free Framework for Zero-Shot Open-Vocabulary Visual Grounding in Remote Sensing Images

📅 2025-09-23

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Remote sensing visual grounding (RSVG) faces a fundamental bottleneck in open-vocabulary scenarios: existing methods either rely on closed vocabularies or require costly fine-tuning. This paper proposes the first zero-shot, training-free open-vocabulary RSVG framework, which combines frozen general-purpose vision-language models with diffusion model priors. Specifically, cross-modal attention generates image-text alignment heatmaps; shape priors from a diffusion model recover object structure; and an attention evolution module suppresses background interference. Evaluated on remote sensing benchmarks, the method consistently outperforms existing weakly supervised and zero-shot approaches. It produces open-vocabulary grounding results without any parameter updates, which keeps it efficient and scalable.

📝 Abstract

Remote sensing visual grounding (RSVG) aims to localize objects in remote sensing images based on free-form natural language expressions. Existing approaches are typically constrained to closed-set vocabularies, limiting their applicability in open-world scenarios. While recent attempts leverage generic foundation models for open-vocabulary RSVG, they rely heavily on expensive high-quality datasets and time-consuming fine-tuning. To address these limitations, we propose RSVG-ZeroOV, a training-free framework that explores the potential of frozen generic foundation models for zero-shot open-vocabulary RSVG. Specifically, RSVG-ZeroOV comprises three key stages: (i) Overview: We utilize a vision-language model (VLM) to obtain cross-attention maps that capture semantic correlations between text queries and visual regions. (Footnote 1: In this paper, although decoder-only VLMs use self-attention over all tokens, we refer to the image-text interaction part as cross-attention to distinguish it from pure visual self-attention.) (ii) Focus: By leveraging the fine-grained modeling priors of a diffusion model (DM), we fill in gaps in the structural and shape information of objects, which the VLM often overlooks. (iii) Evolve: A simple yet effective attention evolution module is introduced to suppress irrelevant activations, yielding purified segmentation masks over the referred objects. Without cumbersome task-specific training, RSVG-ZeroOV offers an efficient and scalable solution. Extensive experiments demonstrate that the proposed framework consistently outperforms existing weakly-supervised and zero-shot methods.
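
The three stages can be illustrated with a toy NumPy sketch. This is not the authors' implementation: the function names, the affinity-propagation step used for Focus, and the mean-thresholding loop used for Evolve are all assumptions chosen only to make the data flow concrete. The attention tensors are hand-built here; in practice they would have to be extracted from a frozen VLM and a frozen diffusion model.

```python
import numpy as np

def normalize(x):
    """Min-max scale an array to [0, 1]; a constant array maps to zeros."""
    lo, hi = x.min(), x.max()
    return np.zeros_like(x, dtype=float) if hi == lo else (x - lo) / (hi - lo)

def overview(text_to_image_attn):
    """Stage (i), assumed form: average text-token-to-patch attention into one heatmap.
    text_to_image_attn has shape (num_text_tokens, H, W)."""
    return normalize(text_to_image_attn.mean(axis=0))

def focus(heatmap, affinity):
    """Stage (ii), assumed form: spread the heatmap through a patch affinity matrix.
    affinity has shape (H*W, H*W) and stands in for a diffusion-model shape prior."""
    h, w = heatmap.shape
    return normalize((affinity @ heatmap.ravel()).reshape(h, w))

def evolve(heatmap, steps=2):
    """Stage (iii), assumed stand-in: zero out activations below the mean of the
    surviving activations a few times, then binarize into a mask."""
    m = heatmap.copy()
    for _ in range(steps):
        active = m[m > 0]
        if active.size == 0:
            break
        m[m < active.mean()] = 0.0
    return m > 0

# Toy 4x4 patch grid: the referred object covers the central 2x2 block.
H = W = 4
object_idx = np.array([5, 6, 9, 10])                 # flat indices of the object patches
background_idx = np.setdiff1d(np.arange(H * W), object_idx)

# Hypothetical VLM attention: it fires on one object patch plus a weak distractor.
attn = np.zeros((2, H, W))
attn[:, 1, 1] = 1.0
attn[0, 3, 3] = 0.2

# Hypothetical affinity: patches attend uniformly within their own region.
affinity = np.zeros((H * W, H * W))
affinity[np.ix_(object_idx, object_idx)] = 1.0 / object_idx.size
affinity[np.ix_(background_idx, background_idx)] = 1.0 / background_idx.size

coarse = overview(attn)
refined = focus(coarse, affinity)
mask = evolve(refined)
```

In this toy setup the coarse heatmap highlights only a single object patch, the affinity step extends the activation to the full 2x2 object, and the thresholding step removes the weak background response. Real attention maps are far noisier, so the thresholds and number of steps here should not be read as recommended settings.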

Problem

Research questions and friction points this paper is trying to address.

Localizing objects in remote sensing images using natural language queries

Overcoming closed-set vocabulary limitations in open-world scenarios

Eliminating expensive dataset requirements and time-consuming fine-tuning processes

Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free framework with frozen foundation models

Extracts vision-language model cross-attention maps to relate text queries to visual regions

Uses diffusion model priors for structural refinement

🔎 Similar Papers

Locate Anything on Earth: Advancing Open-Vocabulary Object Detection for Remote Sensing Community