ProVG: Progressive Visual Grounding via Language Decoupling for Remote Sensing Imagery

📅 2026-04-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of grounding natural language expressions in remote sensing images, where fine-grained linguistic cues—such as spatial relations and object attributes—are often underutilized, making it difficult to distinguish visually similar targets. To overcome this limitation, the authors propose a language disentanglement strategy that decomposes referring expressions into global context, spatial relations, and object attributes. They further introduce a progressive cross-modal modulation mechanism—termed "survey–locate–verify"—to enable coarse-to-fine vision–language alignment. Integrated with multi-scale feature fusion and a language-guided calibration decoder, this approach forms a unified multi-task framework capable of both referring expression comprehension and segmentation. Extensive experiments demonstrate that the method achieves state-of-the-art performance on the RRSIS-D and RISBench benchmarks, significantly outperforming existing approaches.
📝 Abstract
Remote sensing visual grounding (RSVG) aims to localize objects in remote sensing imagery according to natural language expressions. Previous methods typically rely on sentence-level vision-language alignment, which struggles to exploit fine-grained linguistic cues, such as spatial relations and object attributes, that are crucial for distinguishing objects with similar characteristics. Importantly, these cues play distinct roles across different grounding stages and should be leveraged accordingly to provide more explicit guidance. In this work, we propose ProVG, a novel RSVG framework that improves localization accuracy by decoupling language expressions into global context, spatial relations, and object attributes. To integrate these linguistic cues, ProVG employs a simple yet effective progressive cross-modal modulator, which dynamically modulates visual attention through a survey–locate–verify scheme, enabling coarse-to-fine vision-language alignment. In addition, ProVG incorporates a cross-scale fusion module to mitigate the large scale variations in remote sensing imagery, along with a language-guided calibration decoder to refine cross-modal alignment during prediction. A unified multi-task head further enables ProVG to support both referring expression comprehension and segmentation tasks. Extensive experiments on two benchmarks, i.e., RRSIS-D and RISBench, demonstrate that ProVG consistently outperforms existing methods, achieving new state-of-the-art performance.
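The survey–locate–verify idea described in the abstract can be sketched as three successive cross-modal re-weighting passes, each conditioned on one of the decoupled language cues. The sketch below is a minimal illustration of that progressive modulation pattern, not the paper's implementation; all function names and the use of simple dot-product attention are assumptions for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_modulate(visual, cue):
    """Re-weight visual patch features by their relevance to one text cue.

    visual: (N, d) array of patch features; cue: (d,) text-cue embedding.
    A hypothetical stand-in for the paper's cross-modal modulator.
    """
    attn = softmax(visual @ cue / np.sqrt(visual.shape[-1]))  # (N,)
    return visual * attn[:, None]

def progressive_grounding(visual, global_ctx, spatial_rel, obj_attr):
    """Survey -> locate -> verify: coarse-to-fine modulation with the
    three decoupled cues (global context, spatial relation, attribute)."""
    v = cross_modal_modulate(visual, global_ctx)   # survey: coarse context pass
    v = cross_modal_modulate(v, spatial_rel)       # locate: spatial-relation cues
    v = cross_modal_modulate(v, obj_attr)          # verify: attribute cues
    return v
```

In this toy form, each pass sharpens attention over the patch features using one cue; the actual ProVG modulator operates on learned multi-scale features with a language-guided decoder on top.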
Problem

Research questions and friction points this paper is trying to address.

visual grounding
remote sensing imagery
language decoupling
spatial relations
object attributes
Innovation

Methods, ideas, or system contributions that make the work stand out.

visual grounding
language decoupling
progressive alignment
remote sensing imagery
cross-modal modulation
Ke Li
Xidian University
Ting Wang
Xidian University
Di Wang
Xidian University
Yongshan Zhu
Xidian University
Yiming Zhang
UC San Diego
Machine Learning
Tao Lei
Shaanxi University of Science and Technology
Image Processing, Machine Learning
Quan Wang
Xidian University