SATGround: A Spatially-Aware Approach for Visual Grounding in Remote Sensing

📅 2025-12-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address low visual localization accuracy in satellite imagery caused by complex backgrounds and small targets, this paper proposes a spatially aware structured localization method. The authors design a co-architecture in which a vision-language model (VLM) drives a dedicated grounding module through specialized control tokens, explicitly embedding spatial structural modeling into a general-purpose VLM. Instruction tuning jointly optimizes language understanding and spatial reasoning, supported by a new multi-task remote sensing instruction dataset. The approach is the first to enable end-to-end, control-token-driven collaboration between a VLM and a grounding module. It sets new state-of-the-art (SOTA) results across multiple remote sensing benchmarks, including a 24.8% relative improvement in visual grounding accuracy over the prior best methods, and markedly improves fine-grained object localization robustness in high-clutter remote sensing scenes.

📝 Abstract
Vision-language models (VLMs) are emerging as powerful generalist tools for remote sensing, capable of integrating information across diverse tasks and enabling flexible, instruction-based interactions via a chat interface. In this work, we enhance VLM-based visual grounding in satellite imagery by proposing a novel structured localization mechanism. Our approach involves finetuning a pretrained VLM on a diverse set of instruction-following tasks, while interfacing a dedicated grounding module through specialized control tokens for localization. This method facilitates joint reasoning over both language and spatial information, significantly enhancing the model's ability to precisely localize objects in complex satellite scenes. We evaluate our framework on several remote sensing benchmarks, consistently improving the state-of-the-art, including a 24.8% relative improvement over previous methods on visual grounding. Our results highlight the benefits of integrating structured spatial reasoning into VLMs, paving the way for more reliable real-world satellite data analysis.
Problem

Research questions and friction points this paper is trying to address.

Visual grounding in satellite imagery suffers from low localization accuracy due to complex backgrounds and small targets.
General-purpose VLMs lack the structured spatial reasoning needed to localize objects precisely in complex scenes.
Remote sensing analysis needs tighter integration of language understanding with spatial information.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Finetuning a pretrained VLM on a diverse set of instruction-following tasks
Interfacing a dedicated grounding module through specialized control tokens
Integrating structured spatial reasoning for precise object localization
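The control-token mechanism described above can be sketched as follows: the VLM emits a special token (here called `<ground>`) in its output sequence, and the hidden state at each such token is routed to a separate grounding head that regresses a bounding box. This is a minimal illustrative sketch; the token name, dimensions, `GroundingHead` class, and the box parameterization are all assumptions, not the paper's actual implementation.

```python
import numpy as np

HIDDEN_DIM = 64
GROUND_TOKEN_ID = 999  # assumed ID of the special <ground> control token

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GroundingHead:
    """Maps the VLM hidden state at a control token to a normalized box.

    Hypothetical stand-in for the paper's grounding module: a single
    linear layer regressing (cx, cy, w, h) in [0, 1] image coordinates.
    """
    def __init__(self, hidden_dim):
        self.w = rng.normal(scale=0.02, size=(hidden_dim, 4))
        self.b = np.zeros(4)

    def __call__(self, h):
        # Squash to [0, 1] so boxes are relative to image size.
        return sigmoid(h @ self.w + self.b)

def ground(token_ids, hidden_states, head):
    """Predict one bounding box per <ground> token the VLM emitted."""
    boxes = [head(hidden_states[i])
             for i, t in enumerate(token_ids) if t == GROUND_TOKEN_ID]
    return np.stack(boxes) if boxes else np.empty((0, 4))

# Toy decoded sequence: the VLM answers a query and emits one <ground> token.
token_ids = [11, 42, GROUND_TOKEN_ID, 7]
hidden_states = rng.normal(size=(len(token_ids), HIDDEN_DIM))

boxes = ground(token_ids, hidden_states, GroundingHead(HIDDEN_DIM))
print(boxes.shape)  # one box per control token: (1, 4)
```

The design point this illustrates: localization is delegated to a specialized head conditioned on the VLM's own hidden state, so the language model reasons about *where* to ground (by deciding when to emit the control token) while the grounding module handles the geometric regression.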