Bridging Semantics and Geometry: A Decoupled LVLM-SAM Framework for Reasoning Segmentation in Remote Sensing

📅 2025-12-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing reasoning segmentation frameworks for remote sensing couple linguistic reasoning with pixel-level prediction through end-to-end supervised fine-tuning, which leads to weak geometric grounding and limited cross-task generalization. This paper proposes Think2Seg-RS, a framework that decouples semantic reasoning from geometric execution: a large vision-language model (LVLM) prompter, trained with a mask-only reinforcement learning objective, generates structured geometric prompts that guide a frozen Segment Anything Model (SAM) to produce the mask. The results expose a divide between semantic-level and instance-level grounding, show that compact segmenters outperform larger ones under semantic-level supervision, and indicate that negative prompts are ineffective in heterogeneous aerial backgrounds. Think2Seg-RS achieves state-of-the-art performance on EarthReason, and its learned prompting policy transfers zero-shot to multiple referring segmentation benchmarks. Code and models are publicly released.

📝 Abstract

Large Vision-Language Models (LVLMs) hold great promise for advancing remote sensing (RS) analysis, yet existing reasoning segmentation frameworks couple linguistic reasoning and pixel prediction through end-to-end supervised fine-tuning, leading to weak geometric grounding and limited generalization across tasks. To address this, we developed Think2Seg-RS, a decoupled framework that trains an LVLM prompter to control a frozen Segment Anything Model (SAM) via structured geometric prompts. Through a mask-only reinforcement learning objective, the LVLM learns to translate abstract semantic reasoning into spatially grounded actions, achieving state-of-the-art performance on the EarthReason dataset. Remarkably, the learned prompting policy generalizes zero-shot to multiple referring segmentation benchmarks, exposing a distinct divide between semantic-level and instance-level grounding. We further found that compact segmenters outperform larger ones under semantic-level supervision, and that negative prompts are ineffective in heterogeneous aerial backgrounds. Together, these findings establish semantic-level reasoning segmentation as a new paradigm for geospatial understanding, opening the way toward unified, interpretable LVLM-driven Earth observation. Our code and model are available at https://github.com/Ricardo-XZ/Think2Seg-RS.
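
The decoupled design described above can be illustrated with a minimal sketch: the LVLM emits a structured geometric prompt as text, and a separate promptable segmenter turns it into a mask. The abstract does not specify the prompt format, so the JSON schema (`box`, `points`), the `GeometricPrompt` class, and the `stub_segmenter` function below are all hypothetical; the stub simply fills the box and stands in for a frozen SAM, which it does not call.

```python
import json
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class GeometricPrompt:
    """Hypothetical container for a structured geometric prompt."""
    box: Tuple[int, int, int, int]   # (x0, y0, x1, y1) in pixel coordinates
    points: List[Tuple[int, int]]    # positive click points as (x, y)


def parse_prompt(lvlm_text: str) -> GeometricPrompt:
    """Parse an assumed JSON prompt format from the LVLM's text output."""
    data = json.loads(lvlm_text)
    x0, y0, x1, y1 = (int(v) for v in data["box"])
    if x1 <= x0 or y1 <= y0:
        raise ValueError("box must have positive width and height")
    points = [(int(x), int(y)) for x, y in data.get("points", [])]
    return GeometricPrompt(box=(x0, y0, x1, y1), points=points)


def stub_segmenter(prompt: GeometricPrompt, height: int, width: int) -> List[List[int]]:
    """Stand-in for a frozen promptable segmenter: marks every pixel inside the box."""
    x0, y0, x1, y1 = prompt.box
    return [
        [1 if (x0 <= x < x1 and y0 <= y < y1) else 0 for x in range(width)]
        for y in range(height)
    ]


# Example: the reasoning model only produces text; geometry is executed separately.
prompt = parse_prompt('{"box": [1, 1, 3, 3], "points": [[2, 2]]}')
mask = stub_segmenter(prompt, height=4, width=4)
```

Keeping the interface to plain text plus a small schema is what lets the segmenter stay frozen: only the component that writes the prompt needs to change during training.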

Problem

Research questions and friction points this paper is trying to address.

Decouples semantic reasoning from geometric prediction in remote sensing segmentation

Addresses weak geometric grounding in vision-language models for aerial imagery

Enables zero-shot generalization across multiple referring segmentation tasks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Decoupled LVLM-SAM framework for reasoning segmentation

Mask-only reinforcement learning for geometric prompts

Semantic-level reasoning segmentation for geospatial understanding
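
As a rough illustration of what a mask-only reward could look like, the sketch below scores a predicted mask purely by its overlap with the ground-truth mask. The summary does not state the actual reward used in Think2Seg-RS; intersection-over-union is an assumption chosen here for illustration, and the names `mask_iou` and `mask_only_reward` are hypothetical.

```python
from typing import Sequence

Mask = Sequence[Sequence[int]]  # binary mask as rows of 0/1 values


def mask_iou(pred: Mask, target: Mask) -> float:
    """Intersection-over-union of two binary masks with the same shape."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks agree completely; treat that case as a perfect score.
    return inter / union if union else 1.0


def mask_only_reward(pred: Mask, target: Mask, parse_ok: bool) -> float:
    """Assumed reward: IoU of the final mask, or 0 if the prompt could not be parsed."""
    return mask_iou(pred, target) if parse_ok else 0.0


pred = [[1, 1], [0, 0]]
target = [[1, 0], [1, 0]]
reward = mask_only_reward(pred, target, parse_ok=True)  # 1 shared pixel / 3 in the union
```

A reward of this kind supervises only the final mask, so the prompter is never told which boxes or points are correct and has to discover prompts that make the frozen segmenter produce a good result.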
