🤖 AI Summary
This work addresses the domain-specific challenges of remote sensing image segmentation, such as high annotation costs and the overhead viewpoint, together with the absence of generalizable, reasoning-driven approaches. To this end, the authors propose GeoSeg, a training-free, zero-shot segmentation framework that leverages multimodal large language models (MLLMs) for precise segmentation guided by natural language instructions. The method introduces two key components: a bias-aware coordinate refinement mechanism and a dual-route prompting strategy, which together integrate semantic intent with fine-grained spatial detail. Comprehensive experiments on the newly curated GeoSeg-Bench benchmark show that GeoSeg significantly outperforms existing baselines, and ablation studies confirm the effectiveness and necessity of each proposed component.
📝 Abstract
Recent advances in MLLMs are reframing segmentation from fixed-category prediction to instruction-grounded localization. While reasoning-based segmentation has progressed rapidly in natural scenes, remote sensing lacks a generalizable solution due to the prohibitive cost of reasoning-oriented data and domain-specific challenges such as overhead viewpoints. We present GeoSeg, a zero-shot, training-free framework that bypasses the supervision bottleneck for reasoning-driven remote sensing segmentation. GeoSeg couples MLLM reasoning with precise localization via (i) bias-aware coordinate refinement, which corrects systematic grounding shifts, and (ii) a dual-route prompting mechanism, which fuses semantic intent with fine-grained spatial cues. We also introduce GeoSeg-Bench, a diagnostic benchmark of 810 image-query pairs with hierarchical difficulty levels. Experiments show that GeoSeg consistently outperforms all baselines, with extensive ablations confirming the effectiveness and necessity of each component.
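The abstract only names the bias-aware coordinate refinement, so the sketch below is a hypothetical illustration of the general idea rather than the paper's actual mechanism: it assumes the MLLM's grounding error is a systematic per-axis affine shift, fits that shift on a small calibration set, and undoes it on new predictions before they are used as segmentation prompts. All names here (`fit_affine_bias`, `refine`) are invented for this example.

```python
# Hypothetical sketch of bias-aware coordinate refinement.
# Assumption (not from the paper): the grounding shift is well modeled
# as a per-axis affine transform, true = a * pred + b.
import numpy as np

def fit_affine_bias(pred_pts: np.ndarray, true_pts: np.ndarray):
    """Fit per-axis affine parameters (a, b) by least squares on a
    small calibration set of (predicted, ground-truth) coordinates."""
    params = []
    for axis in range(2):  # x, then y
        A = np.stack([pred_pts[:, axis], np.ones(len(pred_pts))], axis=1)
        a, b = np.linalg.lstsq(A, true_pts[:, axis], rcond=None)[0]
        params.append((a, b))
    return params

def refine(pts: np.ndarray, params):
    """Apply the fitted correction to new MLLM-predicted coordinates."""
    out = pts.astype(float).copy()
    for axis, (a, b) in enumerate(params):
        out[:, axis] = a * out[:, axis] + b
    return out

# Toy usage: predictions shifted by +12 px in x and scaled 0.95 in y.
rng = np.random.default_rng(0)
true_pts = rng.uniform(0, 512, size=(20, 2))
pred_pts = true_pts.copy()
pred_pts[:, 0] += 12.0
pred_pts[:, 1] *= 0.95

params = fit_affine_bias(pred_pts, true_pts)
print(refine(pred_pts[:3], params))  # close to true_pts[:3]
```

In a pipeline of this shape, the refined points would then serve as spatial prompts for a promptable segmenter; the calibration pairs could come from any small set of images with known object locations.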