Open-Vocabulary Semantic Segmentation in Remote Sensing via Hierarchical Attention Masking and Model Composition

📅 2026-02-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the performance degradation of vision-language models such as CLIP in open-vocabulary semantic segmentation of remote sensing imagery, which the authors attribute to inappropriate interactions within self-attention layers. They propose ReSeg-CLIP, a training-free approach that combines hierarchical attention masking with model composition tailored to remote sensing. Specifically, multi-scale masks generated by SAM constrain CLIP’s self-attention interactions, while multiple remote sensing–specialized CLIP variants are merged by weighted parameter averaging, with weights derived from a representation quality assessment based on varying text prompts. Evaluated under a zero-shot setting on three remote sensing benchmark datasets, ReSeg-CLIP achieves state-of-the-art results.
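
The mask-constrained attention described in the summary above can be illustrated with a minimal sketch. It is one possible reading, not the authors' implementation: it assumes each patch token carries a segment id per scale (here hand-written, in practice derived from SAM masks) and that a patch may attend only to patches in the same segment. The names `same_segment_mask`, `masked_softmax`, and the `scales` dictionary are hypothetical.

```python
import math

def same_segment_mask(segment_ids):
    """Boolean matrix: entry [i][j] is True when patches i and j share a segment."""
    return [[a == b for b in segment_ids] for a in segment_ids]

def masked_softmax(scores, allowed):
    """Row-wise softmax over `scores`, ignoring positions where `allowed` is False."""
    out = []
    for row, keep in zip(scores, allowed):
        kept = [s for s, k in zip(row, keep) if k]
        top = max(kept)  # each patch shares a segment with itself, so `kept` is non-empty
        exps = [math.exp(s - top) if k else 0.0 for s, k in zip(row, keep)]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

# Hypothetical segment ids for 4 patch tokens at two scales (assumption:
# real ids would come from SAM masks resampled to the patch grid).
scales = {
    "coarse": [0, 0, 1, 1],
    "fine": [0, 1, 2, 2],
}

# Toy uniform attention logits; a real model would use q @ k.T / sqrt(d).
scores = [[0.0] * 4 for _ in range(4)]

# One attention map per scale; how the scales are assigned to layers or
# combined is not specified in the text above and is left open here.
attention = {
    name: masked_softmax(scores, same_segment_mask(ids))
    for name, ids in scales.items()
}
```

At the coarse scale, patch 0 splits its attention between patches 0 and 1; at the fine scale it attends only to itself, which shows how finer masks restrict interactions more tightly.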

📝 Abstract
In this paper, we propose ReSeg-CLIP, a new training-free open-vocabulary semantic segmentation method for remote sensing (RS) data. To compensate for the problems that vision-language models such as CLIP exhibit in semantic segmentation, which are caused by inappropriate interactions within the self-attention layers, we introduce a hierarchical scheme that uses masks generated by SAM to constrain these interactions at multiple scales. We also present a model composition approach that averages the parameters of multiple RS-specific CLIP variants, using a new weighting scheme that evaluates representational quality with varying text prompts. Our method achieves state-of-the-art results across three RS benchmarks without additional training.
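
The parameter averaging mentioned in the abstract can be sketched as a weighted average over model weights. This is a minimal illustration under stated assumptions, not the paper's method: the abstract does not say how the quality scores are computed, so `quality_scores` holds placeholder values, and `compose_models` and the toy parameter dictionaries are hypothetical.

```python
def compose_models(state_dicts, quality_scores):
    """Average parameters across models, weighted by normalized quality scores.

    state_dicts: list of {parameter_name: flat list of floats}, all with the
                 same names and shapes (assumption: same CLIP architecture).
    quality_scores: one non-negative score per model (placeholder values; the
                    prompt-based scoring itself is not reproduced here).
    """
    total = sum(quality_scores)
    weights = [score / total for score in quality_scores]
    merged = {}
    for name in state_dicts[0]:
        # Group the i-th value of this parameter across all models.
        columns = zip(*(sd[name] for sd in state_dicts))
        merged[name] = [
            sum(w * v for w, v in zip(weights, column)) for column in columns
        ]
    return merged, weights

# Two toy "models" with a single two-element parameter each.
model_a = {"proj": [1.0, 2.0]}
model_b = {"proj": [3.0, 6.0]}

merged, weights = compose_models([model_a, model_b], quality_scores=[1.0, 3.0])
```

Because the weights are normalized to sum to one, the merged parameters stay within the range spanned by the individual models, and a model with a higher score pulls the result toward its own parameters.
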
Problem

Research questions and friction points this paper is trying to address.

Open-Vocabulary Semantic Segmentation
Remote Sensing
Vision Language Models
Self-Attention
Semantic Segmentation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Open-Vocabulary Semantic Segmentation
Hierarchical Attention Masking
Model Composition
Remote Sensing
CLIP
Mohammadreza Heidarianbaei
Institute of Photogrammetry and GeoInformation, Leibniz University Hannover, Germany
Mareike Dorozynski
Institute of Photogrammetry and GeoInformation, Leibniz University Hannover, Germany
Hubert Kanyamahanga
Institute of Photogrammetry and GeoInformation, Leibniz University Hannover, Germany
Max Mehltretter
Institute of Photogrammetry and GeoInformation, Leibniz Universität Hannover
Photogrammetry, Computer Vision, 3D Reconstruction
Franz Rottensteiner
Institute of Photogrammetry and GeoInformation, Leibniz Universität Hannover
Photogrammetry, Remote Sensing