Open-Vocabulary Semantic Segmentation Network Integrating Object-Level Label and Scene-Level Semantic Features for Multimodal Remote Sensing Images

📅 2026-04-27

📈 Citations: 0

✨ Influential: 0

career value

173K/year

🤖 AI Summary

This work addresses the challenge that existing remote sensing image semantic segmentation methods struggle to leverage non-visual textual information, resulting in a semantic gap between visual features and true scene semantics. To bridge this gap, the authors propose TSMNet, the first framework to introduce multi-granularity textual supervision into remote sensing segmentation. TSMNet employs a dual-branch text encoder to separately capture scene-level semantics and object-level labels, and incorporates a text-guided visual-semantic fusion module to enable end-to-end open-vocabulary semantic segmentation. The approach establishes a novel, interpretable paradigm with strong cross-domain generalization capabilities and introduces two new multimodal remote sensing datasets. Extensive experiments demonstrate that TSMNet significantly outperforms current state-of-the-art models across diverse geographic regions and sensor modalities, achieving both high segmentation accuracy and robust generalization.

Technology Category

Application Category

📝 Abstract

Semantic segmentation of multi-modal remote sensing imagery plays a pivotal role in land use/land cover (LULC) mapping, environmental monitoring, and precision earth observation. Current multi-modal approaches mainly focus on integrating complementary visual modalities, yet neglect the incorporating of non-visual textual data - a rich source of knowledge that can bridge semantic gaps between visual patterns and real-world concepts. To address this limitation, we propose TSMNet, a text supervised multi-modal open vocabulary semantic segmentation network that synergistically integrates textual supervision with visual representation for open-vocabulary semantic segmentation. Unlike conventional multi-modal segmentation frameworks, TSMNet introduces a dual-branch text encoder to extract both scene-level semantic and object-level label information from various textual data, enabling dynamic cross-modal fusion. These text-derived features dynamically interact with visual embeddings through the proposed text-guided visual semantic fusion module, enabling domain-aware feature refinement and human-interpretable decision-making. To verify our method, we innovatively construct two new multi-modal datasets, and carry out extensive experiments to make a comprehensive comparison between the proposed method and other state-of-the-art (SOTA) semantic segmentation models. Results demonstrate that TSMNet achieves superior segmentation accuracy while exhibiting robust generalization capabilities across diverse geographical and sensor-specific scenarios. This work establishes a new paradigm for explainable remote sensing analysis, demonstrating that textual knowledge integration significantly enhances model generalizability. The source code will be available at https://github.com/yeyuanxin110/TSMNet

Problem

Research questions and friction points this paper is trying to address.

open-vocabulary semantic segmentation

multimodal remote sensing

textual supervision

semantic gap

land use/land cover mapping

Innovation

Methods, ideas, or system contributions that make the work stand out.

open-vocabulary segmentation

text-supervised learning

multimodal remote sensing