Open-Vocabulary Semantic Segmentation Network Integrating Object-Level Label and Scene-Level Semantic Features for Multimodal Remote Sensing Images

📅 2026-04-27

🤖 AI Summary
This work addresses the problem that existing semantic segmentation methods for remote sensing images make little use of non-visual textual information, which leaves a semantic gap between visual features and the real-world meaning of a scene. To bridge this gap, the authors propose TSMNet, a framework that brings textual supervision at two granularities (scene level and object level) into remote sensing segmentation. TSMNet uses a dual-branch text encoder to capture scene-level semantics and object-level labels separately, and a text-guided visual semantic fusion module that lets these text features interact with visual embeddings for open-vocabulary semantic segmentation. The authors also construct two new multimodal remote sensing datasets. According to the reported experiments, TSMNet achieves higher segmentation accuracy than the state-of-the-art models it is compared against and generalizes robustly across geographic regions and sensor types.

📝 Abstract
Semantic segmentation of multi-modal remote sensing imagery plays a pivotal role in land use/land cover (LULC) mapping, environmental monitoring, and precision earth observation. Current multi-modal approaches mainly focus on integrating complementary visual modalities, yet neglect non-visual textual data, a rich source of knowledge that can bridge the semantic gap between visual patterns and real-world concepts. To address this limitation, we propose TSMNet, a text-supervised multi-modal network that integrates textual supervision with visual representations for open-vocabulary semantic segmentation. Unlike conventional multi-modal segmentation frameworks, TSMNet introduces a dual-branch text encoder that extracts both scene-level semantic information and object-level label information from various textual data. These text-derived features interact with visual embeddings through the proposed text-guided visual semantic fusion module, enabling domain-aware feature refinement and human-interpretable decision-making. To validate our method, we construct two new multi-modal datasets and carry out extensive experiments comparing the proposed method with other state-of-the-art (SOTA) semantic segmentation models. Results demonstrate that TSMNet achieves superior segmentation accuracy while exhibiting robust generalization across diverse geographical and sensor-specific scenarios. This work establishes a new paradigm for explainable remote sensing analysis, demonstrating that textual knowledge integration significantly enhances model generalizability. The source code will be available at https://github.com/yeyuanxin110/TSMNet
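To make the idea of text-guided open-vocabulary segmentation concrete, here is a minimal NumPy sketch. It is not the TSMNet implementation: the function names, the sigmoid gate used to inject the scene-level embedding, the temperature value, and the random toy inputs are all assumptions made for illustration. It only shows the general pattern of modulating per-pixel visual embeddings with a scene-level text vector and then classifying each pixel by cosine similarity to object-level label embeddings.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def text_guided_segment(visual, label_emb, scene_emb, temperature=0.07):
    """Assign each pixel to the closest label embedding.

    visual:    (H, W, D) per-pixel visual embeddings
    label_emb: (K, D) object-level label embeddings, one per class name
    scene_emb: (D,) scene-level embedding of a free-text description
    """
    # Hypothetical fusion step: a sigmoid gate derived from the scene
    # embedding reweights the visual channels. This is an assumption
    # for illustration, not the fusion module described in the paper.
    gate = 1.0 / (1.0 + np.exp(-scene_emb))
    fused = l2_normalize(visual * gate)
    labels = l2_normalize(label_emb)
    # Cosine similarity between every pixel and every label: (H, W, K)
    logits = fused @ labels.T / temperature
    return logits.argmax(axis=-1), logits


# Toy inputs standing in for real image and text encoder outputs.
rng = np.random.default_rng(0)
visual = rng.normal(size=(4, 4, 8))
label_emb = rng.normal(size=(3, 8))  # e.g. "building", "road", "water"
scene_emb = rng.normal(size=8)       # e.g. "a dense urban area"
mask, logits = text_guided_segment(visual, label_emb, scene_emb)
```

Because the class set enters only through the rows of `label_emb`, a new category can be queried at inference time by appending its text embedding as an extra row, which is what makes this style of classifier open-vocabulary.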
Problem

Research questions and friction points this paper is trying to address.

open-vocabulary semantic segmentation
multimodal remote sensing
textual supervision
semantic gap
land use/land cover mapping
Innovation

Methods, ideas, or system contributions that make the work stand out.

open-vocabulary segmentation
text-supervised learning
multimodal remote sensing
cross-modal fusion
semantic segmentation
Jinkun Dai
Faculty of Geosciences and Engineering, Southwest Jiaotong University, Chengdu 611756, China; State-Province Joint Engineering Laboratory of Spatial Information Technology for High-Speed Railway Safety, Southwest Jiaotong University, Chengdu 611756, China
Yuanxin Ye
Full Professor, Southwest Jiaotong University
remote sensing image processing, computer vision
Peng Tang
Meta
Multi-modal LLM, Vision Language, Computer Vision
Tengfeng Tang
Faculty of Geosciences and Engineering, Southwest Jiaotong University, Chengdu 611756, China; State-Province Joint Engineering Laboratory of Spatial Information Technology for High-Speed Railway Safety, Southwest Jiaotong University, Chengdu 611756, China
Xianping Ma
Faculty of Geosciences and Engineering, Southwest Jiaotong University, Chengdu 611756, China; State-Province Joint Engineering Laboratory of Spatial Information Technology for High-Speed Railway Safety, Southwest Jiaotong University, Chengdu 611756, China
Jing Xiao
Beijing Key Laboratory of Learning and Cognition, School of Psychology, Capital Normal University
cognitive vulnerability to depression, school psychology, cognition and learning
Mi Wang
State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan 430072, China