Toward Stable Semi-Supervised Remote Sensing Segmentation via Co-Guidance and Co-Fusion

📅 2025-12-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the error accumulation and training instability caused by pseudo-label drift in semi-supervised remote sensing image semantic segmentation, this paper proposes Co2S, a heterogeneous dual-student Vision Transformer (ViT) framework that combines dual-student collaborative training, multi-granularity feature interaction, and text-driven guidance. Its core contributions are: (1) an explicit-implicit semantic co-guidance mechanism that jointly leverages CLIP text embeddings and learnable queries, enforcing explicit semantic priors while enabling implicit semantic modeling; and (2) a CLIP- and DINOv3-driven global-local feature collaboration strategy that improves robustness and cross-domain generalization. Evaluated on six mainstream remote sensing datasets under diverse annotation ratios and scene conditions, the proposed approach consistently achieves state-of-the-art performance, outperforming existing methods in both segmentation accuracy and training stability.
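As a rough illustration of the dual-student idea summarized above, the minimal PyTorch sketch below pairs two toy segmentation students (stand-ins for the CLIP- and DINOv3-initialized ViT branches) and trains each on the other's confident pseudo-labels. All module names, shapes, and the confidence threshold are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of cross pseudo-label supervision between two
# heterogeneous students; in the paper the students are ViT backbones initialized from
# CLIP and DINOv3, which this toy CNN stand-in does not reproduce.
import torch
import torch.nn as nn
import torch.nn.functional as F


class StudentSegmenter(nn.Module):
    """Toy stand-in for one student (backbone + per-pixel classification head)."""

    def __init__(self, in_ch=3, embed_dim=64, num_classes=6):
        super().__init__()
        # A real student would load pretrained CLIP or DINOv3 ViT weights here.
        self.backbone = nn.Sequential(
            nn.Conv2d(in_ch, embed_dim, 3, padding=1), nn.GELU(),
            nn.Conv2d(embed_dim, embed_dim, 3, padding=1), nn.GELU(),
        )
        self.head = nn.Conv2d(embed_dim, num_classes, 1)

    def forward(self, x):
        return self.head(self.backbone(x))  # (B, C, H, W) logits


def cross_pseudo_label_loss(student_a, student_b, unlabeled, conf_thresh=0.9):
    """Each student learns from the other's confident pseudo-labels on unlabeled data."""
    logits_a, logits_b = student_a(unlabeled), student_b(unlabeled)
    with torch.no_grad():
        conf_a, pseudo_a = logits_a.softmax(1).max(1)  # per-pixel confidence and label
        conf_b, pseudo_b = logits_b.softmax(1).max(1)
    mask_a = (conf_a > conf_thresh).float()  # trust only confident pixels
    mask_b = (conf_b > conf_thresh).float()
    loss_a = (F.cross_entropy(logits_a, pseudo_b, reduction="none") * mask_b).sum() / mask_b.sum().clamp(min=1.0)
    loss_b = (F.cross_entropy(logits_b, pseudo_a, reduction="none") * mask_a).sum() / mask_a.sum().clamp(min=1.0)
    return loss_a + loss_b


if __name__ == "__main__":
    a, b = StudentSegmenter(), StudentSegmenter()
    print(cross_pseudo_label_loss(a, b, torch.randn(2, 3, 64, 64)))
```

Because the two students start from different priors, their mistakes are less likely to coincide, which is the intuition behind using cross supervision to dampen pseudo-label drift.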

📝 Abstract
Semi-supervised remote sensing (RS) image semantic segmentation offers a promising solution to alleviate the burden of exhaustive annotation, yet it fundamentally struggles with pseudo-label drift, a phenomenon where confirmation bias leads to the accumulation of errors during training. In this work, we propose Co2S, a stable semi-supervised RS segmentation framework that synergistically fuses priors from vision-language models and self-supervised models. Specifically, we construct a heterogeneous dual-student architecture comprising two distinct ViT-based vision foundation models initialized with pretrained CLIP and DINOv3 to mitigate error accumulation and pseudo-label drift. To effectively incorporate these distinct priors, an explicit-implicit semantic co-guidance mechanism is introduced that utilizes text embeddings and learnable queries to provide explicit and implicit class-level guidance, respectively, thereby jointly enhancing semantic consistency. Furthermore, a global-local feature collaborative fusion strategy is developed to effectively fuse the global contextual information captured by CLIP with the local details produced by DINOv3, enabling the model to generate highly precise segmentation results. Extensive experiments on six popular datasets demonstrate the superiority of the proposed method, which consistently achieves leading performance across various partition protocols and diverse scenarios. Project page is available at https://xavierjiezou.github.io/Co2S/.
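To make the explicit-implicit co-guidance idea from the abstract more concrete, here is a minimal sketch assuming flattened ViT patch features and placeholder CLIP text embeddings: fixed text embeddings supply the explicit class prior, learnable queries model classes implicitly, and the fused class tokens score each patch by similarity. The shapes, the additive fusion, and the attention design are illustrative assumptions rather than the paper's exact module.

```python
# A minimal sketch, under assumed shapes, of explicit-implicit semantic co-guidance:
# frozen text embeddings act as explicit class prototypes, learnable queries model
# classes implicitly, and both guide the dense patch features.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SemanticCoGuidance(nn.Module):
    def __init__(self, num_classes=6, dim=256):
        super().__init__()
        # Explicit guidance: class-name text embeddings from a model like CLIP,
        # kept fixed here (random placeholders in this sketch).
        self.register_buffer("text_embed", torch.randn(num_classes, dim))
        # Implicit guidance: learnable per-class queries refined by attention.
        self.queries = nn.Parameter(torch.randn(num_classes, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, feats):
        # feats: (B, N, dim) flattened patch features from the vision backbone.
        b = feats.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        q, _ = self.attn(q, feats, feats)                # implicit class tokens
        class_tokens = q + self.text_embed.unsqueeze(0)  # add the explicit text prior
        # Per-patch class logits via cosine similarity between tokens and features.
        logits = torch.einsum("bnd,bkd->bnk",
                              F.normalize(feats, dim=-1),
                              F.normalize(class_tokens, dim=-1))
        return logits  # (B, N, num_classes)


if __name__ == "__main__":
    guide = SemanticCoGuidance()
    patch_feats = torch.randn(2, 196, 256)   # e.g. a 14x14 ViT patch grid
    print(guide(patch_feats).shape)          # torch.Size([2, 196, 6])
```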
Problem

Research questions and friction points this paper is trying to address.

Mitigates pseudo-label drift in semi-supervised remote sensing segmentation
Fuses vision-language and self-supervised priors for semantic consistency
Integrates global context and local details to improve segmentation accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Heterogeneous dual-student architecture with CLIP and DINOv3
Explicit-implicit semantic co-guidance using text embeddings and queries
Global-local feature collaborative fusion of global context and local details (see the sketch after this list)
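Below is a hedged sketch of the global-local collaborative fusion idea: aligned patch features from a global-context (CLIP-like) branch and a local-detail (DINOv3-like) branch are merged with a simple learned pixel-wise gate. The gating design and all names are assumptions for illustration, not the paper's exact fusion module.

```python
# Illustrative sketch of fusing global context from a CLIP-style encoder with local
# detail features from a DINOv3-style encoder via a learned gate.
import torch
import torch.nn as nn


class GlobalLocalFusion(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.proj_global = nn.Linear(dim, dim)
        self.proj_local = nn.Linear(dim, dim)
        # Per-position gate deciding how much global vs. local signal to keep.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, feat_global, feat_local):
        # feat_global, feat_local: (B, N, dim) aligned patch features from each branch.
        g = self.proj_global(feat_global)
        l = self.proj_local(feat_local)
        alpha = self.gate(torch.cat([g, l], dim=-1))
        return alpha * g + (1 - alpha) * l  # fused features passed to the decoder


if __name__ == "__main__":
    fuse = GlobalLocalFusion()
    f_clip = torch.randn(2, 196, 256)   # global-context branch (CLIP-like)
    f_dino = torch.randn(2, 196, 256)   # local-detail branch (DINOv3-like)
    print(fuse(f_clip, f_dino).shape)   # torch.Size([2, 196, 256])
```

A gate of this kind lets the fused representation lean on global context in large homogeneous regions while preserving fine local detail near object boundaries.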
👥 Authors

Yi Zhou
School of Computer Technology and Application, Qinghai University, Xining, China

Xuechao Zou
Key Lab of Big Data & Artificial Intelligence in Transportation (Ministry of Education), School of Computer Science & Technology, Beijing Jiaotong University, Beijing, China

Shun Zhang
Key Lab of Big Data & Artificial Intelligence in Transportation (Ministry of Education), School of Computer Science & Technology, Beijing Jiaotong University, Beijing, China

Kai Li
Department of Computer Science and Technology, Tsinghua University, Beijing, China

Shiying Wang
Yale University

Jingming Chen
School of Computer Technology and Application, Qinghai University, Xining, China

Congyan Lang
Beijing Jiaotong University

Tengfei Cao
School of Computer Technology and Application, Qinghai University, Xining, China

Pin Tao
Department of Computer Science and Technology, Tsinghua University, Beijing, China

Yuanchun Shi