Context Patch Fusion with Class Token Enhancement for Weakly Supervised Semantic Segmentation

📅 2026-01-21
🏛️ Computer Modeling in Engineering & Sciences
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a limitation of weakly supervised semantic segmentation: existing methods neglect the complex contextual dependencies among image patches, which leads to incomplete local representations and suboptimal segmentation performance. To overcome this, we propose Context Patch Fusion with Class Token Enhancement (CPF-CTE), a framework that integrates context-aware modeling with category-specific semantics. Specifically, we introduce a Contextual-Fusion Bidirectional Long Short-Term Memory (CF-BiLSTM) module to capture bidirectional spatial dependencies among image patches, and we incorporate learnable class tokens that dynamically encode and refine class-specific semantics. Relying solely on image-level labels, CPF-CTE consistently surpasses prior weakly supervised methods on both the PASCAL VOC 2012 and MS COCO 2014 benchmarks.

📝 Abstract
Weakly Supervised Semantic Segmentation (WSSS), which relies only on image-level labels, has attracted significant attention for its cost-effectiveness and scalability. Existing methods mainly enhance inter-class distinctions and employ data augmentation to mitigate semantic ambiguity and reduce spurious activations. However, they often neglect the complex contextual dependencies among image patches, resulting in incomplete local representations and limited segmentation accuracy. To address these issues, we propose the Context Patch Fusion with Class Token Enhancement (CPF-CTE) framework, which exploits contextual relations among patches to enrich feature representations and improve segmentation. At its core, the Contextual-Fusion Bidirectional Long Short-Term Memory (CF-BiLSTM) module captures spatial dependencies between patches and enables bidirectional information flow, yielding a more comprehensive understanding of spatial correlations. This strengthens feature learning and segmentation robustness. Moreover, we introduce learnable class tokens that dynamically encode and refine class-specific semantics, enhancing discriminative capability. By effectively integrating spatial and semantic cues, CPF-CTE produces richer and more accurate representations of image content. Extensive experiments on PASCAL VOC 2012 and MS COCO 2014 validate that CPF-CTE consistently surpasses prior WSSS methods.
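The two ingredients named in the abstract, bidirectional context flow across patches and learnable class tokens, can be illustrated with a small sketch. This is not the authors' implementation. It assumes that patches are flattened into a 1-D sequence, that a standard LSTM cell is run once in each direction, and that class tokens score patches by a plain dot product. The abstract does not specify how CPF-CTE actually combines these pieces. The names `TinyLSTMCell`, `bilstm_context`, and `patch_class_scores` are hypothetical, and all weights are random rather than trained.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyLSTMCell:
    """Standard LSTM cell with random weights (illustration only, untrained)."""

    def __init__(self, input_dim, hidden_dim, rng):
        self.hidden_dim = hidden_dim
        # One weight row per gate unit over the concatenated [x, h] vector.
        # Gate order: input (i), forget (f), candidate (g), output (o).
        self.w = [[rng.uniform(-0.1, 0.1) for _ in range(input_dim + hidden_dim)]
                  for _ in range(4 * hidden_dim)]
        self.b = [0.0] * (4 * hidden_dim)

    def step(self, x, h, c):
        xh = x + h  # list concatenation
        z = [sum(w * v for w, v in zip(row, xh)) + b
             for row, b in zip(self.w, self.b)]
        n = self.hidden_dim
        i = [sigmoid(v) for v in z[0:n]]
        f = [sigmoid(v) for v in z[n:2 * n]]
        g = [math.tanh(v) for v in z[2 * n:3 * n]]
        o = [sigmoid(v) for v in z[3 * n:4 * n]]
        c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
        h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
        return h_new, c_new


def run_direction(cell, patches, reverse=False):
    """Run the cell over the patch sequence; out[t] is the hidden state at patch t."""
    order = range(len(patches) - 1, -1, -1) if reverse else range(len(patches))
    h = [0.0] * cell.hidden_dim
    c = [0.0] * cell.hidden_dim
    out = [None] * len(patches)
    for t in order:
        h, c = cell.step(patches[t], h, c)
        out[t] = h
    return out


def bilstm_context(patches, fwd_cell, bwd_cell):
    """Concatenate forward and backward hidden states for each patch."""
    fwd = run_direction(fwd_cell, patches)
    bwd = run_direction(bwd_cell, patches, reverse=True)
    return [f + b for f, b in zip(fwd, bwd)]


def patch_class_scores(context, class_tokens):
    """Dot product of every contextualised patch with every class token
    (an assumed scoring rule, not taken from the paper)."""
    return [[sum(p * t for p, t in zip(patch, token)) for token in class_tokens]
            for patch in context]


rng = random.Random(0)
num_patches, patch_dim, hidden_dim, num_classes = 6, 4, 3, 2

# Stand-in patch embeddings; a real pipeline would take these from an image encoder.
patches = [[rng.uniform(-1, 1) for _ in range(patch_dim)] for _ in range(num_patches)]
fwd_cell = TinyLSTMCell(patch_dim, hidden_dim, rng)
bwd_cell = TinyLSTMCell(patch_dim, hidden_dim, rng)

context = bilstm_context(patches, fwd_cell, bwd_cell)

# One vector per class; in training these would be learnable parameters.
class_tokens = [[rng.uniform(-1, 1) for _ in range(2 * hidden_dim)]
                for _ in range(num_classes)]
scores = patch_class_scores(context, class_tokens)

print(len(context), len(context[0]), len(scores), len(scores[0]))
```

Each patch's output concatenates a forward state, which summarises the patches before it, with a backward state, which summarises the patches after it. Changing the last patch therefore alters the backward half of the first patch's representation while leaving its forward half untouched.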
Problem

Research questions and friction points this paper is trying to address.

Weakly Supervised Semantic Segmentation
Contextual Dependencies
Patch Relations
Semantic Ambiguity
Image-level Labels
Innovation

Methods, ideas, or system contributions that make the work stand out.

Context Patch Fusion
Class Token Enhancement
CF-BiLSTM
Weakly Supervised Semantic Segmentation
Spatial Context Modeling
Yiyang Fu
School of Cyber Science and Engineering, Wuxi University, Wuxi 214105, China
Hui Li
Xiamen University
Information Retrieval, Data Mining, Data Management
Wangyu Wu
The University of Liverpool