Mitigating Spurious Correlations in Weakly Supervised Semantic Segmentation via Cross-architecture Consistency Regularization

📅 2025-07-29
🤖 AI Summary
Weakly supervised semantic segmentation (WSSS) with image-level labels suffers from spurious correlations (e.g., the spatial co-occurrence of industrial smoke and chimneys), incomplete foreground coverage, and imprecise boundaries. Method: We propose a cross-architecture consistency regularization framework built on a teacher-student setup that combines CNNs and ViTs. A knowledge transfer loss aligns the internal representations of the two architectures without additional supervision, targeting the models' inherent bias toward co-occurring context. Post-processing of the pseudo masks addresses partial coverage and further improves pseudo mask quality. Contribution/Results: The target domain is industrial smoke segmentation, where emissions are spatially coupled with chimneys. The framework is designed to reduce co-occurrence bias, improve foreground completeness, and refine boundary localization in settings with strong inter-class spatial dependency.

📝 Abstract
Scarcity of pixel-level labels is a significant challenge in practical scenarios. In specific domains like industrial smoke, acquiring such detailed annotations is particularly difficult and often requires expert knowledge. To alleviate this, weakly supervised semantic segmentation (WSSS) has emerged as a promising approach. However, due to the supervision gap and inherent bias in models trained with only image-level labels, existing WSSS methods suffer from limitations such as incomplete foreground coverage, inaccurate object boundaries, and spurious correlations, especially in our domain, where emissions are always spatially coupled with chimneys. Previous solutions typically rely on additional priors or external knowledge to mitigate these issues, but they often lack scalability and fail to address the model's inherent bias toward co-occurring context. To address this, we propose a novel WSSS framework that directly targets the co-occurrence problem without relying on external supervision. Unlike prior methods that adopt a single network, we employ a teacher-student framework that combines CNNs and ViTs. We introduce a knowledge transfer loss that enforces cross-architecture consistency by aligning internal representations. Additionally, we incorporate post-processing techniques to address partial coverage and further improve pseudo mask quality.

Problem

Research questions and friction points this paper is trying to address.

Addressing spurious correlations in weakly supervised segmentation
Reducing reliance on external knowledge for WSSS
Improving pseudo mask quality via cross-architecture consistency

Innovation

Methods, ideas, or system contributions that make the work stand out.

Teacher-student framework with CNNs and ViTs
Cross-architecture consistency via representation alignment
Post-processing for pseudo mask quality improvement
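
The representation alignment named above can be illustrated with a minimal sketch. It is not the paper's loss: the abstract does not give the formula, so cosine distance is used here as a stand-in, and the names `cosine_similarity` and `alignment_loss` are hypothetical. The sketch also assumes that teacher and student features have already been flattened to one vector per spatial location on the same grid and projected to a shared dimension. The source does not say which architecture plays the teacher. Plain Python is used so the example runs without a deep learning framework; in practice this would be a few tensor operations.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector, eps: float = 1e-8) -> float:
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must share the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # eps guards against division by zero for all-zero vectors.
    return dot / (norm_a * norm_b + eps)


def alignment_loss(teacher: Sequence[Vector], student: Sequence[Vector]) -> float:
    """Mean (1 - cosine similarity) over matching spatial locations.

    Assumption (not from the paper): both inputs hold one feature vector per
    location on the same grid, already projected to a shared dimension.
    """
    if len(teacher) != len(student) or not teacher:
        raise ValueError("need the same non-zero number of locations")
    total = sum(1.0 - cosine_similarity(t, s) for t, s in zip(teacher, student))
    return total / len(teacher)


# Toy example: 2 locations, 3-dimensional features (made-up numbers).
teacher_feats = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
aligned_student = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]     # same directions
misaligned_student = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # orthogonal directions

loss_aligned = alignment_loss(teacher_feats, aligned_student)
loss_misaligned = alignment_loss(teacher_feats, misaligned_student)
```

Cosine distance ignores feature magnitude, which is one reason it is a common choice when the two networks produce activations on different scales. The aligned student yields a loss near zero and the orthogonal one a loss near one, so minimizing this term pulls the student's per-location features toward the teacher's directions.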