🤖 AI Summary
Weakly supervised semantic segmentation (WSSS) with image-level labels suffers from spurious correlations (e.g., spatial co-occurrence of industrial smoke and chimneys), incomplete foreground coverage, and imprecise boundaries. Method: We propose a cross-architecture consistency regularization framework, introducing the first CNN–ViT teacher–student architecture. A cross-architecture knowledge distillation loss enforces representation alignment without additional supervision, mitigating models’ inherent bias toward contextual co-occurrence. Combined with consistency regularization and pseudo-mask post-processing, this significantly improves pseudo-label quality. Contribution/Results: On challenging domains such as industrial smoke segmentation—characterized by strong contextual coupling—the method effectively alleviates co-occurrence bias, enhances foreground completeness, and refines boundary localization. Our approach establishes a novel paradigm for WSSS in contexts with high inter-class spatial dependency.
📝 Abstract
Scarcity of pixel-level labels is a significant challenge in practical scenarios. In specific domains like industrial smoke, acquiring such detailed annotations is particularly difficult and often requires expert knowledge. To alleviate this, weakly supervised semantic segmentation (WSSS) has emerged as a promising approach. However, due to the supervision gap and the inherent bias of models trained with only image-level labels, existing WSSS methods suffer from limitations such as incomplete foreground coverage, inaccurate object boundaries, and spurious correlations. These problems are especially pronounced in our domain, where emissions are always spatially coupled with chimneys.
Previous solutions typically rely on additional priors or external knowledge to mitigate these issues, but they often lack scalability and fail to address the model's inherent bias toward co-occurring context. To close this gap, we propose a novel WSSS framework that directly targets the co-occurrence problem without relying on external supervision. Unlike prior methods that adopt a single network, we employ a teacher-student framework that combines CNNs and ViTs. We introduce a knowledge transfer loss that enforces cross-architecture consistency by aligning the internal representations of the two networks. Additionally, we incorporate post-processing techniques to address partial coverage and further improve pseudo-mask quality.
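To make the idea of aligning internal representations across architectures more concrete, the following is a minimal sketch of one possible alignment term. The abstract does not specify the form of the knowledge transfer loss, so everything here is an assumption for illustration: the cosine-based distance, the hypothetical function names `cosine_similarity` and `alignment_loss`, and the premise that teacher and student features have already been projected to a common dimension and spatial resolution.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    eps = 1e-8  # guards against division by zero for all-zero vectors
    return dot / (norm_u * norm_v + eps)


def alignment_loss(teacher_feats, student_feats):
    """Mean (1 - cosine similarity) over matching spatial locations.

    Illustrative assumption, not the paper's loss: each argument is a list
    of feature vectors, one per spatial location, already projected to a
    shared dimension so that the two architectures can be compared.
    """
    if len(teacher_feats) != len(student_feats):
        raise ValueError("teacher and student must cover the same locations")
    total = sum(
        1.0 - cosine_similarity(t, s)
        for t, s in zip(teacher_feats, student_feats)
    )
    return total / len(teacher_feats)


# Toy features for two spatial locations (hypothetical values).
teacher = [[1.0, 0.0], [0.0, 2.0]]
student_aligned = [[2.0, 0.0], [0.0, 1.0]]      # same directions as the teacher
student_misaligned = [[0.0, 1.0], [1.0, 0.0]]   # orthogonal to the teacher

loss_aligned = alignment_loss(teacher, student_aligned)
loss_misaligned = alignment_loss(teacher, student_misaligned)
```

In this sketch the loss is close to zero when the student's features point in the same direction as the teacher's and close to one when they are orthogonal. Using a scale-invariant distance is one plausible design choice here, since CNN and ViT activations can differ substantially in magnitude.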