🤖 AI Summary
In weakly supervised few-shot semantic segmentation (WFSS), meta-learning methods suffer from semantic homogenization caused by homogeneous support-query pair sampling and identical network architectures. To address this, the authors propose TLG, a homologous but heterogeneous network that models support and query samples from dual perspectives. TLG introduces a heterogeneous visual aggregation (HA) module and a heterogeneous transfer (HT) module to enhance semantic complementarity and suppress noise, and integrates heterogeneous CLIP text priors to improve multimodal generalization. Notably, TLG is reported as the first weakly supervised approach, relying solely on image-level annotations, to surpass fully supervised pixel-level methods under the same backbone. On Pascal-5i and COCO-20i, it achieves absolute mIoU gains of 13.2% and 9.7%, respectively, while using only 1/24 of the parameters of current state-of-the-art models.
📝 Abstract
Meta-learning methods uniformly sample homogeneous support-query pairs, characterized by the same categories and similar attributes, and extract useful inductive biases through identical network architectures. However, this identical network design leads to excessive semantic homogenization. To address this, we propose TLG, a novel homologous but heterogeneous network. Treating support-query pairs as dual perspectives, we introduce a heterogeneous visual aggregation (HA) module to enhance complementarity while preserving semantic commonality. To further reduce semantic noise and amplify the uniqueness of heterogeneous semantics, we design a heterogeneous transfer (HT) module. Finally, we propose heterogeneous CLIP (HC) textual information to enhance the generalization capability of multimodal models. On the weakly supervised few-shot semantic segmentation (WFSS) task, with only 1/24 of the parameters of existing state-of-the-art models, TLG achieves a 13.2% improvement on Pascal-5i and a 9.7% improvement on COCO-20i. To the best of our knowledge, TLG is also the first weakly supervised (image-level) model to outperform fully supervised (pixel-level) models under the same backbone architecture. The code is available at https://github.com/jarch-ma/TLG.
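The "homologous but heterogeneous" idea can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the branch weights, nonlinearities, aggregation rule, and the stand-in text embedding are all illustrative assumptions. It only shows the shape of the design: two distinct branches over the same input space (heterogeneous views of homologous data), an aggregation step that keeps both shared and view-specific semantics, and a text-prior re-weighting in the spirit of the heterogeneous CLIP component.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # illustrative feature dimension, not from the paper

# Two branches with different weights and nonlinearities stand in for the
# homologous-but-heterogeneous design: same input space, distinct
# parameterizations, so support and query views stay complementary.
W_s = rng.normal(size=(D, D))
W_q = rng.normal(size=(D, D))
support_feat = np.tanh(rng.normal(size=(1, D)) @ W_s)        # support branch
query_feat = np.maximum(rng.normal(size=(1, D)) @ W_q, 0.0)  # query branch

# Aggregation sketch: keep shared semantics (elementwise product)
# alongside view-specific differences (elementwise difference).
aggregated = np.concatenate(
    [support_feat * query_feat, support_feat - query_feat], axis=-1
)

# Text-prior sketch: re-weight visual features by similarity to a
# class-name text embedding (here just a random unit vector, standing in
# for a CLIP text encoding).
text_embed = rng.normal(size=(2 * D,))
text_embed /= np.linalg.norm(text_embed)
sim = aggregated @ text_embed / (np.linalg.norm(aggregated) + 1e-8)
fused = aggregated + sim * text_embed  # text-guided residual

print(fused.shape)  # (1, 32)
```

A real system would replace the random projections with learned backbone features and the random vector with CLIP text embeddings of class names; the sketch only makes the dual-branch-plus-text-prior data flow concrete.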