Through the Looking Glass: A Dual Perspective on Weakly-Supervised Few-Shot Segmentation

📅 2025-08-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
In weakly supervised few-shot semantic segmentation, meta-learning methods suffer from semantic homogenization caused by homogeneous support-query pair sampling and monolithic network architectures. To address this, we propose a Homologous-Heterogeneous Network (HHN) framework that models support and query samples from dual perspectives. HHN introduces Heterogeneous Visual Aggregation (HVA) and Heterogeneous Transfer (HT) modules to enhance semantic complementarity, suppress noise, and integrate heterogeneous CLIP text priors to improve multimodal generalization. Notably, our method is the first weakly supervised approach, relying solely on image-level annotations, to surpass fully supervised pixel-level methods. On Pascal-5i and COCO-20i, it achieves absolute mIoU gains of 13.2% and 9.7%, respectively, while using only 1/24 the parameters of current state-of-the-art models, demonstrating a strong balance between accuracy and efficiency.

📝 Abstract
Meta-learning aims to uniformly sample homogeneous support-query pairs, characterized by the same categories and similar attributes, and extract useful inductive biases through identical network architectures. However, this identical network design results in over-semantic homogenization. To address this, we propose a novel homologous but heterogeneous network. By treating support-query pairs as dual perspectives, we introduce heterogeneous visual aggregation (HA) modules to enhance complementarity while preserving semantic commonality. To further reduce semantic noise and amplify the uniqueness of heterogeneous semantics, we design a heterogeneous transfer (HT) module. Finally, we propose heterogeneous CLIP (HC) textual information to enhance the generalization capability of multimodal models. In the weakly-supervised few-shot semantic segmentation (WFSS) task, with only 1/24 of the parameters of existing state-of-the-art models, TLG achieves a 13.2% improvement on Pascal-5i and a 9.7% improvement on COCO-20i. To the best of our knowledge, TLG is also the first weakly supervised (image-level) model that outperforms fully supervised (pixel-level) models under the same backbone architectures. The code is available at https://github.com/jarch-ma/TLG.
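The dual-perspective idea in the abstract (homologous inputs processed by deliberately heterogeneous branches, fused with a text prior) can be illustrated with a minimal sketch. This is NOT the paper's implementation: all dimensions, weights, and the aggregation/prior steps below are hypothetical stand-ins chosen only to show the data flow.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes -- illustrative only, not the paper's actual dimensions.
H = W = 8   # spatial resolution of the feature maps
D = 32      # visual feature dimension (text prior assumed same size)

def branch(x, weight):
    # A stand-in "heterogeneous" branch: same input space (homologous),
    # but distinct projection weights break architectural symmetry.
    return x @ weight

# Homologous inputs: support and query features from a shared backbone.
support_feat = rng.standard_normal((H * W, D))
query_feat = rng.standard_normal((H * W, D))

# Heterogeneous branches with different weights.
z_s = branch(support_feat, rng.standard_normal((D, D)))
z_q = branch(query_feat, rng.standard_normal((D, D)))

# Schematic visual aggregation: fuse the support prototype into every
# query location to inject complementary semantics.
prototype = z_s.mean(axis=0)      # (D,)
fused = z_q + prototype           # broadcast over all H*W locations

# Schematic CLIP-style text prior: a class text embedding scores each
# query location; thresholding yields a crude foreground mask.
text_prior = rng.standard_normal(D)
scores = fused @ text_prior                       # (H*W,) logits
mask = (scores > scores.mean()).reshape(H, W)     # boolean mask

print(mask.shape)  # (8, 8)
```

In the actual TLG model the aggregation and transfer steps are learned modules (HA/HT) and the text prior comes from a CLIP text encoder; here they are reduced to a mean prototype and a dot product purely to show how the two perspectives and the text prior interact.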
Problem

Research questions and friction points this paper is trying to address.

Addresses over-semantic homogenization in meta-learning networks
Proposes heterogeneous modules to enhance complementarity in segmentation
Improves weakly-supervised few-shot segmentation with fewer parameters
Innovation

Methods, ideas, or system contributions that make the work stand out.

Homologous heterogeneous network for segmentation
Heterogeneous visual aggregation modules enhance complementarity
Heterogeneous CLIP text boosts multimodal generalization
Jiaqi Ma
School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
Guo-Sen Xie
Professor, Nanjing University of Science and Technology
Computer Vision, Machine Learning
Fang Zhao
School of Intelligence Science and Technology, Nanjing University, Suzhou 215163, China
Zechao Li
School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China