🤖 AI Summary
In weakly supervised few-shot semantic segmentation (WFSS), meta-learning methods suffer from semantic homogenization caused by homogeneous support-query pair sampling and identical network architectures. To address this, the authors propose TLG, a homologous but heterogeneous network that models support and query samples from dual perspectives. TLG introduces a heterogeneous visual aggregation (HA) module and a heterogeneous transfer (HT) module to enhance semantic complementarity and suppress noise, and integrates heterogeneous CLIP text priors to improve multimodal generalization. Notably, TLG is reported as the first weakly supervised approach, relying solely on image-level annotations, to surpass fully supervised pixel-level methods under the same backbone. On Pascal-5i and COCO-20i, it achieves absolute mIoU gains of 13.2% and 9.7%, respectively, while using only 1/24 of the parameters of current state-of-the-art models.
📝 Abstract
Meta-learning methods uniformly sample homogeneous support-query pairs, characterized by the same categories and similar attributes, and extract useful inductive biases through identical network architectures. However, this identical network design leads to excessive semantic homogenization. To address this, we propose TLG, a novel homologous but heterogeneous network. Treating support-query pairs as dual perspectives, we introduce a heterogeneous visual aggregation (HA) module to enhance complementarity while preserving semantic commonality. To further reduce semantic noise and amplify the uniqueness of heterogeneous semantics, we design a heterogeneous transfer (HT) module. Finally, we propose heterogeneous CLIP (HC) textual information to enhance the generalization capability of multimodal models. On the weakly supervised few-shot semantic segmentation (WFSS) task, with only 1/24 of the parameters of existing state-of-the-art models, TLG achieves a 13.2% improvement on Pascal-5i and a 9.7% improvement on COCO-20i. To the best of our knowledge, TLG is also the first weakly supervised (image-level) model to outperform fully supervised (pixel-level) models under the same backbone architecture. The code is available at https://github.com/jarch-ma/TLG.
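The "homologous but heterogeneous" idea can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the branch weights, nonlinearities, aggregation rule, and the stand-in text embedding are all illustrative assumptions. It only shows the shape of the design: two distinct branches over the same input space (heterogeneous views of homologous data), an aggregation step that keeps both shared and view-specific semantics, and a text-prior re-weighting in the spirit of the heterogeneous CLIP component.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # illustrative feature dimension, not from the paper

# Two branches with different weights and nonlinearities stand in for the
# homologous-but-heterogeneous design: same input space, distinct
# parameterizations, so support and query views stay complementary.
W_s = rng.normal(size=(D, D))
W_q = rng.normal(size=(D, D))
support_feat = np.tanh(rng.normal(size=(1, D)) @ W_s)        # support branch
query_feat = np.maximum(rng.normal(size=(1, D)) @ W_q, 0.0)  # query branch

# Aggregation sketch: keep shared semantics (elementwise product)
# alongside view-specific differences (elementwise difference).
aggregated = np.concatenate(
    [support_feat * query_feat, support_feat - query_feat], axis=-1
)

# Text-prior sketch: re-weight visual features by similarity to a
# class-name text embedding (here just a random unit vector, standing in
# for a CLIP text encoding).
text_embed = rng.normal(size=(2 * D,))
text_embed /= np.linalg.norm(text_embed)
sim = aggregated @ text_embed / (np.linalg.norm(aggregated) + 1e-8)
fused = aggregated + sim * text_embed  # text-guided residual

print(fused.shape)  # (1, 32)
```

A real system would replace the random projections with learned backbone features and the random vector with CLIP text embeddings of class names; the sketch only makes the dual-branch-plus-text-prior data flow concrete.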