ZeroDiff: Solidified Visual-Semantic Correlation in Zero-Shot Learning

📅 2024-06-05
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
In zero-shot learning (ZSL), severe spurious visual-semantic correlations arise when seen-class samples are scarce, critically degrading generative ZSL performance. To address this, we propose ZeroDiff, a generative framework integrating diffusion-based data augmentation, supervised contrastive representation learning, and Wasserstein-distance-based mutual learning across multiple feature discriminators. It mitigates spurious correlations through three complementary mechanisms: (i) diffusion augmentation that expands limited data into noised variants to curb generator overfitting; (ii) supervised-contrastive (SC) representations that dynamically characterize each sample to support visual feature generation; and (iii) multiple feature discriminators that evaluate generated features from the perspectives of pre-defined semantics, SC-based representations, and the diffusion process. ZeroDiff achieves significant improvements over state-of-the-art methods on three standard ZSL benchmarks and remains stable even under extreme data scarcity, e.g., with only 1–2 samples per seen class. The source code is publicly available.
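The Wasserstein-distance-based discriminator evaluation mentioned above can be illustrated with the standard WGAN critic and generator objectives. This is a minimal sketch under assumed names (`critic_loss`, `generator_loss` are illustrative, not from the paper), and it does not reproduce the paper's multi-view, mutual-learning setup:

```python
import numpy as np

def critic_loss(real_scores, fake_scores):
    """WGAN critic objective: maximize E[f(real)] - E[f(fake)], an estimate of
    the Wasserstein-1 distance when f is 1-Lipschitz; we minimize the negative."""
    return -(np.mean(real_scores) - np.mean(fake_scores))

def generator_loss(fake_scores):
    """Generator objective: make generated features score like real ones,
    i.e. maximize E[f(fake)] (minimize its negative)."""
    return -np.mean(fake_scores)

# Toy scores from a critic that rates real features above generated ones.
real = np.array([1.0, 1.2, 0.9])
fake = np.array([-1.0, -0.8, -1.1])
```

A critic that separates real from generated features well drives `critic_loss` strongly negative, and the generator then updates to raise its `fake` scores.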

📝 Abstract
Zero-shot Learning (ZSL) aims to enable classifiers to identify unseen classes. This is typically achieved by generating visual features for unseen classes based on learned visual-semantic correlations from seen classes. However, most current generative approaches heavily rely on having a sufficient number of samples from seen classes. Our study reveals that a scarcity of seen class samples results in a marked decrease in performance across many generative ZSL techniques. We argue, quantify, and empirically demonstrate that this decline is largely attributable to spurious visual-semantic correlations. To address this issue, we introduce ZeroDiff, an innovative generative framework for ZSL that incorporates diffusion mechanisms and contrastive representations to enhance visual-semantic correlations. ZeroDiff comprises three key components: (1) Diffusion augmentation, which naturally transforms limited data into an expanded set of noised data to mitigate generative model overfitting; (2) Supervised-contrastive (SC)-based representations that dynamically characterize each limited sample to support visual feature generation; and (3) Multiple feature discriminators employing a Wasserstein-distance-based mutual learning approach, evaluating generated features from various perspectives, including pre-defined semantics, SC-based representations, and the diffusion process. Extensive experiments on three popular ZSL benchmarks demonstrate that ZeroDiff not only achieves significant improvements over existing ZSL methods but also maintains robust performance even with scarce training data. Our codes are available at https://github.com/FouriYe/ZeroDiff_ICLR25.
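The diffusion augmentation described in the abstract transforms limited data into an expanded set of noised data. A minimal sketch of such forward-diffusion noising, using the standard DDPM closed form (function name, feature dimension, and schedule parameters are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def diffusion_noise_augment(features, t, T=1000, beta_start=1e-4, beta_end=0.02, rng=None):
    """Forward-diffusion augmentation: blend clean features x_0 with Gaussian
    noise at timestep t via the DDPM closed form
        x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps.
    Larger t yields noisier variants of the same sample."""
    rng = np.random.default_rng() if rng is None else rng
    betas = np.linspace(beta_start, beta_end, T)   # linear noise schedule
    alpha_bar = np.cumprod(1.0 - betas)            # cumulative signal weight
    eps = rng.standard_normal(features.shape)
    return np.sqrt(alpha_bar[t]) * features + np.sqrt(1.0 - alpha_bar[t]) * eps

# Expand a handful of seen-class feature vectors into noised copies at
# several timesteps (hypothetical 2048-d features, e.g. from a CNN backbone).
x0 = np.random.default_rng(0).standard_normal((5, 2048))
augmented = [diffusion_noise_augment(x0, t, rng=np.random.default_rng(t))
             for t in (50, 200, 400)]
```

Each noised copy stays anchored to its clean sample, so a generator trained on them sees many perturbed views of scarce data rather than memorizing the few originals.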
Problem

Research questions and friction points this paper is trying to address.

ZeroDiff strengthens visual-semantic correlation in zero-shot learning.
It addresses the performance decline caused by scarce seen-class samples.
It integrates diffusion and contrastive mechanisms for robust feature generation.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion augmentation for data expansion
Supervised-contrastive based representations
Wasserstein-distance mutual learning
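The supervised-contrastive representations listed above can be sketched with the standard SupCon objective: pull together L2-normalized features that share a class label, push apart the rest. A NumPy illustration (function name and temperature are assumptions; this is not the paper's training code):

```python
import numpy as np

def supcon_loss(feats, labels, tau=0.1):
    """Supervised contrastive loss over a batch of feature vectors.
    For each anchor, averages the log-probability of its same-class
    positives under a temperature-scaled softmax over all other samples."""
    z = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = z @ z.T / tau                          # scaled cosine similarities
    n = len(labels)
    self_mask = np.eye(n, dtype=bool)
    sim = np.where(self_mask, -np.inf, sim)      # exclude self-comparisons
    # Numerically stable log-softmax over the other samples for each anchor.
    m = sim.max(axis=1, keepdims=True)
    log_prob = sim - (m + np.log(np.exp(sim - m).sum(axis=1, keepdims=True)))
    labels = np.asarray(labels)
    pos = (labels[:, None] == labels[None, :]) & ~self_mask
    has_pos = pos.sum(axis=1) > 0                # anchors with >= 1 positive
    per_anchor = np.where(pos, log_prob, 0.0).sum(axis=1)[has_pos] / pos.sum(axis=1)[has_pos]
    return float(-per_anchor.mean())
```

Features clustered consistently with their labels yield a much lower loss than the same features with mismatched labels, which is what lets the learned representations characterize each scarce sample.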
👥 Authors
Zihan Ye
Department of Computer Science, University of Liverpool, Ashton Street, Liverpool, L69 3BX, UK; School of Advanced Technology, Xi’an Jiaotong-Liverpool University, 111 Ren’ai Road, Suzhou, 215123, Jiangsu, China
Shreyank N. Gowda
Department of Engineering Sciences, University of Oxford, Old Road Campus Research Building, Oxford, OX1 2JD, UK
Xiaowei Huang
Professor of Computer Science, University of Liverpool
AI Safety and Security, Verification, Trustworthy AI, Formal Methods, Explainable AI
Haotian Xu
School of Advanced Technology, Xi’an Jiaotong-Liverpool University, 111 Ren’ai Road, Suzhou, 215123, Jiangsu, China
Yaochu Jin
School of Engineering, Westlake University, No.600 Dunyu Road, Hangzhou, 310030, Zhejiang, China
Kaizhu Huang
Professor, Duke Kunshan University
Generalization & Robustness, Statistical Learning Theory, Trustworthy AI
Xiaobo Jin
School of Advanced Technology, Xi’an Jiaotong-Liverpool University, 111 Ren’ai Road, Suzhou, 215123, Jiangsu, China