🤖 AI Summary
Existing compositional zero-shot learning (CZSL) methods rely on simplistic composition-to-prototype mappings, failing to model semantic subset partitioning; moreover, their all-to-one cross-modal matching overlooks fine-grained distinctions among state-object compositions, limiting image-composition alignment accuracy. To address these limitations, we propose a Mixture-of-Experts (MoE)-based framework. Its core contributions are: (1) a domain-expert adaptation mechanism enabling token-aware primitive representation learning; and (2) a semantic variant alignment strategy that supports fine-grained recognition of state-object compositions. The framework integrates an MoE architecture, cross-modal alignment, semantic variant selection, and deep optimization techniques. Extensive experiments on three benchmark datasets, under both closed-world and open-world settings, demonstrate substantial improvements over state-of-the-art methods, validating its effectiveness in semantic generalization and precise compositional alignment.
📝 Abstract
Compositional Zero-Shot Learning (CZSL) investigates the compositional generalization capacity to recognize unseen state-object pairs from learned primitive concepts. Existing CZSL methods typically derive primitive features through a simple composition-prototype mapping, which is suboptimal for a set of individuals that can be divided into distinct semantic subsets. Moreover, all-to-one cross-modal primitive matching neglects compositional divergence within identical states or objects, limiting fine-grained image-composition alignment. In this study, we propose EVA, a Mixture-of-Experts Semantic Variant Alignment framework for CZSL. Specifically, we introduce domain-expert adaptation, leveraging multiple experts to achieve token-aware learning and to model high-quality primitive representations. To enable accurate compositional generalization, we further present semantic variant alignment, which selects the semantically relevant representation for image-primitive matching. Our method significantly outperforms other state-of-the-art CZSL methods on three popular benchmarks in both closed- and open-world settings, demonstrating the efficacy of the proposed approach.
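The two mechanisms above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: all shapes, the linear experts, the softmax gating, and the cosine-similarity variant selection are illustrative assumptions showing (1) token-aware mixing of domain experts and (2) matching an image feature against multiple semantic variants of a primitive rather than a single prototype.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# --- Token-aware mixture of domain experts (hypothetical shapes) ---
num_tokens, dim, num_experts = 4, 8, 3
tokens = rng.normal(size=(num_tokens, dim))          # primitive tokens
W_gate = rng.normal(size=(dim, num_experts))         # gating network
experts = rng.normal(size=(num_experts, dim, dim))   # one linear expert each

gates = softmax(tokens @ W_gate)                        # (tokens, experts)
expert_out = np.einsum('td,edh->teh', tokens, experts)  # per-expert outputs
primitives = np.einsum('te,teh->th', gates, expert_out) # gate-weighted mix

# --- Semantic variant alignment: score the image feature against several
# variants of one primitive and keep the most semantically relevant one ---
num_variants = 5
variants = rng.normal(size=(num_variants, dim))      # variants of one state
image_feat = rng.normal(size=(dim,))

cos = variants @ image_feat / (
    np.linalg.norm(variants, axis=1) * np.linalg.norm(image_feat))
best_variant = variants[np.argmax(cos)]              # variant used for matching
score = cos.max()                                    # alignment score
```

Because the gate is a per-token softmax, each token receives its own convex combination of expert outputs, which is the sense in which the representation learning is "token-aware"; variant selection replaces a single prototype with the closest member of a semantic subset.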