🤖 AI Summary
Existing compositional zero-shot learning (CZSL) methods rely on simplistic composition-to-prototype mappings, failing to model semantic subset partitioning; moreover, their all-to-one cross-modal matching overlooks fine-grained distinctions among state-object compositions, limiting image-composition alignment accuracy. To address these limitations, we propose a Mixture-of-Experts (MoE)-based framework. Its core contributions are: (1) a domain-expert adaptation mechanism enabling token-aware primitive representation learning; and (2) a semantic variant alignment strategy that supports fine-grained recognition of state-object compositions. The framework integrates an MoE architecture, cross-modal alignment, semantic variant selection, and deep optimization techniques. Extensive experiments on three benchmark datasets, under both closed-world and open-world settings, demonstrate substantial improvements over state-of-the-art methods, validating its effectiveness in semantic generalization and precise compositional alignment.
📝 Abstract
Compositional Zero-Shot Learning (CZSL) investigates the compositional generalization capacity to recognize unseen state-object pairs from learned primitive concepts. Existing CZSL methods typically derive primitive features through a simple composition-prototype mapping, which is suboptimal for a set of individuals that can be divided into distinct semantic subsets. Moreover, all-to-one cross-modal primitive matching neglects compositional divergence within identical states or objects, limiting fine-grained image-composition alignment. In this study, we propose EVA, a Mixture-of-Experts Semantic Variant Alignment framework for CZSL. Specifically, we introduce domain-expert adaptation, leveraging multiple experts to achieve token-aware learning and to model high-quality primitive representations. To enable accurate compositional generalization, we further present semantic variant alignment, which selects the semantically relevant representation for image-primitive matching. Our method significantly outperforms other state-of-the-art CZSL methods on three popular benchmarks in both closed- and open-world settings, demonstrating the efficacy of the proposed approach.
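The two mechanisms above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: all shapes, the linear experts, the softmax gating, and the cosine-similarity variant selection are illustrative assumptions showing (1) token-aware mixing of domain experts and (2) matching an image feature against multiple semantic variants of a primitive rather than a single prototype.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# --- Token-aware mixture of domain experts (hypothetical shapes) ---
num_tokens, dim, num_experts = 4, 8, 3
tokens = rng.normal(size=(num_tokens, dim))          # primitive tokens
W_gate = rng.normal(size=(dim, num_experts))         # gating network
experts = rng.normal(size=(num_experts, dim, dim))   # one linear expert each

gates = softmax(tokens @ W_gate)                        # (tokens, experts)
expert_out = np.einsum('td,edh->teh', tokens, experts)  # per-expert outputs
primitives = np.einsum('te,teh->th', gates, expert_out) # gate-weighted mix

# --- Semantic variant alignment: score the image feature against several
# variants of one primitive and keep the most semantically relevant one ---
num_variants = 5
variants = rng.normal(size=(num_variants, dim))      # variants of one state
image_feat = rng.normal(size=(dim,))

cos = variants @ image_feat / (
    np.linalg.norm(variants, axis=1) * np.linalg.norm(image_feat))
best_variant = variants[np.argmax(cos)]              # variant used for matching
score = cos.max()                                    # alignment score
```

Because the gate is a per-token softmax, each token receives its own convex combination of expert outputs, which is the sense in which the representation learning is "token-aware"; variant selection replaces a single prototype with the closest member of a semantic subset.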