EVA: Mixture-of-Experts Semantic Variant Alignment for Compositional Zero-Shot Learning

📅 2025-06-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing compositional zero-shot learning (CZSL) methods rely on simplistic composition-to-prototype mappings, failing to model semantic subset partitioning; moreover, their all-to-one cross-modal matching overlooks fine-grained distinctions among state-object compositions, limiting image-composition alignment accuracy. To address these limitations, we propose a Mixture-of-Experts (MoE)-based framework. Its core contributions are: (1) a domain-expert adaptive mechanism enabling token-aware primitive representation learning; and (2) a semantic variant alignment strategy that supports fine-grained recognition of state-object compositions. The framework integrates an MoE architecture, cross-modal alignment, semantic variant selection, and deep optimization techniques. Extensive experiments on three benchmark datasets, under both closed-world and open-world settings, demonstrate substantial improvements over state-of-the-art methods, validating its effectiveness in semantic generalization and precise compositional alignment.

📝 Abstract
Compositional Zero-Shot Learning (CZSL) investigates compositional generalization capacity to recognize unknown state-object pairs based on learned primitive concepts. Existing CZSL methods typically derive primitive features through a simple composition-prototype mapping, which is suboptimal for a set of individuals that can be divided into distinct semantic subsets. Moreover, the all-to-one cross-modal primitive matching neglects compositional divergence within identical states or objects, limiting fine-grained image-composition alignment. In this study, we propose EVA, a Mixture-of-Experts Semantic Variant Alignment framework for CZSL. Specifically, we introduce domain-expert adaptation, leveraging multiple experts to achieve token-aware learning and model high-quality primitive representations. To enable accurate compositional generalization, we further present semantic variant alignment to select semantically relevant representations for image-primitive matching. Our method significantly outperforms other state-of-the-art CZSL methods on three popular benchmarks in both closed- and open-world settings, demonstrating the efficacy of the proposed insight.
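The abstract's token-aware learning via multiple experts can be illustrated with a minimal Mixture-of-Experts routing sketch. This is a generic illustration, not the paper's exact architecture: the gate matrix, expert weights, and dimensions below are all hypothetical, and a per-token softmax gate mixes the outputs of several expert transforms so that state- and object-related tokens can specialize.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def token_aware_moe(tokens, gate_w, expert_ws):
    """tokens: (L, D); gate_w: (D, E); expert_ws: list of E (D, D) matrices.

    Each token receives its own softmax gate over the experts; the output
    is the gate-weighted mixture of the per-expert linear transforms.
    """
    gates = softmax(tokens @ gate_w)                            # (L, E)
    outs = np.stack([tokens @ w for w in expert_ws], axis=-1)   # (L, D, E)
    return np.einsum("lde,le->ld", outs, gates)                 # (L, D)

# Toy run with hypothetical sizes: 5 tokens, dim 8, 3 experts.
L, D, E = 5, 8, 3
tokens = rng.normal(size=(L, D))
gate_w = rng.normal(size=(D, E))
expert_ws = [rng.normal(size=(D, D)) for _ in range(E)]
out = token_aware_moe(tokens, gate_w, expert_ws)
print(out.shape)  # (5, 8)
```

Because the gate is computed per token rather than per sequence, different tokens of the same composition can be served by different experts, which is the property the abstract refers to as token-aware learning.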
Problem

Research questions and friction points this paper is trying to address.

Improving recognition of unknown state-object pairs in CZSL
Enhancing fine-grained image-composition alignment in CZSL
Optimizing primitive representation learning for compositional generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts for token-aware learning
Semantic variant alignment for image-primitives matching
Domain-expert adaptation for primitive representations
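The semantic variant alignment idea listed above can be sketched as a hard selection step: among several candidate embeddings (variants) of one state or object primitive, pick the one most similar to the image embedding. The function and dimensions below are hypothetical illustrations of this selection, not the paper's implementation.

```python
import numpy as np

def l2norm(x, axis=-1):
    # Normalize vectors to unit length so dot products become cosine similarity.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def variant_align(image_emb, variant_embs):
    """image_emb: (D,); variant_embs: (V, D), V semantic variants of one primitive.

    Returns the index of the variant most aligned with the image and its
    cosine similarity, i.e. a hard variant-selection step for matching.
    """
    sims = l2norm(variant_embs) @ l2norm(image_emb)  # (V,)
    best = int(np.argmax(sims))
    return best, float(sims[best])

# Toy run with hypothetical sizes: 4 variants of a primitive, dim 16.
rng = np.random.default_rng(1)
img = rng.normal(size=16)
variants = rng.normal(size=(4, 16))
idx, sim = variant_align(img, variants)
print(idx, sim)
```

Selecting the semantically closest variant, rather than matching the image against a single shared prototype, is what allows fine-grained distinctions within identical states or objects.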
Xiao Zhang
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center for Visual Information and Applications, and Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University, Shaanxi, China
Yongqiang Ma
Wuhan University
Scientific Information Mining · Large Language Models · AI for Science
Haodong Jing
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center for Visual Information and Applications, and Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University, Shaanxi, China
Nanning Zheng
Xi'an Jiaotong University