CAMS: Towards Compositional Zero-Shot Learning via Gated Cross-Attention and Multi-Space Disentanglement

📅 2025-11-20
📈 Citations: 0
Influential: 0

🤖 AI Summary
Compositional zero-shot learning (CZSL) suffers from insufficient disentanglement of attribute and object semantics within global image features, limiting generalization to unseen compositions. To address this, we propose a CLIP-based fine-grained disentanglement framework: (1) local semantic features are extracted from high-level blocks of the image encoder to avoid information entanglement inherent in global representations; (2) a gated cross-modal attention mechanism enables joint alignment and separation of attributes and objects across multiple semantic dimensions; and (3) a multi-space disentanglement strategy suppresses background interference and enhances conceptual orthogonality. Evaluated on MIT-States, UT-Zappos, and C-GQA under both closed-world and open-world settings, our method achieves state-of-the-art performance, significantly improving recognition accuracy for unseen compositions.

📝 Abstract
Compositional zero-shot learning (CZSL) aims to learn the concepts of attributes and objects in seen compositions and to recognize their unseen compositions. Most Contrastive Language-Image Pre-training (CLIP)-based CZSL methods focus on disentangling attributes and objects by leveraging the global semantic representation obtained from the image encoder. However, this representation has limited representational capacity and does not allow for complete disentanglement of the two. To this end, we propose CAMS, which aims to extract semantic features from visual features and perform semantic disentanglement in multidimensional spaces, thereby improving generalization over unseen attribute-object compositions. Specifically, CAMS designs a Gated Cross-Attention that captures fine-grained semantic features from the high-level image encoding blocks of CLIP through a set of latent units, while adaptively suppressing background and other irrelevant information. Subsequently, it conducts Multi-Space Disentanglement to achieve disentanglement of attribute and object semantics. Experiments on three popular benchmarks (MIT-States, UT-Zappos, and C-GQA) demonstrate that CAMS achieves state-of-the-art performance in both closed-world and open-world settings. The code is available at https://github.com/ybyangjing/CAMS.
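
The gated cross-attention step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the CAMS implementation: the function name `gated_cross_attention`, the sigmoid gate applied to the attended update, the residual connection, and the omission of learned query/key/value projections are all assumptions made for brevity. The actual design is in the linked repository.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_cross_attention(latents, patches, gate_logits):
    """Hypothetical sketch: latent units query local patch features.

    latents:     K vectors of size d (the "latent units", used as queries)
    patches:     N vectors of size d (local features from a high-level block)
    gate_logits: K scalars; sigmoid(gate) scales how much attended content
                 each latent unit accepts (assumed form of the gate)
    Returns (updated_latents, attention_weights).
    """
    d = len(latents[0])
    scale = 1.0 / math.sqrt(d)
    updated, all_weights = [], []
    for q, g in zip(latents, gate_logits):
        # Scaled dot-product score of this latent against every patch.
        scores = [scale * sum(qi * pi for qi, pi in zip(q, p)) for p in patches]
        weights = softmax(scores)
        # Weighted sum of patch features (patches act as keys and values).
        attended = [sum(w * p[j] for w, p in zip(weights, patches)) for j in range(d)]
        gate = sigmoid(g)
        # Gated residual update: a gate near 0 leaves the latent unchanged.
        updated.append([qi + gate * ai for qi, ai in zip(q, attended)])
        all_weights.append(weights)
    return updated, all_weights


latents = [[1.0, 0.0], [0.0, 1.0]]
patches = [[2.0, 0.0], [0.0, 2.0], [0.1, 0.1]]  # third patch mimics weak background
out, attn = gated_cross_attention(latents, patches, gate_logits=[0.0, -20.0])
```

In this toy call the second latent receives a strongly negative gate logit, so its update is effectively switched off. This is one way a gate could suppress uninformative content.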

Problem

Research questions and friction points this paper is trying to address.

Disentangling attribute and object semantics in compositional zero-shot learning
Improving generalization over unseen attribute-object compositions
Capturing fine-grained semantic features while suppressing irrelevant information

Innovation

Methods, ideas, or system contributions that make the work stand out.

Gated Cross-Attention extracts fine-grained semantic features
Multi-Space Disentanglement separates attribute and object semantics
Semantic disentanglement improves generalization for unseen compositions
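
To make the disentangle-then-compose idea concrete, here is a toy sketch of scoring compositions from separate attribute and object spaces. It is illustrative only: the additive combination of cosine similarities, the helper names `cosine` and `predict_composition`, and the hand-written 2-D embeddings are assumptions, not details taken from the paper. The closed-world setting restricts candidates to a known list of pairs, while the open-world setting scores every attribute-object pair.

```python
import math
from itertools import product


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def predict_composition(attr_feat, obj_feat, attr_text, obj_text, candidates=None):
    """Pick the best (attribute, object) pair (hypothetical helper).

    attr_feat / obj_feat: image features in separate attribute and object spaces
    attr_text / obj_text: dicts mapping a label to its text embedding
    candidates: allowed pairs (closed world); None means all pairs (open world)
    """
    if candidates is None:
        candidates = list(product(attr_text, obj_text))

    def score(pair):
        a, o = pair
        # Assumed scoring rule: sum of per-space cosine similarities.
        return cosine(attr_feat, attr_text[a]) + cosine(obj_feat, obj_text[o])

    return max(candidates, key=score)


attr_text = {"wet": [1.0, 0.0], "dry": [0.0, 1.0]}
obj_text = {"dog": [1.0, 0.0], "road": [0.0, 1.0]}
# Toy image features: mostly "wet" in attribute space, mostly "dog" in object space.
attr_feat, obj_feat = [0.9, 0.1], [0.8, 0.2]

open_world = predict_composition(attr_feat, obj_feat, attr_text, obj_text)
closed_world = predict_composition(
    attr_feat, obj_feat, attr_text, obj_text,
    candidates=[("wet", "road"), ("dry", "dog")],
)
```

Because attributes and objects are scored independently, a pair such as ("wet", "dog") can be selected even if it never appeared together during training, provided each concept was seen in some other composition.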