CAMS: Towards Compositional Zero-Shot Learning via Gated Cross-Attention and Multi-Space Disentanglement

📅 2025-11-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Compositional zero-shot learning (CZSL) suffers from insufficient disentanglement of attribute and object semantics within global image features, limiting generalization to unseen compositions. To address this, we propose a CLIP-based fine-grained disentanglement framework: (1) local semantic features are extracted from high-level blocks of the image encoder to avoid information entanglement inherent in global representations; (2) a gated cross-modal attention mechanism enables joint alignment and separation of attributes and objects across multiple semantic dimensions; and (3) a multi-space disentanglement strategy suppresses background interference and enhances conceptual orthogonality. Evaluated on MIT-States, UT-Zappos, and C-GQA under both closed-world and open-world settings, our method achieves state-of-the-art performance, significantly improving recognition accuracy for unseen compositions.
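The two evaluation settings differ mainly in which candidate compositions the model must choose among at test time. The snippet below is a toy illustration of that difference, not code from the paper. The attribute and object names and the seen/unseen split are made up for the example. It assumes the usual CZSL convention that the closed-world setting scores only the seen compositions plus a known list of unseen test compositions, while the open-world setting scores every attribute-object pair.

```python
from itertools import product

# Hypothetical vocabulary and split, chosen only for illustration.
attributes = ["wet", "dry", "sliced"]
objects = ["dog", "apple"]
seen = {("wet", "dog"), ("dry", "dog"), ("sliced", "apple")}
unseen_test = {("wet", "apple")}

# Closed-world: candidates are the seen pairs plus the known unseen test pairs.
closed_world = seen | unseen_test

# Open-world: candidates are all attribute-object pairs, feasible or not.
open_world = set(product(attributes, objects))

# Pairs that only appear as candidates in the open-world setting.
extra_candidates = open_world - closed_world

print(len(closed_world), len(open_world), sorted(extra_candidates))
```

The open-world search space grows with the product of the two vocabularies, which is why it is generally the harder setting.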

📝 Abstract

Compositional zero-shot learning (CZSL) aims to learn the concepts of attributes and objects from seen compositions and to recognize unseen compositions of them. Most Contrastive Language-Image Pre-training (CLIP)-based CZSL methods disentangle attributes and objects using the global semantic representation produced by the image encoder. However, this representation has limited capacity and does not allow the two to be fully disentangled. To this end, we propose CAMS, which extracts semantic features from visual features and performs semantic disentanglement in multidimensional spaces, thereby improving generalization to unseen attribute-object compositions. Specifically, CAMS introduces a Gated Cross-Attention module that captures fine-grained semantic features from the high-level image encoding blocks of CLIP through a set of latent units, while adaptively suppressing background and other irrelevant information. It then applies Multi-Space Disentanglement to separate attribute and object semantics. Experiments on three popular benchmarks (MIT-States, UT-Zappos, and C-GQA) demonstrate that CAMS achieves state-of-the-art performance in both closed-world and open-world settings. The code is available at https://github.com/ybyangjing/CAMS.
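To make the idea of latent units attending to patch features more concrete, here is a minimal sketch in plain Python. It is not the CAMS implementation: the abstract does not specify the gate, so the sigmoid gate, the `gate_weights` vector, the residual connection, and the absence of learned query/key/value projections are all assumptions made for illustration. Each latent unit acts as a query over patch tokens, and a scalar gate in (0, 1) controls how much of the attended feature is mixed back into the latent.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_cross_attention(latents, patches, gate_weights, gate_bias=0.0):
    """Toy gated cross-attention (hypothetical form, not the paper's).

    latents:      L query vectors (the "latent units"), each of size d
    patches:      N patch-token vectors from an image encoder block, size d
    gate_weights: size-d vector producing a scalar gate per latent (assumed)
    """
    d = len(latents[0])
    scale = 1.0 / math.sqrt(d)
    outputs, gates = [], []
    for q in latents:
        # Scaled dot-product attention of one latent over all patches.
        weights = softmax([dot(q, p) * scale for p in patches])
        attended = [sum(w * p[i] for w, p in zip(weights, patches))
                    for i in range(d)]
        # Scalar gate in (0, 1): a small gate suppresses the attended signal.
        g = sigmoid(dot(gate_weights, attended) + gate_bias)
        outputs.append([qi + g * ai for qi, ai in zip(q, attended)])
        gates.append(g)
    return outputs, gates

# Two latent units, three patch tokens (the third is a weak, background-like patch).
latents = [[1.0, 0.0], [0.0, 1.0]]
patches = [[2.0, 0.0], [0.0, 2.0], [0.1, 0.1]]
outputs, gates = gated_cross_attention(latents, patches, gate_weights=[0.5, 0.5])
print(outputs, gates)
```

In this toy setup each latent unit ends up dominated by the patch it aligns with, and the gate scales how strongly the attended content is written back. A learned gate could, in principle, drive that contribution toward zero for background-heavy inputs.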

Problem

Research questions and friction points this paper is trying to address.

Disentangling attribute and object semantics in compositional zero-shot learning

Improving generalization over unseen attribute-object compositions

Capturing fine-grained semantic features while suppressing irrelevant information

Innovation

Methods, ideas, or system contributions that make the work stand out.

Gated Cross-Attention extracts fine-grained semantic features

Multi-Space Disentanglement separates attribute and object semantics

Semantic disentanglement improves generalization for unseen compositions

🔎 Similar Papers

Graph-guided Cross-composition Feature Disentanglement for Compositional Zero-shot Learning