🤖 AI Summary
Compositional zero-shot learning (CZSL) suffers from insufficient disentanglement of attribute and object semantics within global image features, limiting generalization to unseen compositions. To address this, we propose a CLIP-based fine-grained disentanglement framework: (1) local semantic features are extracted from high-level blocks of the image encoder to avoid information entanglement inherent in global representations; (2) a gated cross-modal attention mechanism enables joint alignment and separation of attributes and objects across multiple semantic dimensions; and (3) a multi-space disentanglement strategy suppresses background interference and enhances conceptual orthogonality. Evaluated on MIT-States, UT-Zappos, and C-GQA under both closed-world and open-world settings, our method achieves state-of-the-art performance, significantly improving recognition accuracy for unseen compositions.
📝 Abstract
Compositional zero-shot learning (CZSL) aims to learn the concepts of attributes and objects from seen compositions and to recognize unseen compositions of them. Most Contrastive Language-Image Pre-training (CLIP)-based CZSL methods disentangle attributes and objects by leveraging the global semantic representation obtained from the image encoder. However, this representation has limited capacity and does not allow the two to be fully disentangled. To address this, we propose CAMS, which extracts semantic features from visual features and performs semantic disentanglement in multidimensional spaces, thereby improving generalization to unseen attribute-object compositions. Specifically, CAMS introduces a Gated Cross-Attention module that captures fine-grained semantic features from the high-level image encoding blocks of CLIP through a set of latent units, while adaptively suppressing background and other irrelevant information. It then applies Multi-Space Disentanglement to separate attribute and object semantics. Experiments on three popular benchmarks (MIT-States, UT-Zappos, and C-GQA) demonstrate that CAMS achieves state-of-the-art performance in both closed-world and open-world settings. The code is available at https://github.com/ybyangjing/CAMS.
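To make the idea of latent units attending to patch-level features more concrete, the following is a minimal NumPy sketch of one possible gated cross-attention step. It is not taken from the CAMS repository: the function names, the sigmoid gate computed from the concatenated latent and attended features, the residual update, and all dimensions are assumptions chosen for illustration, and the random arrays stand in for CLIP patch tokens.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_params(d, rng):
    # Hypothetical projection weights; a real model would learn these.
    scale = 1.0 / np.sqrt(d)
    return {
        "Wq": rng.normal(0.0, scale, (d, d)),
        "Wk": rng.normal(0.0, scale, (d, d)),
        "Wv": rng.normal(0.0, scale, (d, d)),
        "Wg": rng.normal(0.0, scale, (2 * d, d)),
    }


def gated_cross_attention(latents, patches, params):
    """Latent units (k, d) query patch tokens (n, d).

    Assumed design: standard scaled dot-product cross-attention,
    followed by an element-wise sigmoid gate that decides how much
    of the attended patch information each latent unit absorbs.
    """
    q = latents @ params["Wq"]                                # (k, d)
    k = patches @ params["Wk"]                                # (n, d)
    v = patches @ params["Wv"]                                # (n, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)   # (k, n)
    attended = attn @ v                                       # (k, d)
    gate_in = np.concatenate([latents, attended], axis=-1)    # (k, 2d)
    gate = sigmoid(gate_in @ params["Wg"])                    # (k, d), in (0, 1)
    updated = latents + gate * attended                       # gated residual update
    return updated, attn, gate


rng = np.random.default_rng(0)
d, n_patches, n_latents = 16, 49, 4
params = init_params(d, rng)
latents = rng.normal(size=(n_latents, d))   # stand-in for learnable latent units
patches = rng.normal(size=(n_patches, d))   # stand-in for CLIP patch tokens
out, attn, gate = gated_cross_attention(latents, patches, params)
```

In this sketch a gate value near zero leaves a latent dimension unchanged, which is one simple way to suppress contributions from background patches. The attribute and object branches of the multi-space disentanglement step are not modeled here.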