AI Summary
Compositional zero-shot learning (CZSL) must model the semantic dependencies between attributes and objects, yet conventional decoupled approaches neglect intra-compositional contextual constraints and conditional associations. To address this, we propose a conditional-probability-based cross-modal framework that decomposes compositional recognition into an object likelihood and an attribute likelihood conditioned on the object. Textual descriptions guide object representation learning, while text-image cross-attention enables fine-grained semantic alignment. We further introduce a semantics-driven region-highlighting module to jointly optimize attribute and object prediction. Evaluated on multiple CZSL benchmarks, our method achieves substantial improvements in unseen-composition accuracy, demonstrating the effectiveness and generalizability of explicitly modeling contextual dependencies.
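The fine-grained alignment step described above can be illustrated with a minimal sketch of single-head scaled dot-product cross-attention, where text descriptors attend over image region features. All names, shapes, and the random features are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d = 8                                    # feature dimension (illustrative)
text_tokens = rng.normal(size=(5, d))    # stand-in for textual descriptor embeddings
img_regions = rng.normal(size=(7, d))    # stand-in for image region features

# Text-to-image cross-attention: each text token queries the image regions,
# producing a visual feature aligned to that token's semantics.
Q, K, V = text_tokens, img_regions, img_regions
weights = softmax(Q @ K.T / np.sqrt(d), axis=-1)   # (5, 7); each row sums to 1
aligned = weights @ V                              # (5, d) text-aligned visual features
```

Regions with high attention weight for a given descriptor are exactly the "semantically relevant" regions the framework highlights; in practice Q, K, and V would come from learned projections rather than raw features.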
Abstract
Compositional Zero-Shot Learning (CZSL) aims to recognize unseen combinations of known objects and attributes by leveraging knowledge from previously seen compositions. Traditional approaches primarily focus on disentangling attributes and objects, treating them as independent entities during learning. However, this independence assumption overlooks the semantic constraints and contextual dependencies within a composition. For example, certain attributes naturally pair with specific objects (e.g., "striped" applies to "zebra" or "shirt" but not to "sky" or "water"), while the same attribute can manifest differently depending on context (e.g., "young" in "young tree" vs. "young dog"). Thus, capturing attribute-object interdependence remains a fundamental yet long-overlooked challenge in CZSL. In this paper, we adopt a Conditional Probability Framework (CPF) to explicitly model attribute-object dependencies. We decompose the probability of a composition into two components: the likelihood of the object and the conditional likelihood of its attribute given that object. To enhance object feature learning, we incorporate textual descriptors to highlight semantically relevant image regions. These enhanced object features then guide attribute learning through a cross-attention mechanism, ensuring better contextual alignment. By jointly optimizing the object likelihood and the conditional attribute likelihood, our method effectively captures compositional dependencies and generalizes well to unseen compositions. Extensive experiments on multiple CZSL benchmarks demonstrate the superiority of our approach. Code is available here.
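The probability decomposition at the heart of the framework can be sketched numerically: the score of a composition (a, o) factors as P(a, o | image) = P(o | image) · P(a | o, image). The toy logits and class counts below are illustrative assumptions, not outputs of the actual model:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n_obj, n_attr = 3, 4                          # toy label spaces (illustrative)
obj_logits = rng.normal(size=n_obj)           # from the object branch
attr_logits = rng.normal(size=(n_obj, n_attr))  # attribute logits per conditioning object

p_obj = softmax(obj_logits)                       # P(o | image)
p_attr_given_obj = softmax(attr_logits, axis=-1)  # P(a | o, image)

# Joint score for every (attribute, object) composition, seen or unseen:
p_comp = p_obj[:, None] * p_attr_given_obj        # shape (n_obj, n_attr)

# Prediction = argmax over the joint; it sums to 1 over all compositions.
best_o, best_a = np.unravel_index(p_comp.argmax(), p_comp.shape)
```

Because unseen compositions reuse object and conditional-attribute scores learned from seen pairs, the factorization itself is what enables generalization; training would maximize the log of both factors for the ground-truth pair.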