🤖 AI Summary
Existing CZSL benchmarks rely solely on single-attribute annotations and ignore the semantic co-occurrence and interdependence among multiple attributes, which leads to annotation bias and flawed evaluation. To address this, we introduce MAC, the first zero-shot learning benchmark supporting multi-attribute composition. It comprises 18,217 images annotated with 11,067 fine-grained attribute combinations, averaging 30.2 attributes per object, and it systematically models higher-order semantic synergies among attributes. We propose the MM-encoder, which disentangles attribute and object representations and incorporates graph-structured modeling to capture complex, high-order attribute correlations. Evaluated on MAC, our approach achieves substantial gains in multi-attribute composition recognition accuracy. This work shifts CZSL evaluation from oversimplified single-attribute assumptions toward realistic, compositional scenarios and establishes a new standard benchmark for rigorous assessment.
📝 Abstract
Compositional Zero-Shot Learning (CZSL) aims to learn semantic primitives (attributes and objects) from seen compositions and recognize unseen attribute-object compositions. Existing CZSL datasets focus on single attributes, neglecting the fact that real-world objects naturally exhibit multiple interrelated attributes. Their narrow attribute scope and single-attribute labeling introduce annotation biases, undermining model performance and evaluation. To address these limitations, we introduce the Multi-Attribute Composition (MAC) dataset, encompassing 18,217 images and 11,067 compositions with comprehensive, representative, and diverse attribute annotations. MAC includes an average of 30.2 attributes per object and 65.4 objects per attribute, facilitating better multi-attribute composition predictions. Our dataset supports deeper semantic understanding and higher-order attribute associations, providing a more realistic and challenging benchmark for the CZSL task. We also develop solutions for multi-attribute compositional learning and propose the MM-encoder to disentangle attributes and objects.
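The two coverage statistics quoted above (attributes per object and objects per attribute) can be illustrated with a minimal sketch. The abstract does not state how these averages are computed, so the counting convention here (distinct attributes seen with each object class, and distinct object classes seen with each attribute) is an assumption, and the function name `primitive_coverage` and the toy pairs are hypothetical.

```python
from collections import defaultdict


def primitive_coverage(pairs):
    """Return (avg distinct attributes per object, avg distinct objects per attribute).

    `pairs` is an iterable of (attribute, object) tuples. This counting
    convention is an assumption, not the dataset's documented procedure.
    """
    attrs_of = defaultdict(set)  # object -> attributes observed with it
    objs_of = defaultdict(set)   # attribute -> objects observed with it
    for attr, obj in pairs:
        attrs_of[obj].add(attr)
        objs_of[attr].add(obj)
    avg_attrs = sum(len(s) for s in attrs_of.values()) / len(attrs_of)
    avg_objs = sum(len(s) for s in objs_of.values()) / len(objs_of)
    return avg_attrs, avg_objs


# Toy annotations (hypothetical, not taken from MAC).
toy_pairs = [
    ("red", "apple"), ("round", "apple"),
    ("red", "car"), ("shiny", "car"), ("old", "car"),
]
print(primitive_coverage(toy_pairs))  # (2.5, 1.25)
```

Under this convention both averages share the same numerator, the number of distinct attribute-object pairs, so they differ only through the number of object classes versus the number of attributes.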