🤖 AI Summary
Compositional Zero-Shot Learning (CZSL) faces three key challenges: (1) background interference hinders sufficient disentanglement of attributes and objects; (2) conventional word embeddings inadequately capture multimodal semantics; and (3) models exhibit overconfidence on seen compositions, impairing generalization to unseen attribute–object pairs. To address these, we propose TRIDENT—a unified framework featuring: (1) discriminative multimodal word embeddings derived from the final hidden states of a Multimodal Large Language Model (MLLM); (2) a learnable conditional masking mechanism enabling fine-grained, background-suppressed multi-granularity feature disentanglement; and (3) LLM-generated auxiliary attributes coupled with attribute smoothing regularization to mitigate overconfidence. TRIDENT achieves state-of-the-art performance across three standard CZSL benchmarks, significantly improving accuracy on unseen compositions. It is the first work to jointly integrate MLLM hidden-state representations, conditional disentanglement masks, and LLM-driven attribute smoothing within the CZSL paradigm.
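The attribute smoothing idea above can be illustrated with a small sketch. The paper does not publish its exact loss here, so this is only one plausible reading, assuming a label-smoothing-style soft target where probability mass is shifted from the ground-truth attribute onto the LLM-generated auxiliary attributes; the function name, `epsilon`, and index layout are all hypothetical.

```python
import numpy as np

def smooth_attribute_targets(num_attrs, true_attr, aux_attrs, epsilon=0.1):
    """Build a soft attribute target (hypothetical formulation):
    keep 1 - epsilon on the ground-truth attribute and spread epsilon
    uniformly over the LLM-generated auxiliary attributes."""
    target = np.zeros(num_attrs)
    target[true_attr] = 1.0 - epsilon
    for a in aux_attrs:
        target[a] += epsilon / len(aux_attrs)
    return target

# e.g. a "sliced" apple, with auxiliary attributes such as "fresh", "peeled"
t = smooth_attribute_targets(num_attrs=5, true_attr=0, aux_attrs=[2, 3], epsilon=0.2)
# t = [0.8, 0.0, 0.1, 0.1, 0.0]
```

Training against such a soft target (e.g. with a cross-entropy over the distribution) discourages the model from placing all probability on the single seen attribute, which is one way the described regularization could counteract overconfidence.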
📝 Abstract
Compositional zero-shot learning (CZSL) aims to recognize novel compositions of attributes and objects learned from seen compositions. Previous works disentangle attributes and objects by extracting the shared and exclusive parts between image pairs that share the same attribute (object), and by aligning these parts with pretrained word embeddings to improve recognition of unseen attribute-object pairs. Despite the significant achievements of existing efforts, they are hampered by three limitations: (1) the efficacy of disentanglement is compromised by background interference and by the intricate entanglement of attribute and object within the same image regions; (2) existing word embeddings fail to capture complex multimodal semantic information; and (3) the overconfidence exhibited by existing models on seen compositions hinders their generalization to novel compositions. Aware of these limitations, we propose a novel framework named Multimodal Large Language Model (MLLM) embeddings and attribute smoothing guided disentanglement (TRIDENT) for CZSL. First, we leverage feature adaptive aggregation modules to mitigate the impact of background, and use learnable condition masks to capture multi-granularity features for disentanglement. Second, the last hidden states of an MLLM are employed as word embeddings for their superior representation capabilities. Finally, we propose attribute smoothing with auxiliary attributes generated by a Large Language Model (LLM) for seen compositions, addressing overconfidence by encouraging the model to learn additional plausible attributes for each given composition. Extensive experiments demonstrate that TRIDENT achieves state-of-the-art performance on three benchmarks.
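The learnable condition masks can be pictured as soft channel gates over a shared visual feature. The abstract does not specify the mask architecture, so the sketch below is a minimal assumed form: one sigmoid-gated mask per branch (attribute vs. object), with the mask logits standing in for parameters that would be trained end to end; all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

D = 8                                   # feature dimensionality (toy value)
feature = rng.normal(size=D)            # backbone feature after aggregation
mask_attr_logits = rng.normal(size=D)   # learnable mask parameters (attribute branch)
mask_obj_logits = rng.normal(size=D)    # learnable mask parameters (object branch)

# Sigmoid gating produces soft masks in (0, 1), so each branch can select
# a different subset of channels from the same shared feature.
attr_feat = sigmoid(mask_attr_logits) * feature  # attribute-specific part
obj_feat = sigmoid(mask_obj_logits) * feature    # object-specific part
```

In the full model these gated features would then be aligned with the attribute and object embeddings; stacking several such masks at different feature scales is one way to realize the multi-granularity disentanglement the abstract describes.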