Leveraging MLLM Embeddings and Attribute Smoothing for Compositional Zero-Shot Learning

📅 2024-11-18
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Compositional Zero-Shot Learning (CZSL) faces three key challenges: (1) background interference hinders sufficient disentanglement of attributes and objects; (2) conventional word embeddings inadequately capture multimodal semantics; and (3) models exhibit overconfidence on seen compositions, impairing generalization to unseen attribute–object pairs. To address these, we propose TRIDENT—a unified framework featuring: (1) discriminative multimodal word embeddings derived from the final hidden states of a Multimodal Large Language Model (MLLM); (2) a learnable conditional masking mechanism enabling fine-grained, background-suppressed multi-granularity feature disentanglement; and (3) LLM-generated auxiliary attributes coupled with attribute smoothing regularization to mitigate overconfidence. TRIDENT achieves state-of-the-art performance across three standard CZSL benchmarks, significantly improving accuracy on unseen compositions. It is the first work to jointly integrate MLLM hidden-state representations, conditional disentanglement masks, and LLM-driven attribute smoothing within the CZSL paradigm.

📝 Abstract
Compositional zero-shot learning (CZSL) aims to recognize novel compositions of attributes and objects learned from seen compositions. Previous works disentangle attributes and objects by extracting the shared and exclusive parts between image pairs sharing the same attribute (object), and by aligning them with pretrained word embeddings to improve unseen attribute-object recognition. Despite the significant achievements of existing efforts, they are hampered by three limitations: (1) the efficacy of disentanglement is compromised by background interference and the intricate entanglement of attribute with object in the same image regions; (2) existing word embeddings fail to capture complex multimodal semantic information; (3) the overconfidence existing models exhibit on seen compositions hinders their generalization to novel compositions. Aware of these limitations, we propose a novel framework named Multimodal Large Language Model (MLLM) embeddings and attribute smoothing guided disentanglement (TRIDENT) for CZSL. First, we leverage feature adaptive aggregation modules to mitigate the impact of background, and use learnable condition masks to capture multi-granularity features for disentanglement. Then, the last hidden states of an MLLM are employed as word embeddings for their superior representation capabilities. Moreover, we propose attribute smoothing with auxiliary attributes generated by a Large Language Model (LLM) for seen compositions, addressing overconfidence by encouraging the model to learn multiple attributes for each given composition. Extensive experiments demonstrate that TRIDENT achieves state-of-the-art performance on three benchmarks.
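The mask-based disentanglement described in the abstract can be sketched roughly as follows. This is an illustrative simplification, not the paper's implementation: the function name `disentangle` and the single-granularity elementwise sigmoid masks are assumptions, whereas TRIDENT uses multi-granularity condition masks inside a larger network.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def disentangle(features, attr_mask_logits, obj_mask_logits):
    """Split a shared visual feature vector into attribute-specific and
    object-specific parts via learnable elementwise masks (hypothetical
    single-granularity sketch of the idea in the abstract)."""
    attr_feat = [f * sigmoid(m) for f, m in zip(features, attr_mask_logits)]
    obj_feat = [f * sigmoid(m) for f, m in zip(features, obj_mask_logits)]
    return attr_feat, obj_feat
```

In training, the mask logits would be learned parameters, so the network can decide which feature dimensions carry attribute information and which carry object information.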
Problem

Research questions and friction points this paper is trying to address.

Recognize novel attribute-object compositions from seen ones
Improve disentanglement of attributes and objects in images
Address overconfidence in seen compositions for better generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Feature adaptive aggregation reduces background impact
MLLM embeddings enhance semantic representation
Attribute smoothing with LLM addresses overconfidence
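The attribute-smoothing idea above resembles label smoothing restricted to LLM-suggested attributes: instead of a one-hot target for the ground-truth attribute, some probability mass is spread over plausible auxiliary attributes. A minimal sketch, assuming a helper `smooth_attribute_label` and a mixing weight `epsilon` (both hypothetical names; the paper's exact formulation may differ):

```python
def smooth_attribute_label(num_attrs, true_attr, auxiliary_attrs, epsilon=0.1):
    """Build a smoothed attribute target: keep 1 - epsilon on the labeled
    attribute and spread epsilon evenly over LLM-generated auxiliary
    attributes (hypothetical sketch, not the paper's exact loss)."""
    target = [0.0] * num_attrs
    if auxiliary_attrs:
        target[true_attr] = 1.0 - epsilon
        share = epsilon / len(auxiliary_attrs)
        for a in auxiliary_attrs:
            target[a] += share
    else:
        # No auxiliary attributes available: fall back to a one-hot target.
        target[true_attr] = 1.0
    return target
```

Training against such a soft target discourages the model from assigning all confidence to a single seen attribute, which is the overconfidence problem the bullet refers to.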
Xudong Yan
School of Computer Science and Technology, Beijing Jiaotong University
Songhe Feng
Professor in School of Computer Science and Technology, Beijing Jiaotong University
multi-view learning · zero-shot learning · test-time adaptation
Yang Zhang
School of Computer Science and Technology, Beijing Jiaotong University
Jian Yang
Qifu Technology
Yueguan Lin
Qifu Technology
Haojun Fei
Qifu Technology