Understanding Hardness of Vision-Language Compositionality from A Token-level Causal Lens

📅 2025-10-30
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
CLIP is fragile at compositional reasoning over objects, attributes, and relations, often degenerating into bag-of-words matching. Existing causal explanations model text as a single vector and neglect token-level linguistic structure, so they cannot characterize prompt sensitivity or failure mechanisms on hard negatives. This work introduces the first token-level structural causal model (SCM) for CLIP and identifies the root cause as compositional non-identifiability: the contrastive objective does not guarantee compositional semantic identifiability, so pseudo-optimal text encoders exist that preserve cross-modal alignment while remaining insensitive to SWAP, REPLACE, and ADD operations over atomic concepts. The analysis extends block identifiability to tokenized text, models iterated composition operators to show how hardness compounds, and motivates targeted hard-negative mining. The framework offers both an analytical tool and an interpretable pathway for improving compositional generalization in vision-language models.
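
To make the SWAP, REPLACE, and ADD operations concrete, here is a minimal Python sketch of how such perturbations could generate hard-negative captions; the `ATTRIBUTES` vocabulary and the token-level heuristics are illustrative assumptions, not the paper's implementation.

```python
import random

# Illustrative attribute vocabulary (an assumption, not from the paper).
ATTRIBUTES = {"red", "blue", "small", "large", "wooden", "metal"}

def swap(tokens):
    """SWAP: exchange two attribute tokens, e.g. 'red cup on blue table'
    -> 'blue cup on red table'. Returns None if fewer than two attributes."""
    idx = [i for i, t in enumerate(tokens) if t in ATTRIBUTES]
    if len(idx) < 2:
        return None
    i, j = random.sample(idx, 2)
    out = tokens[:]
    out[i], out[j] = out[j], out[i]
    return out

def replace(tokens):
    """REPLACE: substitute one attribute with a different one,
    e.g. 'red cup' -> 'blue cup'."""
    idx = [i for i, t in enumerate(tokens) if t in ATTRIBUTES]
    if not idx:
        return None
    i = random.choice(idx)
    out = tokens[:]
    out[i] = random.choice(sorted(ATTRIBUTES - {tokens[i]}))
    return out

def add(tokens):
    """ADD: attach an extra attribute, e.g. 'a cup' -> 'a metal cup'.
    Naively inserts before a randomly chosen position."""
    i = random.randrange(len(tokens))
    return tokens[:i] + [random.choice(sorted(ATTRIBUTES))] + tokens[i:]

caption = "a red cup on a blue table".split()
for op in (swap, replace, add):
    negative = op(caption)
    if negative:
        print(op.__name__.upper(), "->", " ".join(negative))
```

A pseudo-optimal text encoder, in the paper's sense, assigns such perturbed captions embeddings indistinguishable from the original's, so it cannot rank the true caption above these negatives.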

📝 Abstract
Contrastive Language-Image Pre-training (CLIP) delivers strong cross-modal generalization by aligning images and texts in a shared embedding space, yet it persistently fails at compositional reasoning over objects, attributes, and relations, often behaving like a bag-of-words matcher. Prior causal accounts typically model text as a single vector, obscuring token-level structure and leaving core phenomena, such as prompt sensitivity and failures on hard negatives, unexplained. We address this gap with a token-aware causal representation learning (CRL) framework grounded in a sequential, language-token SCM. Our theory extends block identifiability to tokenized text, proving that CLIP's contrastive objective can recover the modal-invariant latent variable under both sentence-level and token-level SCMs. Crucially, token granularity yields the first principled explanation of CLIP's compositional brittleness: composition non-identifiability. We show the existence of pseudo-optimal text encoders that achieve perfect modal-invariant alignment yet are provably insensitive to SWAP, REPLACE, and ADD operations over atomic concepts, thereby failing to distinguish correct captions from hard negatives despite optimizing the same training objective as true-optimal encoders. The analysis further links language-side non-identifiability to visual-side failures via the modality gap and shows how iterated composition operators compound hardness, motivating improved negative mining strategies.
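
For reference, the training objective at issue is CLIP's symmetric contrastive (InfoNCE) loss. The generic PyTorch sketch below illustrates why it underdetermines compositional structure: a text encoder that maps a caption and its hard negative to the same embedding incurs no extra penalty unless that negative appears as an in-batch contrast item. This is a standard formulation, not the paper's code.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings.
    image_emb, text_emb: (B, D) tensors; row i of each is a matched pair."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature  # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Image-to-text and text-to-image cross-entropy, averaged.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# The loss only constrains pairwise alignment across the batch, which is
# why encoders insensitive to SWAP/REPLACE/ADD can still be loss-optimal.
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```
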
Problem

Research questions and friction points this paper is trying to address.

Explaining CLIP's compositional brittleness through token-level causal analysis
Identifying pseudo-optimal encoders insensitive to atomic concept operations
Linking language-side non-identifiability to visual-side failures via the modality gap
Innovation

Methods, ideas, or system contributions that make the work stand out.

Token-aware causal representation learning framework
Extends block identifiability to tokenized text
Explains compositional brittleness via composition non-identifiability
Ziliang Chen
AP, Peng Cheng Laboratory
Machine Learning · Foundation Models · Multimodal Embodied Intelligence
Tianang Xiao
Hong Kong University of Science and Technology (Guangzhou)
Jusheng Zhang
Sun Yat-sen University
Yongsen Zheng
Nanyang Technological University / Sun Yat-sen University
Recommender System · Human-AI Dialogue System · Natural Language Processing · Trustworthy AI · AI Safety
Xipeng Chen
Research Institute of Multiple Agents and Embodied Intelligence, Peng Cheng Laboratory