🤖 AI Summary
Existing vision-language models excel at large-scale image-text alignment but neglect the compositional structure of language, such as word order and predicate-argument structure, leading to suboptimal performance on syntax-sensitive tasks. To address this, we propose DisCoCLIP, built around a syntax-aware tensor network text encoder that integrates Combinatory Categorial Grammar (CCG) parsing with distributed tensor representations; high-order tensor decomposition keeps this grammar-driven language modeling interpretable and parameter-efficient. The encoder is coupled with a frozen CLIP visual backbone and jointly optimized end-to-end for cross-modal alignment. Our approach substantially enhances structured semantic understanding: verb accuracy on SVO-Probes improves from 77.6% to 82.4%; ARO attribution and relation scores increase by 9.2% and 4.1%, respectively; and the model achieves 93.7% accuracy on our newly constructed SVO-Swap benchmark, a diagnostic probe for syntactic robustness in vision-language grounding.
📝 Abstract
Recent vision-language models excel at large-scale image-text alignment but often neglect the compositional structure of language, leading to failures on tasks that hinge on word order and predicate-argument structure. We introduce DisCoCLIP, a multimodal encoder that combines a frozen CLIP vision transformer with a novel tensor network text encoder that explicitly encodes syntactic structure. Sentences are parsed with a Combinatory Categorial Grammar parser to yield distributional word tensors whose contractions mirror the sentence's grammatical derivation. To keep the model efficient, high-order tensors are factorized with tensor decompositions, reducing parameter count from tens of millions to under one million. Trained end-to-end with a self-supervised contrastive loss, DisCoCLIP markedly improves sensitivity to verb semantics and word order: it raises CLIP's SVO-Probes verb accuracy from 77.6% to 82.4%, boosts ARO attribution and relation scores by over 9% and 4%, and achieves 93.7% on a newly introduced SVO-Swap benchmark. These results demonstrate that embedding explicit linguistic structure via tensor networks yields interpretable, parameter-efficient representations that substantially improve compositional reasoning in vision-language tasks.
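The core composition step described above, a transitive verb represented as an order-3 tensor whose contractions with the subject and object vectors mirror the CCG derivation `(S\NP)/NP`, can be sketched as follows. This is not the authors' code; the dimensionality, the CP rank, and the use of a plain CP (PARAFAC) factorization are illustrative assumptions, chosen only to show how the decomposition shrinks the parameter count.

```python
# Minimal sketch (not the DisCoCLIP implementation) of CCG-guided tensor
# composition: a transitive verb is an order-3 tensor contracted with
# subject and object vectors. Dimensions and CP rank are assumptions.
import numpy as np

rng = np.random.default_rng(0)
d = 64      # noun-space dimensionality (assumed)
rank = 16   # CP rank of the factorized verb tensor (assumed)

subj = rng.standard_normal(d)   # e.g. "dog"
obj = rng.standard_normal(d)    # e.g. "ball"

# Dense verb tensor: d**3 parameters (262,144 here) -- the cost that
# motivates factorization in the first place.
verb_dense = rng.standard_normal((d, d, d))
sent_dense = np.einsum('s,svo,o->v', subj, verb_dense, obj)

# CP-factorized verb: verb[s,v,o] = sum_r A[s,r] * B[v,r] * C[o,r],
# only 3 * d * rank parameters (3,072 here). The contraction below
# never materializes the full d x d x d tensor.
A, B, C = (rng.standard_normal((d, rank)) for _ in range(3))
sent_cp = (B * ((subj @ A) * (obj @ C))).sum(axis=1)

print(sent_dense.shape, sent_cp.shape)  # both (64,): a sentence vector
```

The factorized path is what makes the parameter savings concrete: the dense verb needs `d**3` weights, while the rank-`r` version needs `3 * d * r`, which is how per-word tensors across a vocabulary can stay under a million parameters in total.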