DisCoCLIP: A Distributional Compositional Tensor Network Encoder for Vision-Language Understanding

📅 2025-09-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision-language models excel at large-scale image-text alignment but neglect linguistic compositional structure, such as word order and predicate-argument structure, leading to suboptimal performance on syntax-sensitive tasks. To address this, we propose a syntax-aware tensor network text encoder that integrates Combinatory Categorial Grammar (CCG) parsing with distributional tensor representations, enabling interpretable, parameter-efficient, grammar-driven language modeling via high-order tensor decomposition. The encoder is coupled with a frozen CLIP visual backbone and jointly optimized end-to-end for cross-modal alignment. Our approach substantially enhances structured semantic understanding: verb accuracy on SVO-Probes improves from 77.6% to 82.4%; ARO attribution and relation scores increase by 9.2% and 4.1%, respectively; and the model achieves 93.7% accuracy on our newly constructed SVO-Swap benchmark, a diagnostic probe for syntactic robustness in vision-language grounding.
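
To make the contraction scheme concrete, the following minimal NumPy sketch shows how a CCG-typed transitive verb can act as an order-3 tensor that is contracted with its subject and object vectors in the order dictated by the grammatical derivation. The dimension, index labels, and random initialization are illustrative assumptions, not the paper's actual encoder.

```python
import numpy as np

d = 64  # embedding dimension (illustrative)

# Distributional representations: nouns are vectors; a transitive verb of
# CCG type (S\NP)/NP is an order-3 tensor with one index per argument plus
# one for the resulting sentence space.
subject = np.random.randn(d)       # e.g. "dog"
obj = np.random.randn(d)           # e.g. "ball"
verb = np.random.randn(d, d, d)    # e.g. "chases"; axes = (subj, sent, obj)

# The derivation (dog (chases ball)) fixes the contraction order:
# the verb first consumes its object NP, then its subject NP.
verb_obj = np.einsum('iso,o->is', verb, obj)        # category S\NP
sentence = np.einsum('is,i->s', verb_obj, subject)  # category S
```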

📝 Abstract
Recent vision-language models excel at large-scale image-text alignment but often neglect the compositional structure of language, leading to failures on tasks that hinge on word order and predicate-argument structure. We introduce DisCoCLIP, a multimodal encoder that combines a frozen CLIP vision transformer with a novel tensor network text encoder that explicitly encodes syntactic structure. Sentences are parsed with a Combinatory Categorial Grammar parser to yield distributional word tensors whose contractions mirror the sentence's grammatical derivation. To keep the model efficient, high-order tensors are factorized with tensor decompositions, reducing parameter count from tens of millions to under one million. Trained end-to-end with a self-supervised contrastive loss, DisCoCLIP markedly improves sensitivity to verb semantics and word order: it raises CLIP's SVO-Probes verb accuracy from 77.6% to 82.4%, boosts ARO attribution and relation scores by over 9% and 4%, and achieves 93.7% on a newly introduced SVO-Swap benchmark. These results demonstrate that embedding explicit linguistic structure via tensor networks yields interpretable, parameter-efficient representations that substantially improve compositional reasoning in vision-language tasks.
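
As a rough sketch of the training setup described above, the snippet below keeps the image tower frozen and updates only the text encoder under a symmetric, CLIP-style InfoNCE loss. The `TensorNetworkTextEncoder` stand-in, feature dimensions, batch size, and hyperparameters are placeholders assumed for illustration; they do not reproduce the authors' implementation.

```python
import torch
import torch.nn.functional as F

class TensorNetworkTextEncoder(torch.nn.Module):
    """Hypothetical stand-in for the CCG tensor-network text encoder."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.proj = torch.nn.Linear(dim, dim)  # placeholder for contractions

    def forward(self, text_feats: torch.Tensor) -> torch.Tensor:
        return self.proj(text_feats)

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # Symmetric InfoNCE over in-batch image-text pairs, as in CLIP.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(logits.size(0))
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

text_encoder = TensorNetworkTextEncoder()
optimizer = torch.optim.AdamW(text_encoder.parameters(), lr=1e-4)

img_emb = torch.randn(8, 512)   # frozen CLIP ViT features: no gradients flow
txt_emb = text_encoder(torch.randn(8, 512))  # stand-in for parsed sentences
loss = clip_contrastive_loss(img_emb, txt_emb)
loss.backward()
optimizer.step()
```
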
Problem

Research questions and friction points this paper is trying to address.

Addresses CLIP's neglect of linguistic compositionality and word order
Improves vision-language models' sensitivity to verb semantics and syntax
Enhances compositional reasoning while maintaining parameter efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Tensor network text encoder that explicitly encodes syntactic structure
Distributional word tensors mirroring grammatical derivation
Factorized high-order tensors reducing parameter count (see the sketch after this list)
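
To illustrate the scale of the savings claimed above, the sketch below factorizes a dense order-3 verb tensor with a rank-r CP decomposition: d³ entries become three d×r factor matrices, and the sentence vector is computed without ever materializing the full tensor. The dimension and rank are assumed values for illustration only.

```python
import numpy as np

d, r = 512, 32  # embedding dimension and CP rank (illustrative)

dense_params = d ** 3   # full verb tensor: 134,217,728 parameters
cp_params = 3 * d * r   # CP factors: 49,152 parameters

# CP form: verb[i, s, o] = sum_k A[i, k] * B[s, k] * C[o, k]
A, B, C = (np.random.randn(d, r) for _ in range(3))
subject, obj = np.random.randn(d), np.random.randn(d)

# Contract subject and object through the factors, never forming verb itself.
sentence = B @ ((A.T @ subject) * (C.T @ obj))  # shape (d,)

print(f"dense: {dense_params:,} params  vs  CP: {cp_params:,} params")
```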
👥 Authors

Kin Ian Lo
University College London
Quantum NLP, Contextuality

Hala Hawashin
University College London

Mina Abbaszadeh
University College London

Tilen Limback-Stokin
University College London

Hadi Wazni
University College London

Mehrnoosh Sadrzadeh
Professor of Computer Science, Royal Academy of Engineering Research Chair, University College London
Logic, Categorial Grammars, Compositional Distributional Semantics, Machine Learning