🤖 AI Summary
Existing vision-language models excel at large-scale image-text alignment but neglect the compositional structure of language, such as word order and predicate-argument structure, leading to suboptimal performance on syntax-sensitive tasks. To address this, we propose DisCoCLIP, built around a syntax-aware tensor network text encoder that integrates Combinatory Categorial Grammar (CCG) parsing with distributed tensor representations; high-order tensor decomposition keeps this grammar-driven language modeling interpretable and parameter-efficient. The encoder is coupled with a frozen CLIP visual backbone and jointly optimized end-to-end for cross-modal alignment. Our approach substantially enhances structured semantic understanding: verb accuracy on SVO-Probes improves from 77.6% to 82.4%; ARO attribution and relation scores increase by 9.2% and 4.1%, respectively; and the model achieves 93.7% accuracy on our newly constructed SVO-Swap benchmark, a diagnostic probe for syntactic robustness in vision-language grounding.
📝 Abstract
Recent vision-language models excel at large-scale image-text alignment but often neglect the compositional structure of language, leading to failures on tasks that hinge on word order and predicate-argument structure. We introduce DisCoCLIP, a multimodal encoder that combines a frozen CLIP vision transformer with a novel tensor network text encoder that explicitly encodes syntactic structure. Sentences are parsed with a Combinatory Categorial Grammar parser to yield distributional word tensors whose contractions mirror the sentence's grammatical derivation. To keep the model efficient, high-order tensors are factorized with tensor decompositions, reducing parameter count from tens of millions to under one million. Trained end-to-end with a self-supervised contrastive loss, DisCoCLIP markedly improves sensitivity to verb semantics and word order: it raises CLIP's SVO-Probes verb accuracy from 77.6% to 82.4%, boosts ARO attribution and relation scores by over 9% and 4%, and achieves 93.7% on a newly introduced SVO-Swap benchmark. These results demonstrate that embedding explicit linguistic structure via tensor networks yields interpretable, parameter-efficient representations that substantially improve compositional reasoning in vision-language tasks.
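The core composition step described above, a transitive verb represented as an order-3 tensor whose contractions with the subject and object vectors mirror the CCG derivation `(S\NP)/NP`, can be sketched as follows. This is not the authors' code; the dimensionality, the CP rank, and the use of a plain CP (PARAFAC) factorization are illustrative assumptions, chosen only to show how the decomposition shrinks the parameter count.

```python
# Minimal sketch (not the DisCoCLIP implementation) of CCG-guided tensor
# composition: a transitive verb is an order-3 tensor contracted with
# subject and object vectors. Dimensions and CP rank are assumptions.
import numpy as np

rng = np.random.default_rng(0)
d = 64      # noun-space dimensionality (assumed)
rank = 16   # CP rank of the factorized verb tensor (assumed)

subj = rng.standard_normal(d)   # e.g. "dog"
obj = rng.standard_normal(d)    # e.g. "ball"

# Dense verb tensor: d**3 parameters (262,144 here) -- the cost that
# motivates factorization in the first place.
verb_dense = rng.standard_normal((d, d, d))
sent_dense = np.einsum('s,svo,o->v', subj, verb_dense, obj)

# CP-factorized verb: verb[s,v,o] = sum_r A[s,r] * B[v,r] * C[o,r],
# only 3 * d * rank parameters (3,072 here). The contraction below
# never materializes the full d x d x d tensor.
A, B, C = (rng.standard_normal((d, rank)) for _ in range(3))
sent_cp = (B * ((subj @ A) * (obj @ C))).sum(axis=1)

print(sent_dense.shape, sent_cp.shape)  # both (64,): a sentence vector
```

The factorized path is what makes the parameter savings concrete: the dense verb needs `d**3` weights, while the rank-`r` version needs `3 * d * r`, which is how per-word tensors across a vocabulary can stay under a million parameters in total.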