Long Story Short: Disentangling Compositionality and Long-Caption Understanding in VLMs

📅 2025-09-23
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current vision-language models (VLMs) exhibit bottlenecks in comprehending long, dense image captions, primarily due to insufficient compositional reasoning—specifically, object-attribute binding and inter-object relational inference. This work systematically investigates the interplay between compositionality and long-caption understanding, proposing a contrastive learning-based joint training framework that simultaneously optimizes object-level binding, relational reasoning, and long-caption alignment. Key contributions include: (i) the first empirical validation of a bidirectional mutual enhancement between compositional reasoning and long-caption comprehension; and (ii) identification of high-quality, grounded long-text data and fine-grained task design as critical determinants of generalization. Experiments on a newly constructed high-quality long-caption dataset demonstrate that our model significantly outperforms baselines on both long-caption cross-modal retrieval and compositional benchmarks (e.g., CLEVR, NLVR²), confirming that joint modeling synergistically improves these two fundamental capabilities.

Technology Category

Application Category

📝 Abstract
Contrastive vision-language models (VLMs) have made significant progress in binding visual and textual information, but understanding long, dense captions remains an open challenge. We hypothesize that compositionality, the capacity to reason about object-attribute bindings and inter-object relationships, is key to understanding longer captions. In this paper, we investigate the interaction between compositionality and long-caption understanding, asking whether training for one property enhances the other. We train and evaluate a range of models that target each of these capabilities. Our results reveal a bidirectional relationship: compositional training improves performance on long-caption retrieval, and training on long captions promotes compositionality. However, these gains are sensitive to data quality and model design. We find that training on poorly structured captions, or with limited parameter updates, fails to support generalization. Likewise, strategies that aim at retaining general alignment, such as freezing positional embeddings, do not improve compositional understanding. Overall, we find that compositional understanding and long-caption understanding are intertwined capabilities that can be jointly learned through training on dense, grounded descriptions. Despite these challenges, we show that models trained on high-quality, long-caption data can achieve strong performance in both tasks, offering practical guidance for improving VLM generalization.
Problem

Research questions and friction points this paper is trying to address.

Understanding long dense captions remains challenging for vision-language models
Investigating whether compositional training improves long-caption understanding capabilities
Examining the bidirectional relationship between compositionality and long-caption comprehension
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training models on compositional data improves long-caption retrieval
Training on long captions enhances compositional reasoning capabilities
Joint learning from dense grounded descriptions achieves strong performance
🔎 Similar Papers
No similar papers found.