Coreference as an indicator of context scope in multimodal narrative

📅 2025-03-07
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large multimodal language models (MLLMs) exhibit substantial discrepancies from humans in coreference distribution within multimodal narratives, reflecting limitations in cross-modal contextual tracking. Method: We propose coreference patterns as a proxy metric to quantify this capability and introduce a visual storytelling evaluation framework featuring coreference chain analysis, cross-modal consistency scoring, and entropy-driven referential diversity measurement—enabling the first multidimensional assessment of hybrid coreference stability. Results: Experiments reveal that while MLLMs improve textual fluency, they significantly underperform humans in image-text joint coreference consistency and interleaved entity referencing, exposing constraints in implicit cross-modal context modeling. This work establishes an interpretable, quantitative evaluation paradigm for multimodal coreference modeling, advancing both diagnostic rigor and model development in multimodal NLP.

Technology Category

Application Category

📝 Abstract
We demonstrate that large multimodal language models differ substantially from humans in the distribution of coreferential expressions in a visual storytelling task. We introduce a number of metrics to quantify the characteristics of coreferential patterns in both human- and machine-written texts. Humans distribute coreferential expressions in a way that maintains consistency across texts and images, interleaving references to different entities in a highly varied way. Machines are less able to track mixed references, despite achieving perceived improvements in generation quality.
Problem

Research questions and friction points this paper is trying to address.

Compare coreference distribution in humans and machines
Quantify coreferential patterns in multimodal storytelling
Assess machine ability to track mixed references
Innovation

Methods, ideas, or system contributions that make the work stand out.

Metrics quantify coreferential patterns in texts.
Humans maintain consistency across texts and images.
Machines struggle with tracking mixed references.
🔎 Similar Papers
No similar papers found.