🤖 AI Summary
Large multimodal language models (MLLMs) exhibit substantial discrepancies from humans in coreference distribution within multimodal narratives, reflecting limitations in cross-modal contextual tracking.
Method: We propose coreference patterns as a proxy metric to quantify this capability and introduce a visual storytelling evaluation framework featuring coreference chain analysis, cross-modal consistency scoring, and entropy-driven referential diversity measurement—enabling the first multidimensional assessment of hybrid coreference stability.
Results: Experiments reveal that while MLLMs improve textual fluency, they significantly underperform humans in image-text joint coreference consistency and interleaved entity referencing, exposing constraints in implicit cross-modal context modeling. This work establishes an interpretable, quantitative evaluation paradigm for multimodal coreference modeling, advancing both diagnostic rigor and model development in multimodal NLP.
📝 Abstract
We demonstrate that large multimodal language models differ substantially from humans in the distribution of coreferential expressions in a visual storytelling task. We introduce a number of metrics to quantify the characteristics of coreferential patterns in both human- and machine-written texts. Humans distribute coreferential expressions in a way that maintains consistency across texts and images, interleaving references to different entities in a highly varied way. Machines are less able to track mixed references, despite achieving perceived improvements in generation quality.