AI Summary
This study investigates systematic discrepancies between vision-language models and humans in generating coherent visual narratives. To this end, it introduces the first unified evaluation framework for narrative coherence that integrates multiple dimensions: referential consistency, discourse relations, thematic continuity, character persistence, and multimodal character grounding. Leveraging techniques such as coreference analysis, discourse relation classification, topic modeling, character tracking, and multimodal alignment, the framework enables fine-grained quantitative assessment. The findings reveal that, despite producing superficially fluent text, current models significantly deviate from human-like narrative structures. Moreover, joint multidimensional analysis substantially enhances the ability to detect these discrepancies, offering a new benchmark and deeper insights for advancing visual narrative modeling.
Abstract
We study narrative coherence in visually grounded stories by comparing human-written narratives with those generated by vision-language models (VLMs) on the Visual Writing Prompts corpus. Using a set of metrics that capture different aspects of narrative coherence, including coreference, discourse relation types, topic continuity, character persistence, and multimodal character grounding, we compute a narrative coherence score. We find that VLMs show coherence profiles that are broadly similar to one another but differ systematically from those of humans. Moreover, differences on individual measures are often subtle, yet they become clearer when the measures are considered jointly. Overall, our results indicate that, despite human-like surface fluency, model narratives differ systematically from human ones in how they organise discourse across a visually grounded story. Our code is available at https://github.com/GU-CLASP/coherence-driven-humans.
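The abstract describes aggregating several per-dimension metrics into a single narrative coherence score. A minimal sketch of such an aggregation is given below; the metric names, the [0, 1] normalisation, and the equal-weight average are illustrative assumptions, not the paper's actual formulation.

```python
# Hypothetical sketch: combine normalised per-dimension coherence metrics
# into one composite score. Dimension names follow the abstract; the
# equal-weight mean is an assumption for illustration only.
from statistics import mean


def coherence_score(metrics: dict[str, float]) -> float:
    """Average the normalised (0..1) per-dimension metric values."""
    return mean(metrics.values())


# Illustrative (made-up) values for a human-written story:
human_story = {
    "coreference": 0.82,
    "discourse_relations": 0.74,
    "topic_continuity": 0.79,
    "character_persistence": 0.85,
    "multimodal_grounding": 0.71,
}

print(round(coherence_score(human_story), 3))  # → 0.782
```

In practice one would compare such composite scores (or, as the abstract suggests, the joint profile of the individual dimensions) between human and model narratives.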