AI Summary
This study investigates systematic discrepancies between vision-language models and humans in generating coherent visual narratives. To this end, it introduces the first unified evaluation framework for narrative coherence that integrates multiple dimensions: referential consistency, discourse relations, thematic continuity, character persistence, and multimodal character grounding. Leveraging techniques such as coreference analysis, discourse relation classification, topic modeling, character tracking, and multimodal alignment, the framework enables fine-grained quantitative assessment. The findings reveal that, despite producing superficially fluent text, current models significantly deviate from human-like narrative structures. Moreover, joint multidimensional analysis substantially enhances the ability to detect these discrepancies, offering a new benchmark and deeper insights for advancing visual narrative modeling.
Abstract
We study narrative coherence in visually grounded stories by comparing human-written narratives with those generated by vision-language models (VLMs) on the Visual Writing Prompts corpus. Using a set of metrics that capture different aspects of narrative coherence, including coreference, discourse relation types, topic continuity, character persistence, and multimodal character grounding, we compute a narrative coherence score. We find that VLMs show coherence profiles that are broadly similar to one another but differ systematically from those of humans. Moreover, differences on individual measures are often subtle, yet they become clearer when the measures are considered jointly. Overall, our results indicate that, despite human-like surface fluency, model narratives differ systematically from human ones in how they organise discourse across a visually grounded story. Our code is available at https://github.com/GU-CLASP/coherence-driven-humans.
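The abstract describes aggregating several per-dimension metrics into a single narrative coherence score. A minimal sketch of such an aggregation is given below; the metric names, the [0, 1] normalisation, and the equal-weight average are illustrative assumptions, not the paper's actual formulation.

```python
# Hypothetical sketch: combine normalised per-dimension coherence metrics
# into one composite score. Dimension names follow the abstract; the
# equal-weight mean is an assumption for illustration only.
from statistics import mean


def coherence_score(metrics: dict[str, float]) -> float:
    """Average the normalised (0..1) per-dimension metric values."""
    return mean(metrics.values())


# Illustrative (made-up) values for a human-written story:
human_story = {
    "coreference": 0.82,
    "discourse_relations": 0.74,
    "topic_continuity": 0.79,
    "character_persistence": 0.85,
    "multimodal_grounding": 0.71,
}

print(round(coherence_score(human_story), 3))  # → 0.782
```

In practice one would compare such composite scores (or, as the abstract suggests, the joint profile of the individual dimensions) between human and model narratives.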