Re:Verse -- Can Your VLM Read a Manga?

📅 2025-08-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current vision-language models (VLMs) excel at single-frame recognition but lack the temporal causal reasoning and cross-frame coherence modeling required for continuous visual narratives (e.g., comics). To address this gap, we introduce the first systematic evaluation framework for long-sequence visual storytelling, built on a benchmark of 308 panels from 11 chapters of *Re:Zero*. Our framework integrates fine-grained multimodal annotation, cross-modal embedding analysis, and retrieval-augmented, dialogue-based reasoning. We further propose a light-novel-aligned visual element annotation protocol and a generative narrative evaluation mechanism. Experiments expose critical VLM deficiencies in nonlinear temporal reasoning, character consistency maintenance, and inter-panel causal inference, showing that current models lack story-level semantic understanding. This work establishes a reproducible, modular evaluation paradigm and identifies key diagnostic pathways for advancing visual narrative comprehension.
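
To make the cross-modal embedding analysis concrete, here is a minimal sketch of how panel-to-text alignment could be scored with an off-the-shelf CLIP-style encoder. The checkpoint name, file paths, and example text are illustrative assumptions, not the paper's actual pipeline.

```python
# Hypothetical sketch: score image-text alignment per panel with a CLIP-style encoder.
# The checkpoint, file layout, and example text are illustrative assumptions.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def panel_text_similarity(panel_path: str, narrative_text: str) -> float:
    """Cosine similarity between one panel image and its aligned narrative text."""
    image = Image.open(panel_path).convert("RGB")
    inputs = processor(text=[narrative_text], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img @ txt.T).item())

# A low score on a panel whose meaning depends on earlier panels would hint
# that the joint representation misses cross-panel context.
score = panel_text_similarity("panels/ch01_p012.png",
                              "Subaru realizes the loop has reset.")
print(f"image-text cosine similarity: {score:.3f}")
```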

📝 Abstract
Current Vision Language Models (VLMs) demonstrate a critical gap between surface-level recognition and deep narrative reasoning when processing sequential visual storytelling. Through a comprehensive investigation of manga narrative understanding, we reveal that while recent large multimodal models excel at individual panel interpretation, they systematically fail at temporal causality and cross-panel cohesion, which are core requirements for coherent story comprehension. We introduce a novel evaluation framework that combines fine-grained multimodal annotation, cross-modal embedding analysis, and retrieval-augmented assessment to systematically characterize these limitations. Our methodology includes (i) a rigorous annotation protocol linking visual elements to narrative structure through aligned light novel text, (ii) comprehensive evaluation across multiple reasoning paradigms, including direct inference and retrieval-augmented generation, and (iii) cross-modal similarity analysis revealing fundamental misalignments in current VLMs' joint representations. Applying this framework to the Re:Zero manga across 11 chapters with 308 annotated panels, we conduct the first systematic study of long-form narrative understanding in VLMs through three core evaluation axes: generative storytelling, contextual dialogue grounding, and temporal reasoning. Our findings demonstrate that current models lack genuine story-level intelligence, struggling particularly with non-linear narratives, character consistency, and causal inference across extended sequences. This work establishes both a foundation and a practical methodology for evaluating narrative intelligence, while providing actionable insights into how far multimodal models move beyond basic recognition toward deep sequential understanding of discrete visual narratives.
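
The retrieval-augmented assessment path can be pictured with a small sketch: retrieve the aligned light-novel passages most relevant to a panel-level question, then assemble them into the prompt given to the model under evaluation. The encoder checkpoint, corpus snippets, and the `ask_vlm` stand-in mentioned below are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch: retrieval-augmented grounding for a panel-level question.
# Model names, the corpus contents, and ask_vlm() are illustrative stand-ins.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

# Aligned light-novel passages (in practice, loaded from the chapter annotations).
corpus = [
    "Subaru wakes up in the alley by the fruit stall, the same merchant calling out to him.",
    "Emilia introduces herself under a false name while searching for her insignia.",
    "The loop resets after the attack in the loot house.",
]
corpus_emb = encoder.encode(corpus, convert_to_tensor=True)

def retrieve_context(question: str, top_k: int = 2) -> list[str]:
    """Return the top-k passages most similar to the question."""
    q_emb = encoder.encode(question, convert_to_tensor=True)
    hits = util.semantic_search(q_emb, corpus_emb, top_k=top_k)[0]
    return [corpus[h["corpus_id"]] for h in hits]

def build_prompt(question: str, panel_id: str) -> str:
    """Assemble the retrieved passages into a grounded prompt for the VLM."""
    context = "\n".join(retrieve_context(question))
    return (f"Context from the aligned light novel:\n{context}\n\n"
            f"Question about panel {panel_id}: {question}\nAnswer:")

# ask_vlm(panel_image, prompt) would be the model under evaluation; comparing its
# answers with and without the retrieved context isolates the effect of retrieval.
print(build_prompt("Why does Subaru recognize the merchant?", "ch01_p012"))
```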
Problem

Research questions and friction points this paper is trying to address.

Evaluating VLMs' failures in temporal causal reasoning and cross-panel cohesion
Assessing deep narrative reasoning beyond surface-level recognition in manga
Systematically characterizing limitations in sequential visual storytelling understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-modal embedding analysis for narrative understanding
Retrieval-augmented assessment framework for VLMs
Fine-grained multimodal annotation protocol with aligned light-novel text (one possible record structure is sketched below)
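
One way the per-panel annotation records behind such a protocol could be structured is shown below, assuming each panel links its characters, transcribed dialogue, and aligned light-novel passage. The field names, identifiers, and example values are hypothetical, not the benchmark's released schema.

```python
# Hypothetical sketch of a per-panel annotation record; field names and IDs
# are illustrative, not the benchmark's released schema.
from dataclasses import dataclass, field, asdict
import json

@dataclass
class PanelAnnotation:
    chapter: int                # manga chapter index
    panel_id: str               # unique panel identifier
    characters: list[str]       # characters visible in the panel
    dialogue: list[str]         # transcribed speech-bubble text
    aligned_novel_text: str     # matching light-novel passage
    narrative_event: str        # short description of the story beat
    temporal_links: list[str] = field(default_factory=list)  # related panel_ids (e.g., loop resets)

example = PanelAnnotation(
    chapter=1,
    panel_id="ch01_p012",
    characters=["Subaru", "Fruit merchant"],
    dialogue=["...this alley again?"],
    aligned_novel_text="Subaru found himself back in front of the fruit stall.",
    narrative_event="First on-page hint of Return by Death.",
    temporal_links=["ch01_p003"],
)
print(json.dumps(asdict(example), ensure_ascii=False, indent=2))
```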