🤖 AI Summary
This paper addresses the challenge of structured semantic understanding in visual narratives (e.g., comics). We propose a hierarchical multimodal knowledge graph framework that decomposes narratives into three levels of granularity—story arcs, event segments, and panels—unifying semantic and spatiotemporal modeling across levels. Our key innovation is a multi-granularity alignment mechanism that enables panel-level visual–textual coupling and cross-level symbolic reasoning. The framework constructs multimodal graphs at each level, fuses them hierarchically, and is applied to a manually annotated subset of the Manga109 dataset. Evaluated on four tasks—action retrieval, dialogue tracing, character appearance mapping, and panel timeline reconstruction—it achieves high precision and recall. Experiments demonstrate clear advantages in interpretability, consistency of multimodal representation, and cross-task generalization.
📝 Abstract
This paper presents a hierarchical knowledge graph framework for the structured understanding of visual narratives, focusing on multimodal media such as comics. The proposed method decomposes narrative content into multiple levels, from macro-level story arcs to fine-grained event segments. It represents them through integrated knowledge graphs that capture semantic, spatial, and temporal relationships. At the panel level, we construct multimodal graphs that link visual elements such as characters, objects, and actions with corresponding textual components, including dialogue and captions. These graphs are integrated across narrative levels to support reasoning over story structure, character continuity, and event progression. We apply our approach to a manually annotated subset of the Manga109 dataset and demonstrate its ability to support symbolic reasoning across diverse narrative tasks, including action retrieval, dialogue tracing, character appearance mapping, and panel timeline reconstruction. Evaluation results show high precision and recall across tasks, validating the coherence and interpretability of the framework. This work contributes a scalable foundation for narrative-based content analysis, interactive storytelling, and multimodal reasoning in visual media.
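To make the panel-level construction concrete, here is a minimal sketch of what a multimodal panel graph might look like: visual nodes (characters, objects, actions) and textual nodes (dialogue, captions) connected by typed edges, with a simple query that couples a character to their dialogue. All class and relation names here are illustrative assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: str
    modality: str   # "visual" or "textual" (assumed modality split)
    node_type: str  # e.g. "character", "object", "action", "dialogue", "caption"
    label: str

@dataclass
class PanelGraph:
    """Hypothetical panel-level multimodal graph with typed relation edges."""
    panel_id: str
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)  # (source_id, relation, target_id)

    def add_node(self, node: Node) -> None:
        self.nodes[node.node_id] = node

    def add_edge(self, src: str, relation: str, dst: str) -> None:
        self.edges.append((src, relation, dst))

    def dialogue_of(self, character_id: str) -> list:
        # Panel-level visual-textual coupling: dialogue linked to a character
        # via an assumed "speaks" relation.
        return [self.nodes[dst].label
                for src, rel, dst in self.edges
                if src == character_id and rel == "speaks"]

# Build a toy panel graph
g = PanelGraph("panel_001")
g.add_node(Node("c1", "visual", "character", "Hero"))
g.add_node(Node("a1", "visual", "action", "running"))
g.add_node(Node("t1", "textual", "dialogue", "Let's go!"))
g.add_edge("c1", "performs", "a1")
g.add_edge("c1", "speaks", "t1")

print(g.dialogue_of("c1"))  # prints ["Let's go!"]
```

Graphs like this, one per panel, could then be linked upward to event-segment and story-arc nodes, which is the cross-level integration the abstract describes.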