CoSMo: A Multimodal Transformer for Page Stream Segmentation in Comic Books

📅 2025-07-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper formally defines the task of Page Stream Segmentation (PSS) for comic books, establishing a structured foundation for downstream applications such as character analysis and story indexing. To address ambiguous boundaries between page segments, which arise from heterogeneous text–image layouts and variable page designs in comics, the authors propose CoSMo, a vision-dominant multimodal Transformer that combines visual and textual features. They further introduce ComicPSS, a manually annotated dataset comprising 20,800 comic pages. The multimodal variant improves segmentation robustness in ambiguous cases, while visual features account for most of the macro-structure. Experiments show state-of-the-art performance on F1-Macro, Panoptic Quality, and stream-level metrics, outperforming both traditional baselines and significantly larger general-purpose vision–language models. This work moves comic content understanding toward scalable, fine-grained structural parsing.

📝 Abstract

This paper introduces CoSMo, a novel multimodal Transformer for Page Stream Segmentation (PSS) in comic books, a critical task for automated content understanding, as it is a necessary first stage for many downstream tasks like character analysis, story indexing, or metadata enrichment. We formalize PSS for this unique medium and curate a new 20,800-page annotated dataset. CoSMo, developed in vision-only and multimodal variants, consistently outperforms traditional baselines and significantly larger general-purpose vision-language models across F1-Macro, Panoptic Quality, and stream-level metrics. Our findings highlight the dominance of visual features for comic PSS macro-structure, yet demonstrate multimodal benefits in resolving challenging ambiguities. CoSMo establishes a new state-of-the-art, paving the way for scalable comic book analysis.
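To make the task and the F1-Macro metric concrete, here is a minimal sketch that treats PSS as per-page labelling. It is an illustration under stated assumptions, not the authors' evaluation code: the label names (`cover`, `story`, `ad`) and the helper functions `pages_to_segments` and `macro_f1` are hypothetical, and merging consecutive pages with the same label into one segment is a simplification that cannot separate two adjacent segments of the same class.

```python
from itertools import groupby


def pages_to_segments(labels):
    """Group consecutive pages with the same label into (label, first, last) runs.

    Assumption: a segment is a maximal run of identical page labels.
    """
    segments = []
    start = 0
    for label, run in groupby(labels):
        length = len(list(run))
        segments.append((label, start, start + length - 1))
        start += length
    return segments


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical page labels for a seven-page book (illustrative only).
truth = ["cover", "story", "story", "story", "ad", "story", "story"]
pred = ["cover", "story", "story", "ad", "ad", "story", "story"]

print(pages_to_segments(truth))
print(round(macro_f1(truth, pred), 4))
```

Because every class contributes equally to the macro average, a single mislabelled page in a rare class such as `ad` moves the score much more than the same mistake in the dominant `story` class.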

Problem

Research questions and friction points this paper is trying to address.

Develops CoSMo for page stream segmentation in comic books

Curates a 20,800-page annotated dataset for comic PSS

Outperforms traditional baselines and larger general-purpose vision-language models in both vision-only and multimodal variants

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Transformer for comic book page stream segmentation

Vision-only and multimodal model variants

New annotated dataset with 20,800 pages
