🤖 AI Summary
This paper formally defines the task of Page Stream Segmentation (PSS) for comic books, establishing a structured foundation for downstream applications such as character analysis, story indexing, and metadata enrichment. To address the ambiguous segment boundaries that arise from heterogeneous text–image layouts and variable page designs in comics, the authors propose CoSMo, a multimodal Transformer developed in vision-only and multimodal variants. They further introduce ComicPSS, a manually annotated dataset comprising 20,800 comic pages. Visual features dominate the macro-structure of comic PSS, while adding textual semantics improves segmentation robustness in ambiguous cases. Experiments show state-of-the-art performance on F1-Macro, Panoptic Quality, and stream-level metrics, outperforming both traditional baselines and significantly larger general-purpose vision–language models. This work moves comic content understanding toward scalable, fine-grained structural parsing.
📝 Abstract
This paper introduces CoSMo, a novel multimodal Transformer for Page Stream Segmentation (PSS) in comic books. PSS is a critical task for automated content understanding because it is a necessary first stage for many downstream tasks, such as character analysis, story indexing, and metadata enrichment. We formalize PSS for this unique medium and curate a new 20,800-page annotated dataset. CoSMo, developed in vision-only and multimodal variants, consistently outperforms traditional baselines and significantly larger general-purpose vision-language models across F1-Macro, Panoptic Quality, and stream-level metrics. Our findings highlight the dominance of visual features for the macro-structure of comic PSS, yet demonstrate that multimodal input helps resolve challenging ambiguities. CoSMo establishes a new state of the art, paving the way for scalable comic book analysis.
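To make the evaluation terms concrete, the following minimal Python sketch shows one plausible way to turn per-page labels into segments and to score them with macro-averaged F1 and a one-dimensional adaptation of Panoptic Quality. It is an illustration under stated assumptions, not the paper's evaluation code: the label names (`cover`, `story`, `advertisement`, `back_cover`), the function names, and the matching rule (same label and interval IoU above 0.5) are all hypothetical choices made for this example.

```python
from itertools import groupby


def pages_to_segments(labels):
    """Group consecutive pages with the same label into
    (label, start, end) segments; end is inclusive."""
    segments = []
    index = 0
    for label, run in groupby(labels):
        length = len(list(run))
        segments.append((label, index, index + length - 1))
        index += length
    return segments


def macro_f1(true_labels, pred_labels):
    """Unweighted mean of per-class F1 over the classes in the reference."""
    pairs = list(zip(true_labels, pred_labels))
    scores = []
    for c in sorted(set(true_labels)):
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def panoptic_quality(true_segments, pred_segments):
    """Assumed 1-D adaptation of PQ: a predicted segment matches a
    reference segment when labels agree and interval IoU > 0.5."""
    matched_pred = set()
    iou_sum = 0.0
    tp = 0
    for t_label, t_start, t_end in true_segments:
        for j, (p_label, p_start, p_end) in enumerate(pred_segments):
            if j in matched_pred or p_label != t_label:
                continue
            inter = max(0, min(t_end, p_end) - max(t_start, p_start) + 1)
            union = (t_end - t_start + 1) + (p_end - p_start + 1) - inter
            iou = inter / union
            if iou > 0.5:
                matched_pred.add(j)
                iou_sum += iou
                tp += 1
                break
    fp = len(pred_segments) - tp  # unmatched predicted segments
    fn = len(true_segments) - tp  # unmatched reference segments
    denom = tp + 0.5 * fp + 0.5 * fn
    return iou_sum / denom if denom else 0.0


# Hypothetical eight-page book: the prediction starts the ad one page early.
true_pages = ["cover", "story", "story", "story",
              "advertisement", "story", "story", "back_cover"]
pred_pages = ["cover", "story", "story", "advertisement",
              "advertisement", "story", "story", "back_cover"]

f1 = macro_f1(true_pages, pred_pages)
pq = panoptic_quality(pages_to_segments(true_pages),
                      pages_to_segments(pred_pages))
print(f"F1-Macro: {f1:.3f}  PQ: {pq:.3f}")
```

In this toy case a single mislabeled page lowers the page-level F1-Macro only slightly, while the segment-level score drops more because the advertisement segment no longer overlaps its reference by more than half. This is one reason to report page-level and segment-level metrics side by side.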