🤖 AI Summary
This study addresses the underexplored challenge of fine-grained multimodal understanding of Japanese manga, a culturally specific narrative form characterized by tight image-text coupling, complex sequential storytelling, and domain-unique visual conventions.
Method: We introduce MangaVQA, the first dedicated multimodal benchmark for manga understanding, comprising 526 human-annotated visual question-answering (VQA) pairs, alongside a complementary MangaOCR sub-benchmark for text recognition. We propose a novel evaluation paradigm integrating layout-aware reasoning, temporal narrative structure, and cultural context. Building on Qwen2.5-VL, we develop MangaLMM, a lightweight, domain-specialized large multimodal model, trained via joint OCR-VQA fine-tuning and scene-level granularity partitioning.
Contribution/Results: MangaLMM achieves significant performance gains over state-of-the-art closed-source models (e.g., GPT-4o, Gemini 2.5) on MangaVQA, validating the efficacy of domain-specific fine-tuning. This work establishes the first reproducible, extensible benchmarking and modeling framework for anime/manga AI research.
📄 Abstract
Manga, or Japanese comics, is a richly multimodal narrative form that blends images and text in complex ways. Teaching large multimodal models (LMMs) to understand such narratives at a human-like level could help manga creators reflect on and refine their stories. To this end, we introduce two benchmarks for multimodal manga understanding: MangaOCR, which targets in-page text recognition, and MangaVQA, a novel benchmark designed to evaluate contextual understanding through visual question answering. MangaVQA consists of 526 high-quality, manually constructed question-answer pairs, enabling reliable evaluation across diverse narrative and visual scenarios. Building on these benchmarks, we develop MangaLMM, a manga-specialized model finetuned from the open-source LMM Qwen2.5-VL to jointly handle both tasks. Through extensive experiments, including comparisons with proprietary models such as GPT-4o and Gemini 2.5, we assess how well LMMs understand manga. Our benchmark and model provide a comprehensive foundation for evaluating and advancing LMMs in the richly narrative domain of manga.