MangaVQA and MangaLMM: A Benchmark and Specialized Model for Multimodal Manga Understanding

📅 2025-05-26
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This study addresses the underexplored challenge of fine-grained multimodal understanding of Japanese manga, a narrative form characterized by tight image-text coupling, complex sequential storytelling, and domain-specific visual conventions. Method: The authors introduce two benchmarks: MangaVQA, comprising 526 manually constructed visual question-answering (VQA) pairs for evaluating contextual understanding, and MangaOCR, which targets recognition of in-page text. Building on these benchmarks, they develop MangaLMM, a manga-specialized large multimodal model finetuned from the open-source Qwen2.5-VL to handle OCR and VQA jointly. Contribution/Results: Extensive experiments, including comparisons with proprietary models such as GPT-4o and Gemini 2.5, assess how well current LMMs understand manga, and the benchmarks and model together provide a foundation for evaluating and advancing LMMs in the manga domain.
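The summary describes OCR-VQA joint fine-tuning only at a high level. Below is a minimal sketch of how the two tasks could be cast into one chat-style instruction-tuning set so a single model learns both; the prompts, field names, and file paths are illustrative assumptions, not the paper's actual data schema.

```python
from typing import Dict, List

def ocr_to_chat(page_image: str, texts: List[str]) -> Dict:
    """Cast a page's transcribed in-page texts as one supervised chat example."""
    return {
        "messages": [
            {"role": "user", "content": [
                {"type": "image", "image": page_image},
                {"type": "text", "text": "Read out all the text on this manga page."},
            ]},
            {"role": "assistant", "content": "\n".join(texts)},
        ]
    }

def vqa_to_chat(page_image: str, question: str, answer: str) -> Dict:
    """Cast one question-answer pair as one supervised chat example."""
    return {
        "messages": [
            {"role": "user", "content": [
                {"type": "image", "image": page_image},
                {"type": "text", "text": question},
            ]},
            {"role": "assistant", "content": answer},
        ]
    }

# Toy samples (paths and annotations are placeholders).
ocr_samples = [("page_001.png", ["It's morning already!", "Wake up, Taro."])]
vqa_samples = [("page_001.png", "Who is being woken up?", "Taro")]

# Mixing both tasks into one training set is what "jointly handle both tasks"
# amounts to in practice: the same model sees OCR and VQA supervision.
train_set = [ocr_to_chat(p, t) for p, t in ocr_samples] \
          + [vqa_to_chat(p, q, a) for p, q, a in vqa_samples]
print(len(train_set), "training examples")
```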

๐Ÿ“ Abstract
Manga, or Japanese comics, is a richly multimodal narrative form that blends images and text in complex ways. Teaching large multimodal models (LMMs) to understand such narratives at a human-like level could help manga creators reflect on and refine their stories. To this end, we introduce two benchmarks for multimodal manga understanding: MangaOCR, which targets in-page text recognition, and MangaVQA, a novel benchmark designed to evaluate contextual understanding through visual question answering. MangaVQA consists of 526 high-quality, manually constructed question-answer pairs, enabling reliable evaluation across diverse narrative and visual scenarios. Building on these benchmarks, we develop MangaLMM, a manga-specialized model finetuned from the open-source LMM Qwen2.5-VL to jointly handle both tasks. Through extensive experiments, including comparisons with proprietary models such as GPT-4o and Gemini 2.5, we assess how well LMMs understand manga. Our benchmark and model provide a comprehensive foundation for evaluating and advancing LMMs in the richly narrative domain of manga.
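Since both benchmarks are evaluated by prompting an LMM with a page image and a question, a minimal inference sketch with the open-source base model may help make the setup concrete. It assumes the standard Hugging Face transformers interface for Qwen2.5-VL; the checkpoint ID, image path, and question below are placeholders, not artifacts from the paper.

```python
# Minimal VQA-style inference with a Qwen2.5-VL checkpoint via transformers.
# A manga-finetuned model such as MangaLMM could be loaded the same way if
# its weights are released in this format (an assumption, not a claim).
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"  # placeholder base checkpoint
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "manga_page.png"},  # placeholder page image
        {"type": "text", "text": "Why is the character in the last panel surprised?"},
    ],
}]

# Build the prompt, collect vision inputs, and generate an answer.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```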
Problem

Research questions and friction points this paper is trying to address.

Develop benchmarks for multimodal manga understanding
Evaluate LMMs' contextual comprehension of manga narratives
Create specialized model for manga text and visual tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces MangaOCR for text recognition
Develops MangaVQA for contextual understanding
Creates MangaLMM for multimodal manga tasks