🤖 AI Summary
This study addresses the underexplored challenge of fine-grained multimodal understanding of Japanese manga, a culturally specific narrative form characterized by tight image-text coupling, complex sequential storytelling, and domain-unique visual conventions.
Method: We introduce MangaVQA, the first dedicated multimodal benchmark for manga understanding, comprising 526 human-annotated visual question-answering (VQA) pairs, alongside a complementary MangaOCR sub-benchmark for text recognition. We propose a novel evaluation paradigm integrating layout-aware reasoning, temporal narrative structure, and cultural context. Building on Qwen2.5-VL, we develop MangaLMM, a lightweight, domain-specialized large multimodal model, trained via joint OCR-VQA fine-tuning and scene-level granularity partitioning.
Contribution/Results: MangaLMM achieves significant performance gains over state-of-the-art closed-source models (e.g., GPT-4o, Gemini 2.5) on MangaVQA, validating the efficacy of domain-specific fine-tuning. This work establishes the first reproducible, extensible benchmarking and modeling framework for anime/manga AI research.
📄 Abstract
Manga, or Japanese comics, is a richly multimodal narrative form that blends images and text in complex ways. Teaching large multimodal models (LMMs) to understand such narratives at a human-like level could help manga creators reflect on and refine their stories. To this end, we introduce two benchmarks for multimodal manga understanding: MangaOCR, which targets in-page text recognition, and MangaVQA, a novel benchmark designed to evaluate contextual understanding through visual question answering. MangaVQA consists of 526 high-quality, manually constructed question-answer pairs, enabling reliable evaluation across diverse narrative and visual scenarios. Building on these benchmarks, we develop MangaLMM, a manga-specialized model finetuned from the open-source LMM Qwen2.5-VL to jointly handle both tasks. Through extensive experiments, including comparisons with proprietary models such as GPT-4o and Gemini 2.5, we assess how well LMMs understand manga. Our benchmark and model provide a comprehensive foundation for evaluating and advancing LMMs in the richly narrative domain of manga.