Semantic Similarity is a Spurious Measure of Comic Understanding: Lessons Learned from Hallucinations in a Benchmarking Experiment

📅 2026-03-02
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This study addresses the pervasive hallucination issues in current generative vision-language models (VLMs) when performing page-level semantic understanding of comics, which critically hinders effective access for visually impaired users. The work presents the first systematic taxonomy of hallucinations specific to comic understanding and introduces a novel evaluation benchmark tailored for accessibility applications. Through human-in-the-loop analysis, it reveals significant deficiencies in existing models' semantic coherence and contextual reasoning capabilities, while also highlighting the inadequacy of relying solely on semantic similarity metrics for evaluation. Building on these insights, the authors propose targeted data refinement and hallucination mitigation strategies, laying a foundational framework for developing reliable and interpretable VLMs that support accessible comic comprehension.

📝 Abstract
A system that enables blind or visually impaired users to access comics/manga would introduce a new medium of storytelling to this community. However, no such system currently exists. Generative vision-language models (VLMs) have shown promise in describing images and understanding comics, but most research on comic understanding is limited to panel-level analysis. To fully support blind and visually impaired users, greater attention must be paid to page-level understanding and interpretation. In this work, we present a preliminary benchmark of VLM performance on comic interpretation tasks. We identify and categorize hallucinations that emerge during this process, organizing them into generalized object-hallucination taxonomies. We conclude with guidance on future research, emphasizing hallucination mitigation and improved data curation for comic interpretation.
Problem

Research questions and friction points this paper is trying to address.

comic understanding
vision-language models
hallucination
accessibility
page-level interpretation
Innovation

Methods, ideas, or system contributions that make the work stand out.

hallucination taxonomy
comic understanding
vision-language models
benchmarking
accessibility
Christopher Driggers-Ellis
University of Florida, Gainesville, FL 32611
Nachiketh Tibrewal
University of Florida, Gainesville, FL 32611
Rohit Bogulla
University of Florida, Gainesville, FL 32611
Harsh Khanna
University of Florida, Gainesville, FL 32611
Sangpil Youm
Ph.D. Student, University of Florida
Natural Language Processing, Artificial Intelligence, Network Science
Christan Grant
Associate Professor, University of Florida
Interactive Machine Learning, Natural Language Processing, Visualization, Data Mining, Privacy
Bonnie Dorr
University of Florida, IHMC, UMD
Artificial Intelligence, Natural Language Processing, Machine Translation