MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering

📅 2024-05-20
🏛️ arXiv.org
📈 Citations: 28
Influential: 1
🤖 AI Summary
Existing multilingual VQA datasets rely heavily on translation engines, leading to visual–text misalignment, semantic distortion, and linguistic bias. To address these limitations, we introduce MTVQA, the first high-quality, text-centric multilingual VQA benchmark, covering nine languages, 2,116 scene-text-rich images, and 6,778 expert-annotated question–answer pairs. We propose a human-in-the-loop multilingual annotation paradigm tailored for TEC-VQA, integrating cross-lingual consistency verification and scene-text-aware question generation to avoid translation artifacts. MTVQA also includes a multilingual training subset, whose utility for fine-tuning we validate empirically. Comprehensive evaluation of state-of-the-art MLLMs, including Qwen2-VL, GPT-4o, GPT-4V, Claude3, and Gemini, shows that even the strongest model (Qwen2-VL) reaches only 30.9 accuracy, far below human performance (79.7), confirming MTVQA's rigor and diagnostic value. The benchmark establishes a new standard for evaluating multilingual vision–language understanding, particularly text-grounded reasoning.

📝 Abstract
Text-Centric Visual Question Answering (TEC-VQA) in its proper format not only facilitates human-machine interaction in text-centric visual environments but also serves as a de facto gold proxy to evaluate AI models in the domain of text-centric scene understanding. Nonetheless, most existing TEC-VQA benchmarks have focused on high-resource languages like English and Chinese. Despite pioneering works to expand multilingual QA pairs in non-text-centric VQA datasets through translation engines, the translation-based protocol encounters a substantial "visual-textual misalignment" problem when applied to TEC-VQA. Specifically, it prioritizes the text in question-answer pairs while disregarding the visual text present in images. Moreover, it fails to address complexities related to nuanced meaning, contextual distortion, language bias, and question-type diversity. In this work, we tackle multilingual TEC-VQA by introducing MTVQA, the first benchmark featuring high-quality human expert annotations across 9 diverse languages, consisting of 6,778 question-answer pairs across 2,116 images. Further, by comprehensively evaluating numerous state-of-the-art Multimodal Large Language Models (MLLMs), including Qwen2-VL, GPT-4o, GPT-4V, Claude3, and Gemini, on the MTVQA benchmark, it is evident that there is still large room for performance improvement (Qwen2-VL scoring 30.9 versus 79.7 for human performance), underscoring the value of MTVQA. Additionally, we supply multilingual training data within the MTVQA dataset, demonstrating that straightforward fine-tuning with this data can substantially enhance multilingual TEC-VQA performance. We aspire that MTVQA will offer the research community fresh insights and stimulate further exploration in multilingual visual text comprehension. The project homepage is available at https://bytedance.github.io/MTVQA/.
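The abstract describes scoring MLLM answers against expert-annotated question-answer pairs across nine languages. A minimal sketch of how such per-language accuracy could be computed is shown below; the record format and the lenient inclusion-based match rule are illustrative assumptions, not MTVQA's official evaluation protocol.

```python
# Sketch of per-language accuracy scoring for a TEC-VQA benchmark.
# Assumed record format: {"lang": ..., "answer": ..., "prediction": ...}.

def normalize(text: str) -> str:
    """Lowercase and strip surrounding whitespace/punctuation for lenient matching."""
    return text.strip().strip(".,!?").lower()

def score(predictions: list[dict]) -> dict[str, float]:
    """Return accuracy per language. A prediction counts as correct if the
    normalized ground-truth answer appears inside the normalized prediction."""
    totals: dict[str, int] = {}
    correct: dict[str, int] = {}
    for rec in predictions:
        lang = rec["lang"]
        totals[lang] = totals.get(lang, 0) + 1
        if normalize(rec["answer"]) in normalize(rec["prediction"]):
            correct[lang] = correct.get(lang, 0) + 1
    return {lang: correct.get(lang, 0) / n for lang, n in totals.items()}

if __name__ == "__main__":
    demo = [
        {"lang": "de", "answer": "Berlin", "prediction": "The sign says Berlin."},
        {"lang": "de", "answer": "2024", "prediction": "It reads 2023."},
        {"lang": "fr", "answer": "Paris", "prediction": "paris"},
    ]
    print(score(demo))  # {'de': 0.5, 'fr': 1.0}
```

Averaging the per-language scores would then yield a single headline number comparable to the 30.9-versus-79.7 gap the abstract reports.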
Problem

Research questions and friction points this paper is trying to address.

Addressing multilingual text-centric visual question answering gaps
Solving visual-textual misalignment in translated VQA datasets
Enhancing AI models' multilingual text-centric scene understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Human expert annotations for multilingual TEC-VQA
Comprehensive evaluation of MLLMs on MTVQA
Multilingual training data for performance enhancement
👥 Authors
Jingqun Tang, ByteDance Inc. (Computer Vision, Document Intelligence, MLLM, Multimodal Generative Models)
Qi Liu, ByteDance
Yongjie Ye, ByteDance
Jinghui Lu, ByteDance Inc.; School of Computer Science, University College Dublin (Natural Language Processing, Multi-Modality, LLM, Human-in-the-loop Learning)
Shu Wei, ByteDance
Chunhui Lin, ByteDance
Wanqing Li, Professor, University of Wollongong (Multimedia Understanding, Computer Vision, Machine Learning)
Mohamad Fitri Faiz Bin Mahmood, ByteDance
Hao Feng, ByteDance
Zhen Zhao, ByteDance
Yanjie Wang, ByteDance
Yuliang Liu, Huazhong University of Science and Technology
Hao Liu, ByteDance
Xiang Bai, Huazhong University of Science and Technology (HUST) (Computer Vision, OCR)
Can Huang, ByteDance