🤖 AI Summary
Existing non-text-centric multilingual VQA datasets rely heavily on translation, leading to visual–text misalignment, semantic distortion, and linguistic bias. To address these limitations, we introduce MTVQA—the first high-quality, text-centric multilingual VQA benchmark—covering nine languages, 2,116 scene-text-rich images, and 6,778 expert-annotated question–answer pairs. We propose a human-in-the-loop multilingual annotation paradigm tailored for TEC-VQA that circumvents translation artifacts by grounding questions and answers in the visual text present in each image. MTVQA also includes a multilingual training subset, and we empirically validate its utility for fine-tuning. Comprehensive evaluation of state-of-the-art MLLMs shows that even the best-performing model, Qwen2-VL, reaches only 30.9% accuracy, substantially below human performance (79.7%), confirming MTVQA's rigor and diagnostic value. This benchmark establishes a new standard for evaluating multilingual vision–language understanding, particularly in text-grounded reasoning.
📝 Abstract
Text-Centric Visual Question Answering (TEC-VQA) in its proper format not only facilitates human-machine interaction in text-centric visual environments but also serves as a de facto gold proxy to evaluate AI models in the domain of text-centric scene understanding. Nonetheless, most existing TEC-VQA benchmarks have focused on high-resource languages like English and Chinese. Despite pioneering works to expand multilingual QA pairs in non-text-centric VQA datasets through translation engines, the translation-based protocol encounters a substantial "visual-textual misalignment" problem when applied to TEC-VQA. Specifically, it prioritizes the text in question-answer pairs while disregarding the visual text present in images. Moreover, it fails to address complexities related to nuanced meaning, contextual distortion, language bias, and question-type diversity. In this work, we tackle multilingual TEC-VQA by introducing MTVQA, the first benchmark featuring high-quality human expert annotations across 9 diverse languages, consisting of 6,778 question-answer pairs across 2,116 images. Further, by comprehensively evaluating numerous state-of-the-art Multimodal Large Language Models (MLLMs), including Qwen2-VL, GPT-4o, GPT-4V, Claude3, and Gemini, on the MTVQA benchmark, we find that there is still large room for performance improvement (Qwen2-VL scoring 30.9 versus 79.7 for human performance), underscoring the value of MTVQA. Additionally, we supply multilingual training data within the MTVQA dataset, demonstrating that straightforward fine-tuning with this data can substantially enhance multilingual TEC-VQA performance. We aspire that MTVQA will offer the research community fresh insights and stimulate further exploration in multilingual visual text comprehension. The project homepage is available at https://bytedance.github.io/MTVQA/.