AI Summary
Existing MLVQA models suffer from data scarcity and unreliable evaluation on real-world multilingual handwritten documents: no current benchmark jointly addresses linguistic diversity, handwriting complexity, and robustness to OCR systems when ground-truth transcriptions are unavailable. To bridge this gap, we introduce HW-MLVQA, the first benchmark tailored to realistic multilingual handwritten document visual question answering. It comprises 1,600 handwritten pages and 2,400 cross-lingual question-answer pairs, and supports both unimodal (text or image) and multimodal (text plus image) evaluation. We propose a novel OCR-robust evaluation protocol that operates without ground-truth transcriptions, alongside a unified assessment framework integrating multimodal large language models, OCR engines, and VQA modules. HW-MLVQA thus serves as a standardized testbed for both proprietary and open-source OCR systems, substantially improving the evaluability, reproducibility, and practical deployability of handwritten document understanding systems.
Abstract
The proliferation of MultiLingual Visual Question Answering (MLVQA) benchmarks augments the capabilities of large language models (LLMs) and multimodal LLMs, enabling them to capture the intricate linguistic subtleties and visual complexities inherent across diverse languages. Despite this potential, current MLVQA models struggle to cope with the extensive variety of handwritten documents. This article presents HW-MLVQA, a pioneering VQA benchmark meticulously crafted to mitigate the dearth of authentic multilingual handwritten document comprehension resources. HW-MLVQA comprises 1,600 handwritten pages complemented by 2,400 question-answer pairs. Furthermore, it provides a robust benchmark evaluation framework spanning three distinct modalities: text, image, and an integrated image & text modality. To simulate authentic real-world contexts devoid of ground-truth textual transcriptions, it enables a rigorous assessment of proprietary and open-source OCR models. The benchmark aspires to drive pivotal advancements in multilingual handwritten document interpretation, fostering innovation and scholarly inquiry within this specialized domain.
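To make the three-modality protocol concrete, the sketch below shows one way such an evaluation loop could be organized. This is a hypothetical illustration, not the benchmark's actual API: the `Sample` schema, `ocr_engine.transcribe`, `vqa_model.answer`, and the exact-match scoring are all assumed stand-ins. In particular, note how the text modality substitutes an OCR transcription for the unavailable ground-truth text, which is what makes the protocol sensitive to OCR quality.

```python
# Hypothetical sketch of the three-modality evaluation described above.
# All object interfaces here are illustrative assumptions, not HW-MLVQA's API.
from dataclasses import dataclass

@dataclass
class Sample:
    page_image: bytes      # scanned handwritten page
    question: str
    reference_answer: str

def evaluate(samples, vqa_model, ocr_engine, modality="image+text"):
    """Score a VQA model under one of the three evaluation modalities."""
    correct = 0
    for s in samples:
        if modality == "text":
            # Text-only: an OCR transcription stands in for the
            # (unavailable) ground-truth transcription.
            context = ocr_engine.transcribe(s.page_image)
            pred = vqa_model.answer(question=s.question, text=context)
        elif modality == "image":
            # Image-only: the model reads the handwriting directly.
            pred = vqa_model.answer(question=s.question, image=s.page_image)
        else:
            # Integrated image & text: the scan plus its OCR transcription.
            context = ocr_engine.transcribe(s.page_image)
            pred = vqa_model.answer(question=s.question,
                                    image=s.page_image, text=context)
        # Simplistic exact-match scoring, purely for illustration.
        correct += int(pred.strip().lower() == s.reference_answer.strip().lower())
    return correct / len(samples)
```

Under such a loop, swapping `ocr_engine` between proprietary and open-source systems while holding the VQA model fixed isolates the OCR contribution, which is the comparison the benchmark is designed to support.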