HW-MLVQA: Elucidating Multilingual Handwritten Document Understanding with a Comprehensive VQA Benchmark

📅 2025-07-21
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Current MLVQA models suffer from data scarcity and inaccurate evaluation in real-world multilingual handwritten document understanding, lacking benchmarks that jointly address linguistic diversity, handwriting complexity, and robustness to OCR systems without ground-truth transcriptions. To bridge this gap, we introduce HW-MLVQAβ€”the first benchmark tailored to realistic multilingual handwritten document visual question answering. It comprises 1,600 handwritten pages and 2,400 cross-lingual question-answer pairs, supporting unimodal (text/image) and multimodal (text+image) evaluation. We propose a novel OCR-robust evaluation protocol that operates without ground-truth transcriptions, alongside a unified assessment framework integrating multimodal large language models, OCR engines, and VQA modules. HW-MLVQA serves as a standardized testbed for both proprietary and open-source OCR systems, substantially enhancing evaluability, reproducibility, and practical deployability of handwritten document understanding systems.

πŸ“ Abstract
The proliferation of MultiLingual Visual Question Answering (MLVQA) benchmarks augments the capabilities of large language models (LLMs) and multimodal LLMs, enabling them to capture the intricate linguistic subtleties and visual complexities inherent across diverse languages. Despite this potential, current MLVQA models struggle to fully utilize their capabilities when dealing with the extensive variety of handwritten documents. This article presents HW-MLVQA, a VQA benchmark meticulously crafted to mitigate the dearth of authentic multilingual handwritten document comprehension. HW-MLVQA comprises 1,600 handwritten pages complemented by 2,400 question-answer pairs. Furthermore, it provides a robust evaluation framework spanning three distinct modalities: text, image, and an integrated image & text modality. To simulate authentic real-world contexts devoid of ground-truth textual transcriptions, we facilitate a rigorous assessment of proprietary and open-source OCR models. The benchmark aspires to drive pivotal advancements in multilingual handwritten document interpretation, fostering innovation and scholarly inquiry within this specialized domain.
Problem

Research questions and friction points this paper is trying to address.

Addressing limitations in multilingual handwritten document understanding
Providing a comprehensive VQA benchmark for diverse languages
Evaluating OCR models in real-world multilingual contexts
Innovation

Methods, ideas, or system contributions that make the work stand out.

HW-MLVQA benchmark for multilingual handwritten documents
Integrates text, image, and multimodal evaluation framework
Assesses OCR models without ground truth transcriptions
Aniket Pal
CVIT, IIIT Hyderabad
Ajoy Mondal
CVIT, IIIT Hyderabad
Minesh Mathew
CVIT, IIIT Hyderabad
C. V. Jawahar
CVIT, IIIT Hyderabad, India
Computer Vision