WikiMixQA: A Multimodal Benchmark for Question Answering over Tables and Charts

📅 2025-06-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the underexplored challenge of cross-modal reasoning over tables and figures in long documents. We introduce the first long-document question-answering benchmark focused on multi-source information fusion—comprising 1,000 multiple-choice questions spanning seven Wikipedia domains—that demands multi-hop reasoning across text, tables, and figures. Methodologically, we extract structured image-text data from real Wikipedia pages and manually construct questions requiring both long-context retrieval and cross-modal synthesis; we then conduct zero-shot evaluation of 12 state-of-the-art vision-language models (VLMs). Key contributions include: (i) the first systematic assessment of VLMs' multimodal reasoning capabilities in realistic long-document settings; (ii) empirical evidence of a severe performance gap under long-context conditions (e.g., GPT-4o achieves only 50.2%, open-source models ≤27%), in contrast with ~70% accuracy in direct-context settings—highlighting a critical bottleneck in long-range multimodal understanding; and (iii) public release of the benchmark and evaluation code.

📝 Abstract
Documents are fundamental to preserving and disseminating information, often incorporating complex layouts, tables, and charts that pose significant challenges for automatic document understanding (DU). While vision-language large models (VLLMs) have demonstrated improvements across various tasks, their effectiveness in processing long-context vision inputs remains unclear. This paper introduces WikiMixQA, a benchmark comprising 1,000 multiple-choice questions (MCQs) designed to evaluate cross-modal reasoning over tables and charts extracted from 4,000 Wikipedia pages spanning seven distinct topics. Unlike existing benchmarks, WikiMixQA emphasizes complex reasoning by requiring models to synthesize information from multiple modalities. We evaluate 12 state-of-the-art vision-language models, revealing that while proprietary models achieve ~70% accuracy when provided with direct context, their performance deteriorates significantly when retrieval from long documents is required. Among these, GPT-4o is the only model exceeding 50% accuracy in this setting, whereas open-source models perform considerably worse, with a maximum accuracy of 27%. These findings underscore the challenges of long-context, multi-modal reasoning and establish WikiMixQA as a crucial benchmark for advancing document understanding research.
Problem

Research questions and friction points this paper is trying to address.

Evaluating cross-modal reasoning over tables and charts
Assessing VLLMs' performance on long-context vision inputs
Benchmarking complex reasoning with multi-modal document synthesis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal benchmark for table and chart QA
Evaluates cross-modal reasoning with MCQs
Tests long-context vision-language model performance