Efficient Table Retrieval and Understanding with Multimodal Large Language Models

📅 2026-02-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of efficiently retrieving relevant tables from large-scale collections of tabular images and reasoning over them to answer user queries accurately. To this end, the authors propose TabRAG, a framework that introduces a three-stage pipeline (retrieval, reranking, and reasoning) into multimodal table understanding. The framework employs jointly trained visual-text foundation models for initial retrieval, leverages a multimodal large language model (MLLM) for fine-grained reranking of the candidate tables, and generates answers conditioned on the selected tables. Evaluated on a newly constructed dataset comprising 88,161 training and 9,819 test samples, TabRAG achieves substantial performance gains, improving retrieval recall by 7.0% and answer accuracy by 6.1%.

📝 Abstract
Tabular data is frequently captured in image form across a wide range of real-world scenarios such as financial reports, handwritten records, and document scans. These visual representations pose unique challenges for machine understanding, as they combine both structural and visual complexities. While recent advances in Multimodal Large Language Models (MLLMs) show promising results in table understanding, they typically assume the relevant table is readily available. However, a more practical scenario involves identifying and reasoning over relevant tables from large-scale collections to answer user queries. To address this gap, we propose TabRAG, a framework that enables MLLMs to answer queries over large collections of table images. Our approach first retrieves candidate tables using jointly trained visual-text foundation models, then leverages MLLMs to perform fine-grained reranking of these candidates, and finally employs MLLMs to reason over the selected tables for answer generation. Through extensive experiments on a newly constructed dataset comprising 88,161 training and 9,819 testing samples across 8 benchmarks with 48,504 unique tables, we demonstrate that our framework significantly outperforms existing methods by 7.0% in retrieval recall and 6.1% in answer accuracy, offering a practical solution for real-world table understanding tasks.
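The retrieve, rerank, and reason stages described in the abstract can be sketched as a simple pipeline. The sketch below is purely illustrative: the paper's actual system uses jointly trained visual-text foundation models for retrieval and MLLMs for reranking and answer generation, whereas every function here (`embed`, `retrieve`, `rerank`, `answer`) is a hypothetical stand-in operating on text placeholders.

```python
from dataclasses import dataclass

@dataclass
class TableImage:
    """Stand-in for a table image; `content` substitutes for pixel data."""
    table_id: str
    content: str

def embed(text: str) -> set:
    """Placeholder embedding: a bag of lowercase tokens.
    The real system would use a visual-text foundation model."""
    return set(text.lower().split())

def retrieve(query: str, corpus: list, k: int = 3) -> list:
    """Stage 1: coarse retrieval of top-k candidates by Jaccard similarity."""
    q = embed(query)
    def score(t: TableImage) -> float:
        d = embed(t.content)
        return len(q & d) / max(len(q | d), 1)
    return sorted(corpus, key=score, reverse=True)[:k]

def rerank(query: str, candidates: list) -> list:
    """Stage 2: fine-grained reranking; stands in for MLLM-based scoring."""
    q = embed(query)
    return sorted(candidates, key=lambda t: len(q & embed(t.content)),
                  reverse=True)

def answer(query: str, table: TableImage) -> str:
    """Stage 3: answer generation conditioned on the selected table (stubbed)."""
    return f"Answer derived from {table.table_id} for: {query}"

def tabrag_pipeline(query: str, corpus: list) -> str:
    """Run the full retrieve -> rerank -> reason pipeline."""
    candidates = retrieve(query, corpus)
    best = rerank(query, candidates)[0]
    return answer(query, best)
```

The key design point the sketch preserves is the two-step narrowing: a cheap retriever scans the whole collection, and only the small candidate set is passed to the expensive reranker before a single table is handed to the reasoning stage.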
Problem

Research questions and friction points this paper is trying to address.

table retrieval
table understanding
multimodal large language models
visual table QA
document image analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Large Language Models
Table Retrieval
Visual-Text Foundation Models
Table Understanding
Reranking