TableRAG: Million-Token Table Understanding with Language Models

📅 2024-10-07
🏛️ arXiv.org
📈 Citations: 3
Influential: 1
🤖 AI Summary
Existing language models are constrained by fixed context windows, making them ineffective at comprehending ultra-large tables comprising millions of tokens and leading to significant information loss and reasoning failures. To address this, we propose TableRAG, a retrieval-augmented generation (RAG) framework designed specifically for large-scale tabular understanding. Our method combines query expansion with schema retrieval and cell retrieval to pinpoint query-relevant information before it is passed to the language model. For evaluation, we construct two million-token-scale benchmarks from the Arcade and BIRD-SQL datasets. Experimental results demonstrate substantial improvements in retrieval quality, yielding state-of-the-art performance on these benchmarks. Crucially, our approach shortens prompt length and reduces redundant input, enabling scalable, high-fidelity comprehension of ultra-long tables. This work establishes a practical paradigm for handling massively long tabular data with foundation models.

📝 Abstract
Recent advancements in language models (LMs) have notably enhanced their ability to reason with tabular data, primarily through program-aided mechanisms that manipulate and analyze tables. However, these methods often require the entire table as input, leading to scalability challenges due to positional bias or context-length constraints. In response to these challenges, we introduce TableRAG, a Retrieval-Augmented Generation (RAG) framework specifically designed for LM-based table understanding. TableRAG leverages query expansion combined with schema and cell retrieval to pinpoint crucial information before providing it to the LMs. This enables more efficient data encoding and precise retrieval, significantly reducing prompt lengths and mitigating information loss. We have developed two new million-token benchmarks from the Arcade and BIRD-SQL datasets to thoroughly evaluate TableRAG's effectiveness at scale. Our results demonstrate that TableRAG's retrieval design achieves the highest retrieval quality, leading to new state-of-the-art performance on large-scale table understanding.
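The pipeline described in the abstract (expand the query, retrieve matching schema entries and cell values, and pass only those to the LM) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the paper uses an LM for query expansion and embedding-based retrieval, whereas this sketch substitutes keyword splitting and a toy lexical-overlap score; the example table, its schema strings, and all function names are hypothetical.

```python
from collections import Counter

def expand_queries(question: str) -> list[str]:
    # Stand-in for LM-based query expansion (assumption): keep the full
    # question and add its individual keywords as extra queries.
    return [question] + question.lower().replace("?", "").split()

def score(query: str, text: str) -> int:
    # Toy lexical-overlap score in place of embedding similarity (assumption):
    # count tokens shared between the query and the candidate entry.
    q, t = Counter(query.lower().split()), Counter(text.lower().split())
    return sum((q & t).values())

def retrieve(entries: list[str], queries: list[str], k: int = 2) -> list[str]:
    # Rank each entry by its best score over all expanded queries; keep top-k.
    ranked = sorted(entries,
                    key=lambda e: max(score(q, e) for q in queries),
                    reverse=True)
    return ranked[:k]

# Hypothetical table, represented by schema entries and distinct cell values.
schema = ["country : text", "population : int", "gdp_usd : float"]
cells = ["country = France", "country = Japan", "population = 67000000"]

queries = expand_queries("What is the population of France?")
# Only the retrieved schema and cells enter the prompt, so its length is
# bounded by k rather than by the table size.
prompt_context = retrieve(schema, queries) + retrieve(cells, queries)
```

The point of the design is that the prompt grows with the retrieval budget `k`, not with the number of rows, which is what makes million-token tables tractable.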
Problem

Research questions and friction points this paper is trying to address.

Large-scale Tabular Data
Language Models
Information Efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

TableRAG
Efficient Table Data Processing
Large-scale Dataset Performance