TableRAG: A Retrieval Augmented Generation Framework for Heterogeneous Document Reasoning

📅 2025-06-12
🤖 AI Summary
Existing RAG methods face critical bottlenecks in multi-hop question answering over heterogeneous documents (text + tables): table flattening and coarse-grained chunking compromise structural integrity, causing information loss and hindering LLMs’ reasoning across text and tables. This paper proposes TableRAG, an iterative four-step framework that integrates natural language understanding with SQL-driven table operations: context-aware query decomposition, text retrieval, executable SQL generation and execution, and compositional intermediate answer aggregation. It also introduces HeteQA, a benchmark designed for heterogeneous multi-hop reasoning. Experiments on public datasets and HeteQA show that the method outperforms existing baselines, supporting the claim that structure-preserving table interaction improves multi-hop reasoning.

📝 Abstract
Retrieval-Augmented Generation (RAG) has demonstrated considerable effectiveness in open-domain question answering. However, when applied to heterogeneous documents that comprise both textual and tabular components, existing RAG approaches exhibit critical limitations. The prevailing practice of flattening tables and splitting them into chunks disrupts the intrinsic tabular structure, leads to information loss, and undermines the reasoning capabilities of LLMs on multi-hop, global queries. To address these challenges, we propose TableRAG, a hybrid framework that unifies textual understanding and complex manipulations over tabular data. TableRAG operates iteratively in four steps: context-sensitive query decomposition, text retrieval, SQL programming and execution, and compositional intermediate answer generation. We also develop HeteQA, a novel benchmark designed to evaluate multi-hop heterogeneous reasoning capabilities. Experimental results demonstrate that TableRAG consistently outperforms existing baselines on both public datasets and HeteQA, establishing a new state of the art for heterogeneous document question answering. We release TableRAG at https://github.com/yxh-y/TableRAG/tree/main.
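
The four-step loop named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: every function name here (`decompose`, `retrieve_text`, `run_sql`, `compose`, `answer`) is hypothetical, the toy table and passages are invented for illustration, and rule-based stubs stand in for the LLM calls that a real system would use to decompose the question and write the SQL.

```python
import re
import sqlite3


def tokens(text):
    """Lowercased word set, used by the toy keyword retriever."""
    return set(re.findall(r"\w+", text.lower()))


def decompose(question, known):
    """Step 1 (stub): choose the next sub-query from what is already known.

    Returns (sub_query, sql_or_None), or None when nothing is left to ask.
    A real system would prompt an LLM here; these rules are hard-coded
    for the single demo question below.
    """
    if not known:
        return ("Which company has the highest revenue?",
                "SELECT name FROM companies ORDER BY revenue DESC LIMIT 1")
    if len(known) == 1:
        return (f"Where is {known[0]} headquartered?", None)
    return None


def retrieve_text(sub_query, passages):
    """Step 2 (stub): return the passage with the largest word overlap."""
    q = tokens(sub_query)
    return max(passages, key=lambda p: len(q & tokens(p)))


def run_sql(conn, sql):
    """Step 3: execute the SQL against the intact table, if any was produced."""
    if sql is None:
        return None
    row = conn.execute(sql).fetchone()
    return row[0] if row else None


def compose(passage, sql_result):
    """Step 4 (stub): prefer the executed SQL result, else the text evidence."""
    return sql_result if sql_result is not None else passage


def answer(question, conn, passages, max_iters=4):
    """Iterate the four steps, accumulating intermediate answers."""
    known = []
    for _ in range(max_iters):
        step = decompose(question, known)
        if step is None:
            break
        sub_query, sql = step
        passage = retrieve_text(sub_query, passages)
        sql_result = run_sql(conn, sql)
        known.append(compose(passage, sql_result))
    return known


# Toy heterogeneous "document": one table plus a few text passages.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE companies (name TEXT, revenue INTEGER)")
conn.executemany("INSERT INTO companies VALUES (?, ?)",
                 [("Acme", 120), ("Globex", 340), ("Initech", 90)])
passages = [
    "Acme is headquartered in Denver.",
    "Globex is headquartered in Lyon.",
    "Initech is headquartered in Austin.",
]

steps = answer("Where is the company with the highest revenue headquartered?",
               conn, passages)
print(steps)
```

The first iteration resolves the table hop by executing SQL, and the second resolves the text hop using the intermediate answer from the first, which is the sense in which the decomposition is context-sensitive.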
Problem

Research questions and friction points this paper is trying to address.

Handles heterogeneous documents with text and tables
Preserves tabular structure to avoid information loss
Improves multi-hop reasoning in document question answering
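
To make the structure-loss problem concrete, here is a minimal sketch under assumed toy data: a table is flattened to text lines and chunked, a retriever is assumed to return only one chunk, and a global aggregate computed from that chunk is compared with the same aggregate computed by SQL over the intact table. The table contents, the chunk size, and the single-chunk retriever are all illustrative assumptions, not details from the paper.

```python
import sqlite3

# Hypothetical table: (region, sales)
rows = [("North", 10), ("South", 25), ("East", 40), ("West", 5)]

# Flatten each row to a text line, then chunk two lines at a time.
flattened = [f"region: {region}, sales: {sales}" for region, sales in rows]
chunks = [flattened[i:i + 2] for i in range(0, len(flattened), 2)]

# Assume the retriever returns only the first chunk: a global question
# such as "what are total sales?" then sees only part of the table.
top_chunk = chunks[0]
partial_total = sum(int(line.split("sales: ")[1]) for line in top_chunk)

# Keeping the table intact and querying it with SQL covers every row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, sales INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
full_total = conn.execute("SELECT SUM(sales) FROM sales").fetchone()[0]

print(partial_total, full_total)  # prints: 35 80
```

The gap between the two totals grows with table size, since a fixed retrieval budget covers a shrinking share of the rows while the SQL query always scans the whole table.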
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid framework for text and tabular data
Iterative four-step RAG process
Novel HeteQA benchmark for evaluation