TabRAG: Tabular Document Retrieval via Structured Language Representations

📅 2025-11-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Weak semantic representation of tabular content in embedding models degrades retrieval and generation performance in table-intensive document parsing for RAG. To address this, we propose TabRAG—the first framework to explicitly integrate structured linguistic representations (e.g., row/column relationships, cell-level semantics) into a parsing-aware RAG pipeline. TabRAG enhances semantic retrievability and generation readiness of tabular data through structure-aware document parsing and hierarchical table representation learning. Evaluated on multiple table-centric RAG benchmarks—including Tat-QA and WikiTableQuestions-RAG—TabRAG achieves state-of-the-art results, outperforming mainstream parsing methods by +12.3% in retrieval accuracy and significantly improving answer generation quality (+8.7 BLEU, +9.5 F1). The implementation is publicly available.

📝 Abstract
Ingesting data for Retrieval-Augmented Generation (RAG) involves either fine-tuning the embedding model directly on the target corpus or parsing documents so that an embedding model can encode them. The former, while accurate, is computationally expensive and demands substantial hardware; the latter performs poorly when extracting tabular data. In this work, we address the latter by presenting TabRAG, a parsing-based RAG pipeline designed to handle table-heavy documents via structured language representations. TabRAG outperforms existing popular parsing-based methods on both generation and retrieval. Code is available at https://github.com/jacobyhsi/TabRAG.
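
To make the parsing-based route concrete, the sketch below ranks already-parsed text chunks against a query. It is an illustration only, not the TabRAG implementation: the `embed` function is a toy bag-of-words stand-in for a learned embedding model, and the function names and sample chunks are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], k: int = 1) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


# Hypothetical output of a document parser: one text chunk per passage or table row.
chunks = [
    "The company was founded in 1998 and is based in London.",
    "For fiscal year 2023, revenue is 120 and net income is 15.",
    "For fiscal year 2022, revenue is 100 and net income is 12.",
]
top = retrieve("What was revenue for fiscal year 2023?", chunks)
```

Because retrieval only ever sees the parsed text, whatever the parser loses from a table (which header a number belongs to, for instance) is unavailable at query time, which is the weakness the abstract attributes to parsing-based ingestion.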

Problem

Research questions and friction points this paper is trying to address.

Improving tabular data extraction in document parsing
Enhancing retrieval-augmented generation for table-heavy documents
Overcoming performance limitations of parsing-based RAG methods

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses structured language representations for parsing
Optimizes RAG pipeline for table-heavy documents
Outperforms existing parsing-based retrieval methods
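
As a rough illustration of what a structured language representation of a table could look like, the sketch below rewrites each data cell as a sentence that names its table, row, and column. The exact representation TabRAG uses is not given in this summary, so the sentence template, the `table_to_statements` helper, and the sample table are all assumptions made for the example.

```python
def table_to_statements(caption: str, headers: list[str], rows: list[list[str]]) -> list[str]:
    """Emit one sentence per data cell, assuming the first column labels each row."""
    statements = []
    for row in rows:
        row_label = row[0]
        # Pair every remaining cell with its column header so the link survives as text.
        for col_name, value in zip(headers[1:], row[1:]):
            statements.append(
                f"In table '{caption}', for {headers[0]} {row_label}, {col_name} is {value}."
            )
    return statements


# Hypothetical table, as a parser might extract it from a financial report.
headers = ["Year", "Revenue", "Net income"]
rows = [["2022", "100", "12"], ["2023", "120", "15"]]
statements = table_to_statements("Financial summary", headers, rows)
```

Each resulting sentence carries its own row and column context, so it can be embedded and retrieved on its own without the surrounding grid.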

Authors

Jacob Si
Imperial College London
Mike Qu
Columbia University
Michelle Lee
Imperial College London
Yingzhen Li
Imperial College London
Artificial Intelligence, Machine Learning, Statistics