TabRAG: Tabular Document Retrieval via Structured Language Representations

📅 2025-11-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Embedding models represent tabular content poorly, which degrades both retrieval and generation in RAG pipelines that parse table-heavy documents. To address this, we propose TabRAG, the first framework to explicitly integrate structured linguistic representations (e.g., row/column relationships, cell-level semantics) into a parsing-aware RAG pipeline. TabRAG makes tabular data easier to retrieve and to generate from through structure-aware document parsing and hierarchical table representation learning. Evaluated on multiple table-centric RAG benchmarks, including TAT-QA and WikiTableQuestions-RAG, TabRAG achieves state-of-the-art results, outperforming mainstream parsing methods by +12.3% in retrieval accuracy and improving answer generation quality (+8.7 BLEU, +9.5 F1). The implementation is publicly available.

📝 Abstract

Ingesting data for Retrieval-Augmented Generation (RAG) involves either fine-tuning the embedding model directly on the target corpus or parsing documents for embedding model encoding. The former, while accurate, incurs high computational hardware requirements, while the latter suffers from suboptimal performance when extracting tabular data. In this work, we address the latter by presenting TabRAG, a parsing-based RAG pipeline designed to tackle table-heavy documents via structured language representations. TabRAG outperforms existing popular parsing-based methods for generation and retrieval. Code is available at https://github.com/jacobyhsi/TabRAG.

Problem

Research questions and friction points this paper is trying to address.

Improving tabular data extraction in document parsing

Enhancing retrieval-augmented generation for table-heavy documents

Overcoming performance limitations of parsing-based RAG methods

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses structured language representations for parsing

Optimizes RAG pipeline for table-heavy documents

Outperforms existing parsing-based retrieval methods
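As a rough illustration of what a "structured language representation" of a table could look like, the sketch below renders each cell as a sentence that names its row and column, so that an embedding model sees explicit row/column context instead of a flat grid. This is a hypothetical example, not TabRAG's actual format: the function name `table_to_statements`, the sentence template, and the sample table are all assumptions made for illustration.

```python
from typing import List


def table_to_statements(
    header: List[str], rows: List[List[str]], row_key: int = 0
) -> List[str]:
    """Render every non-key cell as one sentence naming its row and column.

    Hypothetical helper: the sentence template is an assumption, not the
    representation used by TabRAG.
    """
    statements = []
    for row in rows:
        row_label = row[row_key]  # the cell that identifies this row
        for col_idx, (col_name, value) in enumerate(zip(header, row)):
            if col_idx == row_key:
                continue  # the key column is already used as the row label
            statements.append(
                f"For {header[row_key]} '{row_label}', the {col_name} is {value}."
            )
    return statements


# Made-up sample table, for illustration only.
header = ["Region", "Revenue (USD m)", "Growth (%)"]
rows = [["North", "120", "4.5"], ["South", "95", "-1.2"]]

statements = table_to_statements(header, rows)
for line in statements:
    print(line)
```

Each resulting sentence can be embedded and indexed on its own or grouped by row; which granularity works best is an empirical question that this sketch does not address.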

🔎 Similar Papers

TableRAG: Million-Token Table Understanding with Language Models

2024-10-07 · arXiv.org · Citations: 3