TableLLM: Enabling Tabular Data Manipulation by LLMs in Real Office Usage Scenarios

📅 2024-03-28
🏛️ arXiv.org
📈 Citations: 15
Influential: 1

🤖 AI Summary
This work addresses the limited ability of large language models (LLMs) to understand and manipulate tabular data embedded in real-world office documents, particularly PDFs and Excel files. To this end, the authors propose TableLLM, a 13B-parameter model specialized for table understanding and manipulation. Methodologically, they introduce a distant-supervision training framework featuring (i) a reasoning-process extension strategy that helps the model learn reasoning patterns over tables, and (ii) a cross-way validation strategy that checks the quality of the automatically generated training data. They further build a benchmark and an evaluation pipeline that cover both document-embedded and spreadsheet tables. Experiments show that TableLLM outperforms both general-purpose LLMs and existing table-specialized models on this benchmark. The project is open-sourced, including the model checkpoint, source code, benchmarks, and an interactive web application.

📝 Abstract
We introduce TableLLM, a robust large language model (LLM) with 13 billion parameters, purpose-built for tabular data manipulation tasks in real-world office scenarios, whether the tables are embedded in documents or in spreadsheets. We propose a distant supervision method for training, which comprises a reasoning process extension strategy, which helps train LLMs to understand reasoning patterns more effectively, and a cross-way validation strategy, which ensures the quality of the automatically generated data. To evaluate the performance of TableLLM, we have crafted a benchmark that covers both document and spreadsheet formats and constructed a well-organized evaluation pipeline capable of handling both scenarios. Thorough evaluations underscore the advantages of TableLLM over various existing general-purpose and tabular data-focused LLMs. We have publicly released the model checkpoint, source code, benchmarks, and a web application for user interaction. Our code and data are publicly available at https://github.com/TableLLM/TableLLM.
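
The abstract names a cross-way validation strategy for automatically generated data but does not describe its mechanics. One plausible reading, sketched here purely as an illustration, is agreement filtering: answer each generated question through two independent routes and keep the sample only when the two answers match. Everything in the sketch is an assumption made for the example, not the authors' implementation: the names `Sample`, `normalize`, and `cross_validate`, the idea that the two routes are plain callables, and the rule used to compare answers.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass
class Sample:
    """A hypothetical auto-generated training example: a question about a table."""
    question: str
    table: List[dict]


# A "path" is any independent way of answering a sample (hypothetical type alias).
AnswerPath = Callable[[Sample], Optional[object]]


def normalize(answer: object) -> str:
    """Canonicalize an answer so that '30.0', ' 30 ' and 30 compare equal."""
    text = str(answer).strip().lower()
    try:
        return f"{float(text):.6g}"
    except ValueError:
        return text


def cross_validate(
    samples: Iterable[Sample],
    path_a: AnswerPath,
    path_b: AnswerPath,
) -> List[Tuple[Sample, object]]:
    """Keep only the samples whose two independently derived answers agree."""
    kept = []
    for sample in samples:
        answer_a = path_a(sample)
        answer_b = path_b(sample)
        # A path that fails to produce an answer cannot confirm the sample.
        if answer_a is None or answer_b is None:
            continue
        if normalize(answer_a) == normalize(answer_b):
            kept.append((sample, answer_a))
    return kept
```

In practice the two paths might be, for example, executing generated code against the table and asking a model to answer in text, but which routes TableLLM actually uses is not stated in this summary. The design point of the sketch is only that disagreement between routes is treated as a signal to discard the sample rather than to pick a winner.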
Problem

Research questions and friction points this paper is trying to address.

LLM for tabular data
Distant supervision training
Real office usage scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large language model for tables
Distant supervision training method
Benchmarks for document and spreadsheet