🤖 AI Summary
This work addresses the limited ability of large language models (LLMs) to understand and manipulate tabular data embedded in real-world office documents, particularly PDFs and Excel files. To this end, the authors propose TableLLM, a 13B-parameter model specialized for table understanding and manipulation. The model is trained with a distant supervision method that has two parts: (i) a reasoning process extension strategy, which helps the model learn reasoning patterns over tables, and (ii) a cross-way validation strategy, which checks the quality of the automatically generated training data. The authors also build a multi-format (PDF/Excel) synthetic data pipeline, a benchmark covering both document and spreadsheet formats, and an evaluation pipeline that handles both scenarios. In their evaluations, TableLLM shows advantages over both general-purpose LLMs and existing table-focused LLMs on this benchmark. The project is open-sourced, including the model checkpoint, training and inference code, benchmarks, and an interactive web demo.
📝 Abstract
We introduce TableLLM, a robust large language model (LLM) with 13 billion parameters, purpose-built for tabular data manipulation tasks in real-world office scenarios, whether the tables are embedded in documents or in spreadsheets. For training, we propose a distant supervision method that comprises two strategies: a reasoning process extension strategy, which helps LLMs learn reasoning patterns more effectively, and a cross-way validation strategy, which ensures the quality of the automatically generated data. To evaluate TableLLM, we have built a benchmark that covers both document and spreadsheet formats, together with a well-organized evaluation pipeline capable of handling both scenarios. Thorough evaluations underscore the advantages of TableLLM over various existing general-purpose and tabular data-focused LLMs. The model checkpoint, source code, benchmarks, and a web application for user interaction are publicly available at https://github.com/TableLLM/TableLLM.