🤖 AI Summary
To address inaccurate numerical computation in free-form table question answering, this paper proposes TabLaP, a framework that uses large language models (LLMs) as multi-step reasoning planners rather than answer generators, delegating all numerical operations to a Python interpreter for computational precision. By decoupling multi-step reasoning from numeric computation, TabLaP combines LLM planning with deterministic execution. Recognizing that LLM outputs can be unreliable, it also makes a first attempt to quantify the trustworthiness of its answers, enabling regret-aware use. Evaluated on two benchmark datasets, TabLaP improves answer accuracy over prior state-of-the-art methods by 5.7% and 5.8%, respectively.
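The planner/executor split described above can be pictured with a short sketch. Everything here is illustrative, not TabLaP's actual implementation: `call_llm` is a hypothetical stand-in for any chat-completion client, and the pandas-based prompt format is an assumption.

```python
# Minimal sketch of the planner/executor split: the LLM plans, the
# Python interpreter computes. Illustrative only; not TabLaP's code.
import pandas as pd

def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with a real chat-completion client."""
    raise NotImplementedError

def answer_table_question(table: pd.DataFrame, question: str):
    # 1) The LLM acts as a planner: it writes Python over the table
    #    rather than guessing the numeric answer in natural language.
    prompt = (
        "You are given a pandas DataFrame named `table` with columns "
        f"{list(table.columns)}.\n"
        f"Question: {question}\n"
        "Write Python code that computes the answer and stores it in `answer`."
    )
    plan_code = call_llm(prompt)

    # 2) The Python interpreter, not the LLM, performs the arithmetic,
    #    so the numeric result is exact by construction.
    scope = {"table": table}
    exec(plan_code, scope)  # in practice this should be sandboxed
    return scope.get("answer")
```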
📝 Abstract
Question answering on free-form tables (a.k.a. TableQA) is a challenging task because of the flexible structure and complex schema of tables. Recent studies use Large Language Models (LLMs) for this task, exploiting their capability in understanding the questions and tabular data, which are typically given in natural language and contain many textual fields, respectively. While this approach has shown promising results, it overlooks the challenges posed by numerical values, which are common in tabular data and which LLMs are known to struggle with. We aim to address this issue, and we propose a model named TabLaP that uses an LLM as a planner rather than an answer generator. This approach exploits the multi-step reasoning capability of LLMs while leaving the actual numerical calculations to a Python interpreter for accuracy. Recognizing the inaccurate nature of LLMs, we further make a first attempt to quantify the trustworthiness of the answers produced by TabLaP, such that users can use TabLaP in a regret-aware manner. Experimental results on two benchmark datasets show that TabLaP is substantially more accurate than state-of-the-art models, improving answer accuracy by 5.7% and 5.8% on the two datasets, respectively.
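The abstract does not detail how trustworthiness is quantified. One simple, commonly used proxy is self-consistency: sample several independent plans and treat their agreement rate as a confidence score. The sketch below illustrates that idea under this assumption only; `answer_table_question` refers to the hypothetical planner sketch earlier on this page, and the 0.6 abstention threshold is arbitrary.

```python
# A plausible (assumed) trustworthiness proxy via self-consistency:
# run the planner several times and use agreement as confidence.
# This is NOT TabLaP's actual quantification method.
from collections import Counter

def answer_with_confidence(table, question, n_samples: int = 5,
                           threshold: float = 0.6):
    # `answer_table_question` is the hypothetical planner/executor
    # sketched above.
    answers = [answer_table_question(table, question)
               for _ in range(n_samples)]
    best, votes = Counter(map(str, answers)).most_common(1)[0]
    confidence = votes / n_samples
    # Regret-aware use: abstain (return None) when confidence is low,
    # so the user can fall back to manual inspection instead of
    # risking a wrong answer.
    return (best, confidence) if confidence >= threshold else (None, confidence)
```

Abstaining below a threshold is one concrete way to realize "regret-aware" use: the user trades coverage for a bounded risk of acting on a wrong answer.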