🤖 AI Summary
This study addresses the difficulty current large language models (LLMs) have in recognizing user intent and reasoning predictively when a table question is implicitly predictive. To this end, the work introduces TopBench, the first systematically defined benchmark for this task, comprising four subtasks: point prediction, decision-making, treatment effect analysis, and complex filtering. Each requires models to generate both explanatory reasoning text and structured tabular outputs. Evaluating mainstream LLMs under both text-based and agentic workflows reveals a prevalent tendency to treat implicit predictive queries as simple lookup tasks. The findings indicate that accurate intent disambiguation is a prerequisite for predictive behavior, and that raising the ceiling on prediction accuracy requires integrating more advanced modeling or reasoning capabilities.
📝 Abstract
Large Language Models (LLMs) have advanced Table Question Answering, where most queries can be answered by information extraction or simple aggregation. However, a common class of real-world queries is implicitly predictive: answering them requires inferring unobserved values from historical patterns rather than retrieving stored ones. These queries introduce two challenges: recognizing the latent predictive intent and reasoning reliably over massive tables. To assess LLMs on such Tabular questiOn answering with implicit Prediction tasks, we introduce TopBench, a benchmark of 779 samples across four sub-tasks: single-point prediction, decision making, treatment effect analysis, and complex filtering. Models must generate outputs that span reasoning text and structured tables. We evaluate diverse models under both text-based and agentic workflows. Experiments reveal that current models often struggle with intent recognition and default to simple lookups. Deeper analysis shows that accurate intent disambiguation is a prerequisite for eliciting predictive behavior. Furthermore, raising the upper bound of prediction precision requires integrating more sophisticated modeling or reasoning capabilities.
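To make the distinction between retrieval and implicit prediction concrete, the following is a minimal sketch and is not taken from TopBench. The toy table, the question, and the helper names `lookup` and `predict_linear` are all hypothetical, and the least-squares line merely stands in for whatever predictive method a model might choose. The question "What will sales be in month 6?" has no matching row, so a lookup yields nothing, while fitting the historical trend yields an estimate.

```python
# Hypothetical toy table (not from the benchmark): months 1-5 are observed, month 6 is not.
table = [
    {"month": 1, "sales": 100.0},
    {"month": 2, "sales": 110.0},
    {"month": 3, "sales": 120.0},
    {"month": 4, "sales": 130.0},
    {"month": 5, "sales": 140.0},
]


def lookup(rows, month):
    """Retrieval-style answer: return the stored value, or None if no row matches."""
    for row in rows:
        if row["month"] == month:
            return row["sales"]
    return None


def predict_linear(rows, month):
    """Prediction-style answer: fit sales = slope * month + intercept by least squares."""
    n = len(rows)
    xs = [row["month"] for row in rows]
    ys = [row["sales"] for row in rows]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope * month + intercept


# "What will sales be in month 6?" is implicitly predictive: the row does not exist.
lookup_answer = lookup(table, 6)
predicted_answer = predict_linear(table, 6)
print("lookup:", lookup_answer)
print("prediction:", predicted_answer)
```

A system that treats the question as a lookup would report that the value is missing; recognizing the predictive intent is what triggers the second path.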