🤖 AI Summary
Large language models (LLMs) show brittle reasoning, sensitivity to table scale, and degraded performance on complex queries in temporal tabular question answering. Method: We propose a SQL-driven symbolic reasoning framework that maps temporal tables to symbolic intermediate representations aligned with database schemas, coupled with a context-aware adaptive few-shot prompting mechanism that selects contextually tailored examples—enabling synergistic reasoning between LLMs and structured query logic. Contribution/Results: Evaluated on our newly constructed synthetic benchmark TempTabQA-C, our approach achieves state-of-the-art performance across multi-scale temporal QA tasks. It significantly improves accuracy on complex queries, generalization across temporal patterns, and resistance to memorization bias, while enhancing scalability and robustness. This work sets a new standard for LLM-based temporal tabular reasoning.
📝 Abstract
Temporal tabular question answering presents a significant challenge for Large Language Models (LLMs): it requires robust reasoning over structured data, a task where traditional prompting methods often fall short. These methods suffer from memorization, sensitivity to table size, and reduced performance on complex queries. To overcome these limitations, we introduce TempTabQA-C, a synthetic dataset designed for systematic and controlled evaluation, alongside a symbolic intermediate representation that transforms tables into database schemas. This structured approach allows LLMs to generate and execute SQL queries, improving generalization and mitigating memorization biases. Combined with adaptive few-shot prompting using contextually tailored examples, our method achieves superior robustness, scalability, and performance. Experimental results consistently show improvements across these key challenges, setting a new benchmark for robust temporal reasoning with LLMs.
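The core idea above—mapping a temporal table to a database schema so the model answers by generating and executing SQL rather than reading the table in-context—can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: the `career` table, its columns, and the question are hypothetical, and the SQL string stands in for what an LLM would generate.

```python
import sqlite3

# Map a small temporal table (a person's club history) to a relational
# schema. In the paper's framework this symbolic intermediate
# representation is derived from the source table; here it is hand-built.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE career (
           person     TEXT,
           team       TEXT,
           start_year INTEGER,
           end_year   INTEGER
       )"""
)
conn.executemany(
    "INSERT INTO career VALUES (?, ?, ?, ?)",
    [
        ("A. Player", "Club X", 2005, 2009),
        ("A. Player", "Club Y", 2009, 2014),
        ("A. Player", "Club Z", 2014, 2016),
    ],
)

# Stand-in for an LLM-generated query answering:
# "Which team did A. Player join immediately after Club X?"
# Executing the query symbolically grounds the answer in the table,
# rather than in facts the model may have memorized.
query = """
    SELECT team FROM career
    WHERE person = 'A. Player'
      AND start_year = (SELECT end_year FROM career
                        WHERE person = 'A. Player' AND team = 'Club X')
"""
answer = conn.execute(query).fetchone()[0]
print(answer)  # Club Y
```

Because the answer comes from query execution rather than free-form generation, the same approach scales to larger tables without overloading the model's context, which is one motivation for the symbolic representation described above.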