🤖 AI Summary
Existing table-based question answering benchmarks struggle to capture the complex reasoning challenges of real-world industrial scenarios, such as multi-table relationships, nested headers, and large-scale data. To bridge this gap, this work introduces ReasonTabQA, the first high-complexity bilingual table QA benchmark tailored to authentic industrial applications. It spans 30 domains and 1,932 tables, each with annotated answers and explicit reasoning chains, and supports both chain-of-thought and non-chain-of-thought paradigms. The authors further propose TabCodeRL, a reinforcement learning method that combines table structure awareness with a verifiable reasoning reward to guide large language models toward logically sound, verifiable reasoning paths. Experiments show that TabCodeRL significantly improves the performance of open-source models, yet a notable gap to human-level accuracy remains, underscoring the inherent difficulty of industrial-scale table question answering.
📝 Abstract
Recent advancements in Large Language Models (LLMs) have significantly catalyzed table-based question answering (TableQA). However, existing TableQA benchmarks often overlook the intricacies of industrial scenarios, which are characterized by multi-table structures, nested headers, and massive scale. These environments demand robust table reasoning through deep structured inference, a significant challenge that remains inadequately addressed by current methodologies. To bridge this gap, we present ReasonTabQA, a large-scale bilingual benchmark encompassing 1,932 tables across 30 industry domains such as energy and automotive. ReasonTabQA provides high-quality annotations for both final answers and explicit reasoning chains, supporting both thinking and no-thinking paradigms. Furthermore, we introduce TabCodeRL, a reinforcement learning method that leverages table-aware verifiable rewards to guide the generation of logical reasoning paths. Extensive experiments on ReasonTabQA and four TableQA datasets demonstrate that while TabCodeRL yields substantial performance gains on open-source LLMs, the persistent performance gap on ReasonTabQA underscores the inherent complexity of real-world industrial TableQA.