🤖 AI Summary
Current large language models struggle with scientific table-based question answering tasks that require deep linguistic understanding coupled with complex numerical reasoning. To address this gap, this work introduces SciTaRC—the first expert-constructed benchmark that integrates scientific table comprehension, language-based reasoning, and advanced computation. Through a high-quality set of human-annotated question-answer pairs, the study systematically evaluates prominent language and code models, revealing that state-of-the-art models fail on at least 23% of tasks, with even the highly capable open-weight Llama-3.3-70B-Instruct failing on 65.5%. These findings expose a fundamental "execution bottleneck" in the models' ability to faithfully carry out multi-step reasoning and computational plans. This work establishes a new benchmark and offers critical insights for advancing table-based reasoning in scientific domains.
📝 Abstract
We introduce SciTaRC, an expert-authored benchmark of questions about tabular data in scientific papers that require both deep language reasoning and complex computation. We show that current state-of-the-art AI models fail on at least 23% of these questions, and the gap widens for open-weight models: the highly capable Llama-3.3-70B-Instruct fails on 65.5% of the tasks. Our analysis reveals a universal "execution bottleneck": both code and language models struggle to faithfully execute plans, even when provided with correct strategies. Specifically, code-based methods prove brittle on raw scientific tables, while natural language reasoning primarily fails due to initial comprehension issues and calculation errors.