🤖 AI Summary
Current large language models struggle with scientific table-based question answering tasks that require deep linguistic understanding coupled with complex numerical reasoning. To address this gap, this work introduces SciTaRC—the first expert-constructed benchmark that integrates scientific table comprehension, language-based reasoning, and advanced computation. Through a high-quality set of human-annotated question-answer pairs, the study systematically evaluates prominent language and code models, revealing that state-of-the-art models fail on at least 23% of tasks, with even the highly capable open-weight Llama-3.3-70B-Instruct failing on 65.5%. These findings expose a fundamental "execution bottleneck" in the models' ability to faithfully carry out multi-step reasoning and computational plans. This work establishes a new benchmark and offers critical insights for advancing table-based reasoning in scientific domains.
📝 Abstract
We introduce SciTaRC, an expert-authored benchmark of questions about tabular data in scientific papers that require both deep language reasoning and complex computation. We show that current state-of-the-art AI models fail on at least 23% of these questions, and the gap widens for open-weight models: the highly capable Llama-3.3-70B-Instruct fails on 65.5% of the tasks. Our analysis reveals a universal "execution bottleneck": both code and language models struggle to faithfully execute plans, even when provided with correct strategies. Specifically, code-based methods prove brittle on raw scientific tables, while natural language reasoning primarily fails due to initial comprehension issues and calculation errors.