SciTaRC: Benchmarking QA on Scientific Tabular Data that Requires Language Reasoning and Complex Computation

📅 2026-03-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current large language models struggle with scientific table-based question answering tasks that require deep linguistic understanding coupled with complex numerical reasoning. To address this gap, this work introduces SciTaRC—the first expert-constructed benchmark that integrates scientific table comprehension, language-based reasoning, and advanced computation. Through a high-quality set of human-annotated question-answer pairs, the study systematically evaluates prominent language and code models, revealing that state-of-the-art models fail on at least 23% of the questions, and that even a highly capable open-weight model, Llama-3.3-70B-Instruct, fails on 65.5% of the tasks. These findings expose a fundamental "execution bottleneck" in the models' ability to faithfully carry out multi-step reasoning and computational plans. This work establishes a new benchmark and offers critical insights for advancing tabular reasoning in scientific domains.

📝 Abstract
We introduce SciTaRC, an expert-authored benchmark of questions about tabular data in scientific papers requiring both deep language reasoning and complex computation. We show that current state-of-the-art AI models fail on at least 23% of these questions, a gap that remains significant even for highly capable open-weight models like Llama-3.3-70B-Instruct, which fails on 65.5% of the tasks. Our analysis reveals a universal "execution bottleneck": both code and language models struggle to faithfully execute plans, even when provided with correct strategies. Specifically, code-based methods prove brittle on raw scientific tables, while natural language reasoning primarily fails due to initial comprehension issues and calculation errors.
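The brittleness of code-based methods on raw scientific tables can be illustrated with a small sketch. The table rows, footnote markers, and question below are hypothetical and not drawn from the benchmark; the point is that cells in real papers often mix numbers with formatting residue, so a naive `float()` cast fails before any computation begins.

```python
# Illustrative sketch (hypothetical data, not from SciTaRC): why
# code-based QA can be brittle on raw scientific tables. Cells mix
# numbers with footnote markers, so naive float() casting breaks.

import re

# Hypothetical results table: model accuracy (%) with artifacts.
rows = [
    ["Model", "Accuracy (%)"],
    ["Baseline", "71.2*"],       # asterisk footnote marker
    ["Ours", "76.5"],
    ["Human", "93.0\u2020"],     # dagger footnote marker
]

def parse_cell(cell: str) -> float:
    """Strip non-numeric residue (footnotes, symbols) before casting."""
    match = re.search(r"-?\d+(?:\.\d+)?", cell)
    if match is None:
        raise ValueError(f"no number in cell: {cell!r}")
    return float(match.group())

# A question like "How much better is Ours than Baseline?" needs both
# comprehension (select the right rows) and computation (subtract).
values = {name: parse_cell(acc) for name, acc in rows[1:]}
gap = values["Ours"] - values["Baseline"]
print(f"Ours - Baseline = {gap:.1f} points")
```

Here `float("71.2*")` would raise a `ValueError`, so even this toy example needs a cleaning step; real scientific tables add merged headers, units, and multi-level rows on top of that, which is where generated code tends to fail.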
Problem

Research questions and friction points this paper is trying to address.

scientific tabular data
question answering
language reasoning
complex computation
AI model evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

scientific tabular reasoning
language reasoning
complex computation
execution bottleneck
QA benchmark
Hexuan Wang
Center for Language and Speech Processing, Johns Hopkins University
Yaxuan Ren
Center for Language and Speech Processing, Johns Hopkins University
Srikar Bommireddypalli
Center for Language and Speech Processing, Johns Hopkins University
Shuxian Chen
Center for Language and Speech Processing, Johns Hopkins University
Adarsh Prabhudesai
Center for Language and Speech Processing, Johns Hopkins University
Rongkun Zhou
Center for Language and Speech Processing, Johns Hopkins University
Elina Baral
Center for Language and Speech Processing, Johns Hopkins University
Philipp Koehn
Professor, Johns Hopkins University
Machine Translation · Natural Language Processing