ATLAS: A High-Difficulty, Multidisciplinary Benchmark for Frontier Scientific Reasoning

📅 2025-11-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing scientific reasoning benchmarks suffer from narrow disciplinary coverage, oversimplified answer formats, and susceptibility to data contamination, limiting their ability to assess large language models’ complex, real-world scientific reasoning capabilities. To address these limitations, we introduce ATLAS—a cross-disciplinary, high-difficulty evaluation benchmark spanning seven core scientific domains (e.g., mathematics, physics, chemistry), comprising ~800 expert-crafted, original questions requiring multi-step reasoning, LaTeX-formatted expressions, and domain-specific knowledge. We propose a novel evaluation framework ensuring contamination resistance, cross-disciplinary validity, and high-quality automated scoring via an LLM-based adjudication panel validated through expert adversarial review. Experiments demonstrate that ATLAS effectively discriminates between state-of-the-art models’ scientific reasoning proficiencies. We envision ATLAS as an open, long-term, community-driven benchmark for assessing AGI-level scientific reasoning capabilities.

📝 Abstract
The rapid advancement of Large Language Models (LLMs) has led to performance saturation on many established benchmarks, raising doubts about their ability to distinguish frontier models. Concurrently, existing high-difficulty benchmarks often suffer from narrow disciplinary focus, oversimplified answer formats, and vulnerability to data contamination, creating a fidelity gap with real-world scientific inquiry. To address these challenges, we introduce ATLAS (AGI-Oriented Testbed for Logical Application in Science), a large-scale, high-difficulty, and cross-disciplinary evaluation suite composed of approximately 800 original problems. Developed by domain experts (PhD-level and above), ATLAS spans seven core scientific fields: mathematics, physics, chemistry, biology, computer science, earth science, and materials science. Its key features include: (1) High Originality and Contamination Resistance, with all questions newly created or substantially adapted to prevent test data leakage; (2) Cross-Disciplinary Focus, designed to assess models' ability to integrate knowledge and reason across scientific domains; (3) High-Fidelity Answers, prioritizing complex, open-ended answers involving multi-step reasoning and LaTeX-formatted expressions over simple multiple-choice questions; and (4) Rigorous Quality Control, employing a multi-stage process of expert peer review and adversarial testing to ensure question difficulty, scientific value, and correctness. We also propose a robust evaluation paradigm using a panel of LLM judges for automated, nuanced assessment of complex answers. Preliminary results on leading models demonstrate ATLAS's effectiveness in differentiating their advanced scientific reasoning capabilities. We plan to develop ATLAS into a long-term, open, community-driven platform to provide a reliable "ruler" for progress toward Artificial General Intelligence.
Problem

Research questions and friction points this paper is trying to address.

Existing benchmarks fail to distinguish frontier AI model capabilities
Current evaluations lack cross-disciplinary scientific reasoning assessment
Simplified answer formats create fidelity gaps with real scientific inquiry
Innovation

Methods, ideas, or system contributions that make the work stand out.

Created a multidisciplinary benchmark with 800 original problems
Employs expert peer review and adversarial testing for quality
Uses LLM judges for automated assessment of complex answers
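The panel-of-judges evaluation described above can be sketched as a simple majority vote over independent verdicts. This is a minimal illustration, not the paper's actual implementation: the stub judge functions stand in for real LLM calls, and the `panel_verdict` name and verdict labels are assumptions.

```python
from collections import Counter
from typing import Callable, List

# A judge maps (question, reference_answer, model_answer) -> a verdict string.
Judge = Callable[[str, str, str], str]

def panel_verdict(judges: List[Judge], question: str,
                  reference: str, answer: str) -> str:
    """Collect one verdict per judge and return the majority verdict."""
    verdicts = [judge(question, reference, answer) for judge in judges]
    return Counter(verdicts).most_common(1)[0][0]

# Stub judges standing in for LLM calls (hypothetical heuristics).
lenient: Judge = lambda q, r, a: "correct" if r in a else "incorrect"
strict: Judge = lambda q, r, a: "correct" if a.strip() == r else "incorrect"
exact: Judge = lambda q, r, a: "correct" if a == r else "incorrect"

print(panel_verdict([lenient, strict, exact],
                    "Compute 6*7.", "42", "42 "))
```

In practice each judge would be a separate LLM prompted with the question, the expert reference answer, and the model's free-form response; majority voting over several judges reduces the variance of any single grader.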
Hongwei Liu
Shanghai AI Laboratory
Junnan Liu
Shanghai AI Laboratory
Shudong Liu
University of Macau
Natural Language Processing, Large Language Models
Haodong Duan
Shanghai AI Lab | CUHK | PKU
Computer Vision, Video Understanding, Multimodal Learning, Generative AI
Yuqiang Li
Central South University
Internal Combustion Engine, Combustion, Emissions, Mechanism
Mao Su
Shanghai AI Laboratory
Physics, AI
Xiaohong Liu
Shanghai AI Laboratory
Guangtao Zhai
Professor, IEEE Fellow, Shanghai Jiao Tong University
Multimedia Signal Processing, Visual Quality Assessment, QoE, AI Evaluation, Displays
Xinyu Fang
Shanghai AI Laboratory
Qianhong Ma
Shanghai AI Laboratory
Taolin Zhang
Hefei University of Technology
LLM, VLLM, Deep Learning
Zihan Ma
Xi'an Jiaotong University
NLP, Social Network, Multi Modal Learning
Yufeng Zhao
Shanghai AI Laboratory
Peiheng Zhou
Shanghai AI Laboratory
Linchen Xiao
Shanghai AI Laboratory
Wenlong Zhang
Shanghai AI Laboratory
Shijie Zhou
Shanghai AI Laboratory
Xingjian Ma
Shanghai AI Laboratory
Siqi Sun
Shanghai AI Laboratory
Jiaye Ge
Shanghai AI Laboratory
Meng Li
Shanghai AI Laboratory
Yuhong Liu
Santa Clara University
Trustworthy AI, Security and Privacy, IoT, Blockchain, Social network
Jianxin Dong
Shanghai AI Laboratory
Jiaying Li
Shanghai AI Laboratory
Hui Wu
Shanghai AI Laboratory