ATLAS: A High-Difficulty, Multidisciplinary Benchmark for Frontier Scientific Reasoning

📅 2025-11-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing scientific reasoning benchmarks suffer from narrow disciplinary coverage, oversimplified answer formats, and susceptibility to data contamination, limiting their ability to assess large language models’ complex, real-world scientific reasoning capabilities. To address these limitations, we introduce ATLAS—a cross-disciplinary, high-difficulty evaluation benchmark spanning seven core scientific domains (e.g., mathematics, physics, chemistry), comprising ~800 expert-crafted, original questions requiring multi-step reasoning, LaTeX-formatted expressions, and domain-specific knowledge. We propose a novel evaluation framework ensuring contamination resistance, cross-disciplinary validity, and high-quality automated scoring via an LLM-based adjudication panel validated through expert adversarial review. Experiments demonstrate that ATLAS effectively discriminates between state-of-the-art models’ scientific reasoning proficiencies. We envision ATLAS as an open, long-term, community-driven benchmark for assessing AGI-level scientific reasoning capabilities.

📝 Abstract
The rapid advancement of Large Language Models (LLMs) has led to performance saturation on many established benchmarks, raising doubts about their ability to distinguish frontier models. Concurrently, existing high-difficulty benchmarks often suffer from narrow disciplinary focus, oversimplified answer formats, and vulnerability to data contamination, creating a fidelity gap with real-world scientific inquiry. To address these challenges, we introduce ATLAS (AGI-Oriented Testbed for Logical Application in Science), a large-scale, high-difficulty, and cross-disciplinary evaluation suite composed of approximately 800 original problems. Developed by domain experts (PhD-level and above), ATLAS spans seven core scientific fields: mathematics, physics, chemistry, biology, computer science, earth science, and materials science. Its key features include: (1) High Originality and Contamination Resistance, with all questions newly created or substantially adapted to prevent test data leakage; (2) Cross-Disciplinary Focus, designed to assess models' ability to integrate knowledge and reason across scientific domains; (3) High-Fidelity Answers, prioritizing complex, open-ended answers involving multi-step reasoning and LaTeX-formatted expressions over simple multiple-choice questions; and (4) Rigorous Quality Control, employing a multi-stage process of expert peer review and adversarial testing to ensure question difficulty, scientific value, and correctness. We also propose a robust evaluation paradigm using a panel of LLM judges for automated, nuanced assessment of complex answers. Preliminary results on leading models demonstrate ATLAS's effectiveness in differentiating their advanced scientific reasoning capabilities. We plan to develop ATLAS into a long-term, open, community-driven platform to provide a reliable "ruler" for progress toward Artificial General Intelligence.
Problem

Research questions and friction points this paper is trying to address.

Existing benchmarks fail to distinguish frontier AI model capabilities
Current evaluations lack cross-disciplinary scientific reasoning assessment
Simplified answer formats create fidelity gaps with real scientific inquiry
Innovation

Methods, ideas, or system contributions that make the work stand out.

Created a multidisciplinary benchmark with 800 original problems
Employs expert peer review and adversarial testing for quality
Uses LLM judges for automated assessment of complex answers
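The panel-of-judges evaluation described above can be sketched as a simple majority vote over independent verdicts. This is a minimal illustration, not the paper's actual implementation: the stub judge functions stand in for real LLM calls, and the `panel_verdict` name and verdict labels are assumptions.

```python
from collections import Counter
from typing import Callable, List

# A judge maps (question, reference_answer, model_answer) -> a verdict string.
Judge = Callable[[str, str, str], str]

def panel_verdict(judges: List[Judge], question: str,
                  reference: str, answer: str) -> str:
    """Collect one verdict per judge and return the majority verdict."""
    verdicts = [judge(question, reference, answer) for judge in judges]
    return Counter(verdicts).most_common(1)[0][0]

# Stub judges standing in for LLM calls (hypothetical heuristics).
lenient: Judge = lambda q, r, a: "correct" if r in a else "incorrect"
strict: Judge = lambda q, r, a: "correct" if a.strip() == r else "incorrect"
exact: Judge = lambda q, r, a: "correct" if a == r else "incorrect"

print(panel_verdict([lenient, strict, exact],
                    "Compute 6*7.", "42", "42 "))
```

In practice each judge would be a separate LLM prompted with the question, the expert reference answer, and the model's free-form response; majority voting over several judges reduces the variance of any single grader.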
Hongwei Liu
Shanghai AI Laboratory
Junnan Liu
Shanghai AI Laboratory
Shudong Liu
University of Macau
Natural Language Processing, Large Language Models
Haodong Duan
Shanghai AI Lab | CUHK | PKU
Computer Vision, Video Understanding, Multimodal Learning, Generative AI
Yuqiang Li
Central South University
Internal Combustion Engine, Combustion, Emissions, Mechanism
Mao Su
Shanghai AI Laboratory
Physics, AI
Xiaohong Liu
Shanghai AI Laboratory
Guangtao Zhai
Professor, IEEE Fellow, Shanghai Jiao Tong University
Multimedia Signal Processing, Visual Quality Assessment, QoE, AI Evaluation, Displays
Xinyu Fang
Shanghai AI Laboratory
Qianhong Ma
Shanghai AI Laboratory
Taolin Zhang
Hefei University of Technology
LLM, VLLM, Deep Learning
Zihan Ma
Xi'an Jiaotong University
NLP, Social Network, Multi Modal Learning
Yufeng Zhao
Shanghai AI Laboratory
Peiheng Zhou
Shanghai AI Laboratory
Linchen Xiao
Shanghai AI Laboratory
Wenlong Zhang
Shanghai AI Laboratory
Shijie Zhou
Shanghai AI Laboratory
Xingjian Ma
Shanghai AI Laboratory
Siqi Sun
Shanghai AI Laboratory
Jiaye Ge
Shanghai AI Laboratory
Meng Li
Shanghai AI Laboratory
Yuhong Liu
Santa Clara University
Trustworthy AI, Security and Privacy, IoT, Blockchain, Social network
Jianxin Dong
Shanghai AI Laboratory
Jiaying Li
Shanghai AI Laboratory
Hui Wu
Shanghai AI Laboratory