SciMaster: Towards General-Purpose Scientific AI Agents, Part I. X-Master as Foundation: Can We Lead on Humanity's Last Exam?

📅 2025-07-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Existing general-purpose AI agents for scientific domains lack reasoning architectures systematic enough to address frontier scientific challenges. Method: This paper introduces X-Master, a tool-augmented reasoning agent, and X-Masters, a scattered-and-stacked agentic workflow built on it. The agent treats code as its interaction language, invoking built-in Python libraries and customized tools during reasoning, and the workflow increases the breadth and depth of that reasoning. Contribution/Results: On Humanity's Last Exam (HLE), X-Masters scores 32.1%, a new state of the art and the first result above the 30% threshold, ahead of OpenAI's and Google's Deep Research (26.6% and 26.9%). The solution is open-source.

📝 Abstract
The rapid advancements of AI agents have ignited the long-held ambition of leveraging them to accelerate scientific discovery. Achieving this goal requires a deep understanding of the frontiers of human knowledge. As such, Humanity's Last Exam (HLE) provides an exceptionally challenging touchstone for evaluating scientific AI agents. In this work, we aim to construct the foundational architecture for general-purpose agents and validate the capabilities through leading performance on HLE. To achieve this, we introduce X-Master, a tool-augmented reasoning agent designed to emulate human researchers by interacting flexibly with external tools during its reasoning process. This agent, guided by the conceptualization of code as an interaction language, can flexibly leverage built-in Python libraries and our customized tools to augment the reasoning. We further scale its capabilities through X-Masters, a scattered-and-stacked agentic workflow that systematically enhances breadth and depth of reasoning. Our open-source solution, X-Masters, sets a new state-of-the-art record on HLE with a score of 32.1%, surpassing OpenAI's and Google's Deep Research (26.6% and 26.9%) and becoming the first to exceed the 30% threshold. This work allows us to gain a deeper understanding of complex task-solving and accumulates valuable experience that can inform future advancements, guiding subsequent model training.
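
The idea of code as an interaction language can be illustrated with a minimal sketch: the model emits a Python snippet mid-reasoning, the snippet is executed, and its output is appended to the context before the model continues. Everything below is an assumption made for illustration, including the `BEGIN_CODE`/`END_CODE` markers, the `RESULT:` prefix, the function names, and the scripted `toy_model` that stands in for a real language model. It is not the X-Master implementation, and a real system would run snippets in an isolated sandbox rather than with a bare `exec`.

```python
import contextlib
import io
import re

# Hypothetical markers the model uses to delimit code it wants executed.
CODE_BLOCK = re.compile(r"BEGIN_CODE\n(.*?)\nEND_CODE", re.DOTALL)


def run_snippet(source: str) -> str:
    """Execute a Python snippet and return what it printed, or the error text."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(source, {"__name__": "__sandbox__"})  # illustration only, not a real sandbox
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return buffer.getvalue().strip()


def solve(question: str, model, max_turns: int = 4) -> str:
    """Alternate between model reasoning and code execution until no code is emitted."""
    transcript = question
    reply = ""
    for _ in range(max_turns):
        reply = model(transcript)
        transcript += "\n" + reply
        match = CODE_BLOCK.search(reply)
        if match is None:
            return reply  # no tool call, so treat the reply as the final answer
        transcript += "\nRESULT: " + run_snippet(match.group(1))
    return reply  # turn budget exhausted; return the last reply as-is


def toy_model(transcript: str) -> str:
    """Scripted stand-in for a language model: call a tool once, then answer."""
    if "RESULT:" not in transcript:
        return "Let me compute this.\nBEGIN_CODE\nimport math\nprint(math.factorial(10))\nEND_CODE"
    value = transcript.rsplit("RESULT:", 1)[1].strip()
    return f"Answer: {value}"


if __name__ == "__main__":
    print(solve("What is 10!?", toy_model))
```

Because the tool interface is ordinary Python, any importable library becomes available to the agent without a per-tool schema, which is one plausible reading of the flexibility the abstract describes.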
Problem

Research questions and friction points this paper is trying to address.

Developing general-purpose AI agents that accelerate scientific discovery
Evaluating scientific AI agents on the Humanity's Last Exam (HLE) benchmark
Enhancing reasoning with tool augmentation and scalable agentic workflows
Innovation

Methods, ideas, or system contributions that make the work stand out.

Tool-augmented reasoning agent X-Master
Code as interaction language for flexibility
Scattered-and-stacked workflow enhances reasoning
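
The summary does not spell out the stages of the scattered-and-stacked workflow, so the following is only a rough sketch of the general pattern under stated assumptions: "scatter" is taken to mean several independent attempts run in parallel (breadth), and "stack" a later stage that revisits each attempt with all attempts in view (depth). The `toy_solver`, `toy_refiner`, and majority-vote `select` are hypothetical placeholders, not the paper's components.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def scatter(question, solver, n):
    """Breadth: run n independent attempts in parallel, keeping their order."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        return list(pool.map(lambda seed: solver(question, seed), range(n)))


def stack(question, candidates, refiner):
    """Depth: a later stage revisits each candidate with every candidate in view."""
    return [refiner(question, candidate, candidates) for candidate in candidates]


def select(candidates):
    """Assumed final step: pick the most frequent answer."""
    return Counter(candidates).most_common(1)[0][0]


def scatter_and_stack(question, solver, refiner, n=5):
    drafts = scatter(question, solver, n)
    refined = stack(question, drafts, refiner)
    return select(refined)


def toy_solver(question, seed):
    """Stand-in for one agent run: deliberately wrong on some seeds."""
    return " 41 " if seed % 3 == 0 else " 42 "


def toy_refiner(question, candidate, peers):
    """Stand-in for a refinement stage: here it only normalizes whitespace."""
    return candidate.strip()


if __name__ == "__main__":
    print(scatter_and_stack("toy question", toy_solver, toy_refiner))
```

In this toy run, three of the five drafts agree, so the selection step returns their shared answer. A real workflow would replace each placeholder with a model-backed agent.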
Jingyi Chai
Shanghai Jiao Tong University
Large Language Model, Federated Learning

Shuo Tang
School of Artificial Intelligence, Shanghai Jiao Tong University

Rui Ye
School of Artificial Intelligence, Shanghai Jiao Tong University

Yuwen Du
Shanghai Jiao Tong University
Multi-Agent, Autonomous Driving Simulation

Xinyu Zhu
School of Artificial Intelligence, Shanghai Jiao Tong University

Mengcheng Zhou
School of Artificial Intelligence, Shanghai Jiao Tong University

Yanfeng Wang
Shanghai Jiao Tong University

Weinan E
Professor of Mathematics, Princeton University
Applied mathematics

Siheng Chen
Shanghai Jiao Tong University
Collective intelligence, LLM agent, graph signal processing, collaborative perception