SciMaster: Towards General-Purpose Scientific AI Agents, Part I. X-Master as Foundation: Can We Lead on Humanity's Last Exam?

📅 2025-07-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Existing general-purpose AI agents for scientific domains lack systematic reasoning architectures capable of addressing frontier scientific challenges. Method: This paper introduces X-Master, a tool-augmented reasoning agent, and X-Masters, a scattered-and-stacked agentic workflow built on top of it. X-Master treats code as its interaction language: during reasoning it can call built-in Python libraries and the authors' customized tools. X-Masters scales this agent to increase both the breadth and the depth of reasoning. Contribution/Results: On Humanity's Last Exam (HLE), X-Masters scores 32.1%, a new state of the art that surpasses OpenAI's and Google's Deep Research (26.6% and 26.9%) and is the first result to exceed the 30% threshold. The solution is open-source.

📝 Abstract

The rapid advancements of AI agents have ignited the long-held ambition of leveraging them to accelerate scientific discovery. Achieving this goal requires a deep understanding of the frontiers of human knowledge. As such, Humanity's Last Exam (HLE) provides an exceptionally challenging touchstone for evaluating scientific AI agents. In this work, we aim to construct the foundational architecture for general-purpose agents and validate the capabilities through leading performance on HLE. To achieve this, we introduce X-Master, a tool-augmented reasoning agent designed to emulate human researchers by interacting flexibly with external tools during its reasoning process. This agent, guided by the conceptualization of code as an interaction language, can flexibly leverage built-in Python libraries and our customized tools to augment the reasoning. We further scale its capabilities through X-Masters, a scattered-and-stacked agentic workflow that systematically enhances breadth and depth of reasoning. Our open-source solution, X-Masters, sets a new state-of-the-art record on HLE with a score of 32.1%, surpassing OpenAI's and Google's Deep Research (26.6% and 26.9%) and becoming the first to exceed the 30% threshold. This work allows us to gain a deeper understanding of complex task-solving and accumulates valuable experience that can inform future advancements, guiding subsequent model training.
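The idea of code as an interaction language can be pictured as a loop in which the agent's reasoning is interrupted whenever it emits code, the code is executed, and the output is fed back into the context. The sketch below is only an illustration under stated assumptions, not the authors' implementation: the `<code>` and `<result>` delimiters, the `solve` and `run_snippet` helpers, and the scripted stand-in model are all hypothetical, and the snippet runs without any sandboxing.

```python
import contextlib
import io
import re

# Hypothetical delimiters; the abstract does not say how code is marked.
CODE_PATTERN = re.compile(r"<code>(.*?)</code>", re.DOTALL)
RESULT_PATTERN = re.compile(r"<result>(.*?)</result>", re.DOTALL)


def run_snippet(source: str) -> str:
    """Execute a Python snippet and return whatever it printed."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(source, {})  # no sandboxing: illustration only
    except Exception as exc:
        # Errors are returned as text so the model could react to them.
        return f"{type(exc).__name__}: {exc}"
    return buffer.getvalue().strip()


def solve(question: str, model, max_turns: int = 4) -> str:
    """Alternate between model reasoning and code execution."""
    context = question
    for _ in range(max_turns):
        reply = model(context)
        match = CODE_PATTERN.search(reply)
        if match is None:
            return reply  # no code emitted: treat the reply as final
        result = run_snippet(match.group(1))
        context += f"\n{reply}\n<result>{result}</result>"
    return "no answer within turn limit"


def scripted_model(context: str) -> str:
    """Stand-in for an LLM: requests a computation, then reads its result."""
    found = RESULT_PATTERN.search(context)
    if found is None:
        return "Let me compute this. <code>import math\nprint(math.comb(10, 3))</code>"
    return f"The answer is {found.group(1)}."


question = "How many 3-element subsets does a 10-element set have?"
print(solve(question, scripted_model))
```

With the scripted model, the first turn emits a snippet that prints `math.comb(10, 3)`, and the second turn reads the result back and answers "The answer is 120." A real agent would replace `scripted_model` with a language model call and would need an isolated execution environment.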

Problem

Research questions and friction points this paper is trying to address.

Developing general-purpose AI agents that accelerate scientific discovery

Evaluating scientific AI agents on the Humanity's Last Exam (HLE) benchmark

Enhancing reasoning with a tool-augmented agent and a scalable agentic workflow

Innovation

Methods, ideas, or system contributions that make the work stand out.

Tool-augmented reasoning agent X-Master

Code as interaction language for flexibility

Scattered-and-stacked workflow enhances breadth and depth of reasoning
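The page does not describe the stages of the scattered-and-stacked workflow, so the following is a loose sketch under explicit assumptions rather than the X-Masters design. It assumes that "scattered" means running several independent attempts (breadth) and "stacked" means passing all candidates to a later stage that produces one answer (depth). The names `scatter`, `stack`, `toy_solver`, and `majority_vote` are hypothetical, and majority voting is a simple stand-in for whatever aggregation the authors actually use.

```python
from collections import Counter
from typing import Callable, List

Solver = Callable[[str, int], str]
Aggregator = Callable[[str, List[str]], str]


def scatter(question: str, solver: Solver, n_attempts: int) -> List[str]:
    """Breadth: collect several independent candidate answers."""
    return [solver(question, seed) for seed in range(n_attempts)]


def stack(question: str, candidates: List[str], aggregator: Aggregator) -> str:
    """Depth: a later stage sees all candidates and returns one answer."""
    return aggregator(question, candidates)


def scattered_and_stacked(
    question: str, solver: Solver, aggregator: Aggregator, n_attempts: int = 5
) -> str:
    candidates = scatter(question, solver, n_attempts)
    return stack(question, candidates, aggregator)


def toy_solver(question: str, seed: int) -> str:
    """Stand-in for one agent run; every third attempt is wrong."""
    return "41" if seed % 3 == 0 else "42"


def majority_vote(question: str, candidates: List[str]) -> str:
    """Stand-in aggregator (assumption): pick the most common candidate."""
    return Counter(candidates).most_common(1)[0][0]


print(scattered_and_stacked("toy question", toy_solver, majority_vote))
```

With five attempts the toy solver returns "42" three times and "41" twice, so the vote selects "42". The point of the sketch is only the shape of the pipeline: a single wrong attempt does not decide the outcome once several attempts are combined.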

🔎 Similar Papers

Prioritizing Safeguarding Over Autonomy: Risks of LLM Agents for Science