ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows

📅 2025-05-26
📈 Citations: 0 · Influential: 0
🤖 AI Summary
Problem: Multimodal autonomous agents remain unreliable in real-world scientific workflows. Method: This paper introduces the first multimodal agent evaluation platform designed specifically for scientific practice, featuring a dynamic, visualization-rich environment and 169 human-validated, cross-domain tasks (biochemistry, astronomy, GIS) that support GUI, API, and CLI interaction modalities. The accompanying benchmarking framework integrates OS-level interaction tracing, realistic scientific workflow modeling, and rigorous human verification. The platform evaluates state-of-the-art models (including GPT-4o, Claude 3.7, and UI-TARS) across GUI automation, API invocation, CLI execution, and joint vision-language reasoning. Contribution/Results: Even the best current agents achieve only a 15% overall task success rate, exposing fundamental limitations in long-horizon planning, toolchain orchestration, and domain-knowledge transfer. The platform establishes a new evaluation paradigm that balances complexity, domain diversity, and reproducibility, and provides standardized diagnostics and actionable improvement pathways.
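
To ground the setup described above, the sketch below shows what an episode-level evaluation loop over such an environment could look like. It is a minimal illustration under stated assumptions: the `StubEnv` class, the `Task` fields, and the observation/action dictionaries are hypothetical stand-ins, not ScienceBoard's actual API.

```python
# Hypothetical sketch of a ScienceBoard-style evaluation loop.
# The environment class, task schema, and action format below are
# illustrative assumptions, not the platform's actual interfaces.
from dataclasses import dataclass


@dataclass
class Task:
    task_id: str
    domain: str          # e.g. "biochemistry", "astronomy", "geoinformatics"
    instruction: str     # natural-language goal given to the agent
    max_steps: int = 15  # step budget, stressing long-horizon planning


@dataclass
class StubEnv:
    """Minimal stand-in for a VM-backed scientific environment."""
    task: Task
    step_count: int = 0

    def reset(self) -> dict:
        self.step_count = 0
        # A real environment would return a screenshot plus structured state.
        return {"screenshot": b"", "a11y_tree": {}, "instruction": self.task.instruction}

    def step(self, action: dict) -> tuple[dict, bool]:
        """Execute one GUI/API/CLI action; return (observation, done)."""
        self.step_count += 1
        done = action["type"] == "done" or self.step_count >= self.task.max_steps
        return {"screenshot": b"", "a11y_tree": {}}, done

    def evaluate(self) -> bool:
        """A real evaluator would inspect final software/file state."""
        return False


def run_episode(env: StubEnv, agent) -> bool:
    obs = env.reset()
    done = False
    while not done:
        action = agent(obs)  # e.g. {"type": "click", "x": 120, "y": 48}
        obs, done = env.step(action)
    return env.evaluate()


def success_rate(results: list[bool]) -> float:
    return sum(results) / len(results) if results else 0.0
```

A design point benchmarks of this kind typically share: an episode is scored by checking the final environment state rather than the agent's self-report, which is what makes an aggregate figure like the 15% success rate reproducible. The stub's `evaluate` method marks where such a checker would plug in.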

📝 Abstract
Large Language Models (LLMs) have extended their impact beyond Natural Language Processing, substantially fostering the development of interdisciplinary research. Recently, various LLM-based agents have been developed to assist scientific discovery progress across multiple aspects and domains. Among these, computer-using agents, capable of interacting with operating systems as humans do, are paving the way to automated scientific problem-solving and addressing routines in researchers' workflows. Recognizing the transformative potential of these agents, we introduce ScienceBoard, which encompasses two complementary contributions: (i) a realistic, multi-domain environment featuring dynamic and visually rich scientific workflows with integrated professional software, where agents can autonomously interact via different interfaces to accelerate complex research tasks and experiments; and (ii) a challenging benchmark of 169 high-quality, rigorously validated real-world tasks curated by humans, spanning scientific-discovery workflows in domains such as biochemistry, astronomy, and geoinformatics. Extensive evaluations of agents with state-of-the-art backbones (e.g., GPT-4o, Claude 3.7, UI-TARS) show that, despite some promising results, they still fall short of reliably assisting scientists in complex workflows, achieving only a 15% overall success rate. In-depth analysis further provides valuable insights for addressing current agent limitations and more effective design principles, paving the way to build more capable agents for scientific discovery. Our code, environment, and benchmark are at https://qiushisun.github.io/ScienceBoard-Home/.
Problem

Research questions and friction points this paper is trying to address.

Evaluating autonomous agents in realistic scientific workflows
Assessing LLM-based agents for interdisciplinary research tasks
Benchmarking agent performance in complex scientific discovery processes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal agents that interact with scientific software through GUI, API, and CLI interfaces (a hypothetical dispatcher is sketched below)
Dynamic environment integrating professional scientific software
Benchmark of 169 human-validated, real-world scientific tasks
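
As referenced in the first innovation item, the sketch below illustrates how a single agent action could be routed to the GUI, API, or CLI modality. The action schema and the `dispatch` function are assumptions made for illustration; only the `subprocess` call is real standard-library usage.

```python
# Hypothetical dispatcher for the three interaction modalities (GUI, API,
# CLI) described in the paper. Payload fields are illustrative assumptions,
# not ScienceBoard's actual interface.
import subprocess


def dispatch(action: dict) -> str:
    """Route one agent action to the matching modality."""
    kind = action["type"]
    if kind == "gui":
        # A real harness would drive the VM's display (e.g. via a VNC
        # connection); here we only describe the intended click.
        return f"click at ({action['x']}, {action['y']})"
    if kind == "api":
        # Professional software may expose a programmatic API that the
        # harness would invoke inside the VM.
        return f"call {action['endpoint']} with {action.get('args', {})}"
    if kind == "cli":
        # CLI actions run as shell commands; stdout feeds the next observation.
        result = subprocess.run(
            action["command"], shell=True, capture_output=True, text=True
        )
        return result.stdout
    raise ValueError(f"unknown action type: {kind}")
```

In a real harness, the GUI and API branches would act on the virtualized software rather than returning strings; they are kept descriptive here so the sketch stays self-contained.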