🤖 AI Summary
Existing self-driving laboratories (SDLs) rely on static experimental protocols, limiting their ability to emulate scientists’ adaptive reasoning and intuition in dynamic environments. Method: We propose AILA, the first large language model (LLM)-based autonomous agent system for end-to-end atomic force microscopy (AFM) experimentation, encompassing experimental design, execution, analysis, and closed-loop decision-making. Contribution/Results: We introduce AFMBench, the first benchmark for evaluating LLMs in AFM-driven scientific discovery. It uncovers critical deficiencies in multi-agent coordination (73% failure rate), instruction following, and safety alignment, while empirically delineating the boundaries of LLMs’ scientific reasoning. Leveraging task-decomposition prompting, hardware interface integration, and a multi-agent architecture, AILA achieves autonomous AFM calibration, high-resolution feature identification, and nanomechanical property quantification. Results further reveal substantial accuracy degradation even on foundational tasks such as document retrieval, underscoring robustness and trustworthiness as central challenges in AI for Science.
📝 Abstract
The emergence of large language models (LLMs) has accelerated the development of self-driving laboratories (SDLs) for materials research. Despite their transformative potential, current SDL implementations rely on rigid, predefined protocols that limit their adaptability to dynamic experimental scenarios across different labs. A significant challenge persists in measuring how effectively AI agents can replicate the adaptive decision-making and experimental intuition of expert scientists. Here, we introduce AILA (Artificially Intelligent Lab Assistant), a framework that automates atomic force microscopy (AFM) through LLM-driven agents. Using AFM as an experimental testbed, we develop AFMBench, a comprehensive evaluation suite that challenges AI agents based on language models such as GPT-4o and GPT-3.5 to perform tasks spanning the scientific workflow, from experimental design to results analysis. Our systematic assessment shows that state-of-the-art language models struggle even with basic tasks such as documentation retrieval, and that performance declines significantly in multi-agent coordination scenarios. Further, we observe that LLMs tend not to adhere to instructions, and even digress into additional tasks beyond the original request, raising serious concerns about the safety alignment of AI agents for SDLs. Finally, we demonstrate AILA on increasingly complex, open-ended experiments: automated AFM calibration, high-resolution feature detection, and mechanical property measurement. Our findings emphasize the necessity of stringent benchmarking protocols before deploying AI agents as laboratory assistants across scientific disciplines.