🤖 AI Summary
This work addresses the lack of mechanistic understanding and systematic evaluation of tool-augmented large language models (LLMs) in chemistry. We introduce ChemAgent, a chemistry-specific agent, and use it to compare performance across two distinct task categories: expert-level synthetic route prediction and general chemistry question answering. Methodologically, ChemAgent extends the ChemCrow framework by integrating domain-specific tools (including molecular modeling and reaction prediction) and by incorporating multi-step reasoning chains, dynamic tool selection, and human-in-the-loop verification. Our key contribution is the first expert-driven error analysis, which reveals that tool augmentation is not universally beneficial: while it substantially improves synthetic prediction accuracy, the base LLM outperforms its tool-augmented counterpart on general chemistry QA tasks by up to 18.3% in absolute accuracy. This indicates that, for such questions, reasoning correctly with chemistry knowledge can matter more than the ability to invoke tools, challenging the prevailing assumption that tool integration inherently enhances LLM performance in chemistry.
📝 Abstract
To enhance large language models (LLMs) for chemistry problem solving, several LLM-based agents augmented with tools have been proposed, such as ChemCrow and Coscientist. However, their evaluations are narrow in scope, leaving a large gap in understanding the benefits of tools across diverse chemistry tasks. To bridge this gap, we develop ChemAgent, an enhanced chemistry agent built on ChemCrow, and conduct a comprehensive evaluation of its performance on both specialized chemistry tasks and general chemistry questions. Surprisingly, ChemAgent does not consistently outperform its base LLMs without tools. Our error analysis with a chemistry expert suggests that for specialized chemistry tasks, such as synthesis prediction, agents should be augmented with specialized tools; for general chemistry questions like those in exams, however, agents' ability to reason correctly with chemistry knowledge matters more, and tool augmentation does not always help.