🤖 AI Summary
This study investigates the modular replaceability of large language model (LLM)-driven autonomous agents in drug discovery, specifically whether the LLM backbone and the agent architecture (tool-calling vs. code-generation) can be decoupled and independently substituted. We systematically evaluate leading models (Claude, GPT, Llama) across diverse drug discovery subtasks and employ an LLM-as-a-judge for automated performance assessment. Results show: (1) model performance is highly task-dependent, with Claude-3.5/3.7-Sonnet and GPT-4o significantly outperforming other backbones; (2) code-generation agents generally surpass tool-calling agents, but this advantage is not universal and depends on the underlying model's capabilities; (3) module replacement is not plug-and-play, as it requires targeted prompt re-engineering. This work provides the first empirical characterization of key constraints and optimization pathways for modular LLM agent design in scientific discovery.
📝 Abstract
Large language models (LLMs) and agentic systems present exciting opportunities to accelerate drug discovery and design. In this study, we critically examine the modularity of LLM-based agentic systems for drug discovery, i.e., whether parts of the agentic system, such as the LLM, are interchangeable, a topic that has received limited attention in drug discovery applications. We compare the performance of different LLMs and the effectiveness of tool-calling agents versus code-generating agents in this domain. Our case study, comparing performance in orchestrating tools for chemistry and drug discovery using an LLM-as-a-judge score, shows that Claude-3.5-Sonnet, Claude-3.7-Sonnet, and GPT-4o outperform alternative language models such as Llama-3.1-8B, Llama-3.1-70B, GPT-3.5-Turbo, and Nova-Micro. Although we confirm that code-generating agents outperform tool-calling agents on average, we show that this advantage is highly question- and model-dependent. Furthermore, the impact of replacing system prompts depends on the specific question asked and the model used, underscoring that, even in this particular domain, one cannot simply swap out language models without considering prompt re-engineering. Our study highlights the necessity of further research into the modularity of agentic systems to enable the development of stable and scalable solutions for real-world problems.