🤖 AI Summary
This study investigates the modular replaceability of large language model (LLM)-driven autonomous agents in drug discovery, specifically whether the LLM backbone and the agent architecture (tool-calling vs. code-generation) can be decoupled and independently substituted. We systematically evaluate leading models (Claude, GPT, Llama) across diverse drug discovery subtasks and employ an LLM-as-a-judge for automated performance assessment. Results show: (1) model performance is highly task-dependent, with Claude-3.5/3.7-Sonnet and GPT-4o significantly outperforming other backbones; (2) code-generation agents generally surpass tool-calling agents, but this advantage is not universal and depends on the underlying model's capabilities; (3) module replacement is not plug-and-play, as it requires targeted prompt re-engineering. This work provides the first empirical characterization of key constraints and optimization pathways for modular LLM agent design in scientific discovery.
📝 Abstract
Large language models (LLMs) and agentic systems present exciting opportunities to accelerate drug discovery and design. In this study, we critically examine the modularity of LLM-based agentic systems for drug discovery, i.e., whether parts of the agentic system, such as the LLM, are interchangeable, a topic that has received limited attention in drug discovery applications. We compare the performance of different LLMs and the effectiveness of tool-calling agents versus code-generating agents in this domain. Our case study, comparing performance in orchestrating tools for chemistry and drug discovery using an LLM-as-a-judge score, shows that Claude-3.5-Sonnet, Claude-3.7-Sonnet, and GPT-4o outperform alternative language models such as Llama-3.1-8B, Llama-3.1-70B, GPT-3.5-Turbo, and Nova-Micro. Although we confirm that code-generating agents outperform tool-calling agents on average, we show that this advantage is highly question- and model-dependent. Furthermore, the impact of replacing system prompts depends on the specific question asked and the model used, underscoring that, even in this particular domain, one cannot simply swap out language models without considering prompt re-engineering. Our study highlights the necessity of further research into the modularity of agentic systems to enable the development of stable and scalable solutions for real-world problems.