🤖 AI Summary
Current medical large language models (LLMs) are limited to text-only interaction and lack multimodal tool-calling capabilities, while clinical agents suffer from poor generalization across diverse clinical scenarios. To address these limitations, we propose a reflection-aware, tool-augmented agent featuring a novel two-stage reflection mechanism: (1) an optimization stage that constructs long-term tool-use experience memory, and (2) a reasoning stage that integrates retrieval with dual-path verification—iterative refinement and candidate filtering—to enable adaptive, cross-scenario tool selection. We evaluate our method on ClinicalAgent Bench (CAB), a newly constructed 18-task, multidimensional clinical agent benchmark. On CAB, our approach achieves more than a 10-point improvement over vanilla LLMs and outperforms state-of-the-art agents by 3 points, significantly enhancing accuracy and robustness in complex clinical decision-making tasks.
📝 Abstract
Large Language Models (LLMs) have shown promising potential in the medical domain, assisting with tasks like clinical note generation and patient communication. However, current LLMs are limited to text-based communication, hindering their ability to interact with diverse forms of information in clinical environments. Although clinical agents succeed at interacting with diverse signals, they are typically tailored to a single clinical scenario and hence fail in broader applications. To evaluate clinical agents holistically, we propose ClinicalAgent Bench (CAB), a comprehensive medical agent benchmark consisting of 18 tasks across five key realistic clinical dimensions. Building on this, we introduce ReflecTool, a novel framework that excels at utilizing domain-specific tools through two stages. The first, an optimization stage, progressively enlarges a long-term memory by saving the agent's successful solving processes and tool-wise experience on a small pre-defined training set. In the subsequent inference stage, ReflecTool searches the built long-term memory for supportive successful demonstrations to guide its tool selection strategy, while a verifier improves tool usage according to the tool-wise experience via two verification methods: iterative refinement and candidate selection. Extensive experiments on the ClinicalAgent Bench demonstrate that ReflecTool surpasses pure LLMs by more than 10 points and well-established agent-based methods by 3 points, highlighting its adaptability and effectiveness in solving complex clinical tasks.
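The two-stage design described above can be sketched in miniature. The following is a hypothetical, simplified illustration of the abstract's workflow, not the paper's actual implementation: `LongTermMemory` stands in for the optimization stage (storing successful solving traces and tool-wise experience), `retrieve` for demonstration search at inference time, and `verify` for the dual-path verifier (candidate selection plus iterative refinement). All class and function names here are illustrative assumptions.

```python
# Minimal sketch of a ReflecTool-style two-stage agent loop.
# All names and data structures are illustrative, not from the paper.
from collections import Counter
import math


def similarity(a: str, b: str) -> float:
    """Cosine similarity over bag-of-words vectors (toy retrieval metric)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0


class LongTermMemory:
    """Optimization stage: save successful solving traces and tool-wise
    experience collected on a small pre-defined training set."""

    def __init__(self):
        self.entries = []  # list of (task, trace, tool_experience)

    def add(self, task: str, trace: str, tool_experience: str) -> None:
        self.entries.append((task, trace, tool_experience))

    def retrieve(self, query: str, k: int = 1):
        """Inference stage: fetch the k most similar past successes to use
        as supportive demonstrations for tool selection."""
        ranked = sorted(self.entries,
                        key=lambda e: similarity(query, e[0]),
                        reverse=True)
        return ranked[:k]


def verify(candidates, score_fn, threshold=0.0, max_rounds=3, refine_fn=None):
    """Dual-path verification: candidate selection picks the best-scoring
    tool call; iterative refinement retries a below-threshold result."""
    best = max(candidates, key=score_fn)  # candidate selection
    rounds = 0
    while refine_fn and score_fn(best) < threshold and rounds < max_rounds:
        best = refine_fn(best)  # iterative refinement
        rounds += 1
    return best
```

In this toy version, retrieval ranks stored traces by lexical overlap with the new task, and the verifier first selects the best candidate, then refines it only while its score stays below a threshold; a real system would replace both the similarity metric and the score function with LLM-based judgments.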