How Well Can Modern LLMs Act as Agent Cores in Radiology Environments?

📅 2024-12-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates the feasibility and limitations of large language models (LLMs) as core reasoning engines for radiology-specific AI agents. Method: We introduce RadA-BenchPlat, a multimodal, multi-anatomic-region, multi-disease benchmark for radiology agents comprising 2,200 radiologist-verified synthetic patient cases and 24,200 question-answer pairs, together with ten categories of tools for agent-driven task solving. To strengthen agent capabilities, we apply four advanced prompt engineering strategies, including prompt-backpropagation and multi-agent collaboration, and explore automated tool building. Contribution/Results: Across seven state-of-the-art LLMs, the strongest model, Claude-3.7-Sonnet, reaches a 67.1% task completion rate on routine radiology tasks, while the prompt engineering strategies improve complex task completion by 48.2% and automated tool construction succeeds in 65.4% of cases. All code and data are publicly released to support reproducible research and methodological advancement in clinical AI agent development.

📝 Abstract
We introduce RadA-BenchPlat, an evaluation platform that benchmarks the performance of large language models (LLMs) acting as agent cores in radiology environments, using 2,200 radiologist-verified synthetic patient records covering six anatomical regions, five imaging modalities, and 2,200 disease scenarios, resulting in 24,200 question-answer pairs that simulate diverse clinical situations. The platform also defines ten categories of tools for agent-driven task solving and evaluates seven leading LLMs, revealing that while models like Claude-3.7-Sonnet can achieve a 67.1% task completion rate in routine settings, they still struggle with complex task understanding and tool coordination, limiting their capacity to serve as the central core of automated radiology systems. By incorporating four advanced prompt engineering strategies--where prompt-backpropagation and multi-agent collaboration contributed 16.8% and 30.7% improvements, respectively--the performance for complex tasks was enhanced by 48.2% overall. Furthermore, automated tool building was explored to improve robustness, achieving a 65.4% success rate, thereby offering promising insights for the future integration of fully automated radiology applications into clinical practice. All of our code and data are openly available at https://github.com/MAGIC-AI4Med/RadABench.
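The abstract does not describe the evaluation at the code level, but the agent-core setup it benchmarks can be pictured roughly as below. This is a minimal, hypothetical sketch, not the RadA-BenchPlat API: the `llm.chat()` client, the `TOOL_REGISTRY` entries, and the case/answer fields are all illustrative assumptions.

```python
# Minimal sketch of evaluating an LLM as an agent core over tool-use tasks.
# Hypothetical names throughout: llm.chat(), TOOL_REGISTRY, and the case
# schema are illustrative assumptions, not the RadA-BenchPlat interface.
import json

# A tiny stand-in for the ten tool categories defined by the platform.
TOOL_REGISTRY = {
    "measure_lesion": lambda image_id, region: {"diameter_mm": 12.3},
    "retrieve_prior_report": lambda patient_id: "Prior CT report text ...",
}

def run_agent(llm, case, max_steps=5):
    """Let the model choose tools step by step until it emits a final answer."""
    messages = [
        {"role": "system", "content": (
            "You are a radiology agent. Reply with JSON, either "
            '{"tool": <name>, "args": {...}} or {"answer": <text>}.')},
        {"role": "user", "content": case["question"]},
    ]
    for _ in range(max_steps):
        reply = json.loads(llm.chat(messages))
        if "answer" in reply:
            return reply["answer"]
        result = TOOL_REGISTRY[reply["tool"]](**reply["args"])
        messages.append({"role": "assistant", "content": json.dumps(reply)})
        messages.append({"role": "user", "content": f"Tool result: {result}"})
    return None  # ran out of steps without a final answer

def task_completion_rate(llm, cases):
    """Fraction of question-answer pairs the agent answers correctly."""
    return sum(run_agent(llm, c) == c["answer"] for c in cases) / len(cases)
```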
Problem

Research questions and friction points this paper is trying to address.

Can LLMs serve as the central reasoning core of automated radiology systems, evaluated here on radiologist-verified synthetic data?
How well do LLMs handle complex task understanding and multi-step tool coordination in clinical workflows?
Can prompt engineering and automated tool building close the gap toward clinically integrable radiology agents?
Innovation

Methods, ideas, or system contributions that make the work stand out.

RadA-BenchPlat: 2,200 radiologist-verified synthetic patient records and 24,200 question-answer pairs spanning six anatomical regions, five imaging modalities, and ten tool categories
Four advanced prompt engineering strategies, led by prompt-backpropagation (+16.8%) and multi-agent collaboration (+30.7%), improving complex-task performance by 48.2% (see the sketch after this list)
Automated tool building for robustness, achieving a 65.4% success rate
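As a rough illustration of the multi-agent collaboration strategy credited with the largest gain, a planner/executor/reviewer loop could look like the sketch below. The single `llm.chat()` call interface and the prompts are hypothetical assumptions; the paper's actual strategy may differ in structure and wording.

```python
# Hedged sketch of a planner / executor / reviewer collaboration loop.
# llm.chat(messages) is a hypothetical client returning a string; the
# prompts are illustrative, not the paper's actual strategy.
def collaborate(llm, question):
    plan = llm.chat([{"role": "user", "content":
                      f"Decompose this radiology task into ordered steps:\n{question}"}])
    draft = llm.chat([{"role": "user", "content":
                       f"Follow the plan and answer the task.\nPlan:\n{plan}\nTask:\n{question}"}])
    review = llm.chat([{"role": "user", "content":
                        f"Critique this answer for clinical and logical errors.\n"
                        f"Task:\n{question}\nAnswer:\n{draft}"}])
    return llm.chat([{"role": "user", "content":
                      f"Revise the answer using the critique.\nAnswer:\n{draft}\nCritique:\n{review}"}])
```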