🤖 AI Summary
The radiology NLP community lacks systematic evaluation of large language models (LLMs) on clinical report interpretation and impression generation. Method: We introduce the first radiology-specific, multilingual (Chinese–English), unified benchmark for comprehensive assessment of 32 LLMs on generating clinical impressions from imaging findings. Leveraging a standardized real-world radiology report dataset, we employ human-validated, multidimensional metrics—accuracy, clinical plausibility, and safety—to evaluate model performance. Contribution/Results: Our analysis reveals substantial inter-model disparities in medical terminology comprehension, causal reasoning, and safety boundary adherence. Notably, several models achieve clinically deployable performance across key metrics. This work establishes the first rigorous, domain-specific LLM evaluation framework for radiology, addressing a critical gap in medical AI assessment and providing empirical guidance for model selection and refinement in clinical deployment.
📝 Abstract
The rise of large language models (LLMs) has marked a pivotal shift in the field of natural language processing (NLP). LLMs have revolutionized a multitude of domains and have made a significant impact in the medical field. Large language models are now more abundant than ever, and many exhibit bilingual capabilities, proficient in both English and Chinese. However, a comprehensive evaluation of these models remains to be conducted, a gap that is especially apparent within the context of radiology NLP. This study seeks to bridge this gap by critically evaluating thirty-two LLMs on interpreting radiology reports, a crucial component of radiology NLP. Specifically, we assess each model's ability to derive impressions from radiologic findings. The outcomes of this evaluation provide key insights into the performance, strengths, and weaknesses of these LLMs, informing their practical applications within the medical domain.