🤖 AI Summary
The radiology NLP community lacks systematic evaluation of large language models (LLMs) on clinical report interpretation and impression generation. Method: We introduce the first radiology-specific, multilingual (Chinese–English), unified benchmark for comprehensive assessment of 32 LLMs on generating clinical impressions from imaging findings. Leveraging a standardized real-world radiology report dataset, we employ human-validated, multidimensional metrics—accuracy, clinical plausibility, and safety—to evaluate model performance. Contribution/Results: Our analysis reveals substantial inter-model disparities in medical terminology comprehension, causal reasoning, and safety boundary adherence. Notably, several models achieve clinically deployable performance across key metrics. This work establishes the first rigorous, domain-specific LLM evaluation framework for radiology, addressing a critical gap in medical AI assessment and providing empirical guidance for model selection and refinement in clinical deployment.
📝 Abstract
The rise of large language models (LLMs) has marked a pivotal shift in the field of natural language processing (NLP). LLMs have revolutionized a multitude of domains and have made a significant impact in the medical field. Large language models are now more abundant than ever, and many exhibit bilingual capabilities, proficient in both English and Chinese. However, a comprehensive evaluation of these models remains to be conducted, a gap that is especially apparent within the context of radiology NLP. This study seeks to bridge this gap by critically evaluating thirty-two LLMs on interpreting radiology reports, a crucial component of radiology NLP. Specifically, we assess each model's ability to derive impressions from radiologic findings. The outcomes of this evaluation provide key insights into the performance, strengths, and weaknesses of these LLMs, informing their practical applications within the medical domain.