🤖 AI Summary
This work addresses the limited robustness of existing large language models (LLMs) in evaluating radiology reports across modalities and anatomical regions, where generalizable and reliable assessment metrics are lacking. The authors propose VERT, a novel LLM-driven evaluation metric, and systematically compare open- and closed-source models—including both reasoning and non-reasoning variants—while integrating few-shot learning, model ensembling, and parameter-efficient fine-tuning (e.g., of Qwen3-30B). For the first time, VERT demonstrates strong alignment between LLM-based judgments and expert radiologist assessments across multimodal, multi-region radiology reports. Experiments show that VERT improves correlation with physician judgments by up to 11.7% over the GREEN metric, and that fine-tuning Qwen3-30B on only 1,300 samples boosts performance by 25% while speeding up inference by 37.2×.
📝 Abstract
Current literature on radiology report evaluation has focused primarily on designing LLM-based metrics and fine-tuning small models for chest X-rays. However, it remains unclear whether these approaches are robust when applied to reports from other modalities and anatomies. Which model and prompt configurations are best suited to serve as LLM judges for radiology evaluation? We conduct a thorough correlation analysis between expert and LLM-based ratings. We compare three existing LLM-as-a-judge metrics (RadFact, GREEN, and FineRadScore) alongside VERT, our proposed LLM-based metric, using open- and closed-source models (reasoning and non-reasoning) of different sizes across two expert-annotated datasets, RadEval and RaTE-Eval, spanning multiple modalities and anatomies. We further evaluate few-shot approaches, ensembling, and parameter-efficient fine-tuning using RaTE-Eval. To better understand metric behavior, we perform a systematic error detection and categorization study to assess the alignment of these metrics with expert judgments and identify areas of lower and higher agreement. Our results show that VERT improves correlation with radiologist judgments by up to 11.7% relative to GREEN. Furthermore, fine-tuning Qwen3-30B yields gains of up to 25% using only 1,300 training samples, and the fine-tuned model reduces inference time by up to 37.2×. These findings highlight the effectiveness of LLM-based judges and demonstrate that reliable evaluation can be achieved with lightweight adaptation.
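The core analysis throughout is rank correlation between metric scores and expert ratings. As a minimal sketch of how such an agreement check works (the function name, toy scores, and expert ratings below are illustrative, not from the paper; published results typically use library implementations such as `scipy.stats.kendalltau`):

```python
def kendall_tau(metric_scores, expert_ratings):
    """Kendall's tau-a over paired ratings: fraction of concordant
    minus discordant pairs, in [-1, 1]. Assumes no missing values."""
    n = len(metric_scores)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            # Same ordering in both rankings -> concordant; opposite -> discordant.
            s = (metric_scores[i] - metric_scores[j]) * (expert_ratings[i] - expert_ratings[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical per-report scores from an LLM judge vs. expert ratings:
# one swapped pair out of five reports gives tau = (9 - 1) / 10 = 0.8.
print(kendall_tau([1, 2, 3, 5, 4], [1, 2, 3, 4, 5]))  # → 0.8
```

A higher tau means the metric orders report quality more like the radiologists do; this is the sense in which VERT's "up to 11.7%" improvement over GREEN is measured.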