🤖 AI Summary
Existing evaluation metrics for medical report generation focus mainly on keyword coverage, neglecting clinically critical fine-grained aspects such as lesion location, severity, and diagnostic uncertainty, which leaves reliability assessment incomplete. To address this, we propose the first multi-agent collaborative framework for trustworthy radiology report evaluation. The method performs fine-grained parsing at the anatomical and lesion level via named entity recognition (scored with an entity-level NER-F1) and introduces an LLM-driven subjective scoring module, enabling structured feedback and clinical interpretability. Evaluated on ReXVal and RadEvalX, the framework achieves Kendall correlation coefficients of 0.70 and 0.54 with expert ratings, substantially outperforming prevailing metrics. The code and an interactive demo system are publicly available.
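To illustrate the entity-level NER-F1 idea, here is a minimal sketch. The `(finding, location, severity)` tuples are hypothetical stand-ins for what the framework's parsing agents might extract; the actual NER model and attribute schema are not specified here, so this shows only the set-overlap F1 computation itself.

```python
def entity_f1(reference_entities, generated_entities):
    """Set-overlap F1 over extracted (finding, location, severity) tuples."""
    ref, gen = set(reference_entities), set(generated_entities)
    tp = len(ref & gen)  # entities matched exactly in both reports
    if tp == 0:
        return 0.0
    precision = tp / len(gen)
    recall = tp / len(ref)
    return 2 * precision * recall / (precision + recall)

# Hypothetical parsed entities from a reference and a generated report.
reference = {("effusion", "left", "small"), ("cardiomegaly", "-", "mild")}
generated = {("effusion", "left", "small"), ("pneumothorax", "right", "-")}
print(round(entity_f1(reference, generated), 3))  # → 0.5
```

Because the tuples encode location and severity alongside the finding, a report that names the right disease in the wrong location scores lower than one that matches all three attributes.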
📝 Abstract
Automatic medical report generation supports clinical diagnosis, reduces radiologists' workload, and holds promise for improving diagnostic consistency. However, existing evaluation metrics primarily assess how accurately generated reports cover key medical information relative to human-written references, while overlooking crucial details such as the location and certainty of reported abnormalities. These limitations hinder comprehensive assessment of the reliability of generated reports and pose risks when selecting them for clinical use. We therefore propose the Granular Explainable Multi-Agent Score (GEMA-Score), which combines objective quantification with subjective evaluation through a large-language-model-based multi-agent workflow. GEMA-Score parses structured reports and computes NER-F1 through interactive information exchange among agents to assess disease diagnosis, location, severity, and uncertainty. In addition, an LLM-based scoring agent evaluates completeness, readability, and clinical terminology while providing explanatory feedback. Extensive experiments show that GEMA-Score achieves the highest correlation with human expert evaluations on public datasets, demonstrating its effectiveness for clinical scoring (Kendall coefficient = 0.70 on ReXVal and 0.54 on RadEvalX). The project demo is available at: https://github.com/Zhenxuan-Zhang/GEMA_score.
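The reported agreement with experts is Kendall's rank correlation. A minimal pure-Python sketch of the tau-a statistic (concordant minus discordant pairs over all pairs) is shown below; the scores and ratings are made-up illustrative values, not results from ReXVal or RadEvalX.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / total pairs, no tie handling."""
    concordant = discordant = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        s = (xi - xj) * (yi - yj)  # >0 if the pair is ranked the same way by both
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / n_pairs

# Hypothetical metric scores and 1-5 expert ratings for five generated reports.
metric_scores  = [0.82, 0.45, 0.67, 0.91, 0.30]
expert_ratings = [4, 3, 2, 5, 1]
print(round(kendall_tau(metric_scores, expert_ratings), 2))  # → 0.8
```

A tau of 1.0 would mean the metric ranks every pair of reports exactly as the experts do, so the reported 0.70 on ReXVal indicates strong but imperfect agreement.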