🤖 AI Summary
This study addresses the lack of clinically oriented evaluation metrics for automated medical imaging report generation, particularly in histopathology. We propose HARE, the first evaluation framework centered on entities and relations. Methodologically, we build two complementary models on GatorTronS, HARE-NER and HARE-RE, which perform named entity recognition and relation extraction, respectively, and we introduce an interpretable HARE scoring metric. We also release the first large-scale, expert-annotated pathology report dataset with entity and relation annotations. Experiments show that HARE-NER and HARE-RE achieve an F1-score of 0.915 on entity and relation identification. Moreover, the HARE metric significantly outperforms ROUGE, METEOR, and the radiology-specific RadGraph-XL in expert-assessed clinical relevance, and it fits expert evaluations more closely in regression analysis. This work fills a critical gap in quantitatively assessing the clinical quality of pathology report generation and establishes a new benchmark for clinical evaluation in the field.
📝 Abstract
Medical domain automated text generation is an active area of research and development; however, evaluating the clinical quality of generated reports remains a challenge, especially where domain-specific metrics are lacking, e.g., in histopathology. We propose HARE (Histopathology Automated Report Evaluation), a novel entity- and relation-centric framework composed of a benchmark dataset, a named entity recognition (NER) model, a relation extraction (RE) model, and a novel metric that prioritizes clinically relevant content by aligning critical histopathology entities and relations between reference and generated reports. To build the HARE benchmark, we annotated 813 de-identified clinical diagnostic histopathology reports and 652 histopathology reports from The Cancer Genome Atlas (TCGA) with domain-specific entities and relations. We fine-tuned GatorTronS, a domain-adapted language model, to develop HARE-NER and HARE-RE, which achieved the highest overall F1-score (0.915) among the tested models. The proposed HARE metric outperformed traditional metrics such as ROUGE and METEOR, as well as radiology metrics such as RadGraph-XL, with the highest correlation and the best regression against expert evaluations (exceeding the second-best method, GREEN, a large language model based radiology report evaluator, by Pearson $r = 0.168$, Spearman $\rho = 0.161$, Kendall $\tau = 0.123$, $R^2 = 0.176$, and $RMSE = 0.018$). We release HARE, the datasets, and the models at https://github.com/knowlab/HARE to foster advancements in histopathology report generation, providing a robust framework for improving report quality.
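The alignment idea behind an entity- and relation-centric metric can be illustrated with a minimal sketch. This is a hypothetical simplification, not the released HARE implementation: the real metric scores entities and relations predicted by HARE-NER and HARE-RE, whereas here `hare_like_score`, the exact-match comparison, and the weight `w` are all illustrative assumptions.

```python
from collections import Counter

def f1_overlap(reference, generated):
    """F1 over the multiset overlap of extracted items (entities or relations)."""
    ref, gen = Counter(reference), Counter(generated)
    overlap = sum((ref & gen).values())  # items found in both reports
    if overlap == 0:
        return 0.0
    precision = overlap / sum(gen.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def hare_like_score(ref_ents, gen_ents, ref_rels, gen_rels, w=0.5):
    """Weighted combination of entity-level and relation-level alignment F1.
    The weighting scheme (w) and exact string matching are illustrative only."""
    return w * f1_overlap(ref_ents, gen_ents) + (1 - w) * f1_overlap(ref_rels, gen_rels)

# Entities as (text, type) pairs; relations as (head, relation, tail) triples.
ref_ents = [("invasive ductal carcinoma", "DIAGNOSIS"), ("grade 2", "GRADE")]
gen_ents = [("invasive ductal carcinoma", "DIAGNOSIS"), ("grade 3", "GRADE")]
ref_rels = [("invasive ductal carcinoma", "has_grade", "grade 2")]
gen_rels = [("invasive ductal carcinoma", "has_grade", "grade 3")]

# The diagnosis entity aligns; the grade entity and relation do not.
score = hare_like_score(ref_ents, gen_ents, ref_rels, gen_rels)  # 0.25
```

Scoring matched entities and relations separately is what lets such a metric penalize a generated report that mentions the right entities but links them incorrectly, which string-overlap metrics like ROUGE cannot distinguish.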