AutoMedEval: Harnessing Language Models for Automatic Medical Capability Evaluation

📅 2025-05-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing automatic evaluation metrics (e.g., F1, ROUGE) neglect medical terminology accuracy, while human evaluation is costly and poorly scalable; mainstream LLM-based evaluators are either closed-source or lack medical domain expertise. Method: The authors propose AutoMedEval, the first open-source, 13B-parameter, medicine-specialized automatic evaluation model. It employs a hierarchical training paradigm integrating curriculum-based instruction tuning and iterative knowledge self-reflection, enabling robust medical assessment capability with minimal annotated data. Contribution/Results: Experiments across multiple medical QA tasks demonstrate that AutoMedEval achieves significantly higher correlation with human judgments than state-of-the-art automatic baselines. It exhibits high reliability, strong generalization across diverse medical tasks and models, and superior domain adaptability, substantially reducing reliance on manual evaluation while maintaining clinical fidelity.
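The summary's headline claim is higher correlation with human judgments than automatic baselines. As a minimal, hypothetical sketch of how such agreement is typically quantified (the score arrays below are invented placeholders, not data from the paper):

```python
# Hypothetical illustration of measuring evaluator-human agreement.
# The scores below are invented placeholders, not the paper's data.
from scipy.stats import pearsonr, spearmanr

human_scores = [4.5, 3.0, 2.0, 5.0, 3.5]  # human ratings of five answers
auto_scores = [4.2, 3.1, 2.4, 4.8, 3.3]   # an evaluator's scores for the same answers

rho, _ = spearmanr(human_scores, auto_scores)  # rank-order agreement
r, _ = pearsonr(human_scores, auto_scores)     # linear agreement
print(f"Spearman rho = {rho:.3f}, Pearson r = {r:.3f}")
```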

📝 Abstract
With the proliferation of large language models (LLMs) in the medical domain, there is increasing demand for improved evaluation techniques to assess their capabilities. However, traditional metrics like F1 and ROUGE, which rely on token overlap to measure quality, largely overlook the importance of medical terminology. While human evaluation tends to be more reliable, it can be very costly and may also suffer from inaccuracies due to limits in human expertise and motivation. Although some LLM-based evaluation methods exist, their usability in the medical field is limited by their proprietary nature or lack of domain expertise. To tackle these challenges, we present AutoMedEval, an open-source automatic evaluation model with 13B parameters specifically engineered to measure the question-answering proficiency of medical LLMs. The overarching objective of AutoMedEval is to assess the quality of responses produced by diverse models, with the aim of significantly reducing dependence on human evaluation. Specifically, we propose a hierarchical training method involving curriculum instruction tuning and an iterative knowledge introspection mechanism, enabling AutoMedEval to acquire professional medical assessment capabilities with limited instructional data. Human evaluations indicate that AutoMedEval surpasses other baselines in terms of correlation with human judgments.
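The abstract's iterative knowledge introspection mechanism is not specified in detail here; the following is a loose, hypothetical sketch of one way such a self-reflective evaluation loop could be structured (the `generate` callback and all prompt wording are assumptions, not the paper's interface):

```python
# Hypothetical sketch of an iterative knowledge-introspection loop.
# `generate` stands in for a real model call; the prompt wording is
# invented for illustration and is not taken from the paper.
from typing import Callable

def introspective_evaluation(
    generate: Callable[[str], str],
    question: str,
    answer: str,
    rounds: int = 2,
) -> str:
    # Draft an initial assessment of the candidate answer.
    assessment = generate(
        f"Assess the medical accuracy of this answer.\nQ: {question}\nA: {answer}"
    )
    for _ in range(rounds):
        # Ask the model to surface gaps in its own assessment ...
        gaps = generate(f"Which medical facts does this assessment miss?\n{assessment}")
        # ... then fold the reflection back into a revised assessment.
        assessment = generate(f"Revise this assessment to address:\n{gaps}\n\n{assessment}")
    return assessment
```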
Problem

Research questions and friction points this paper is trying to address.

Traditional metrics overlook key medical terminology when evaluating medical LLMs' capabilities
Human evaluation is costly and prone to inaccuracies in medical assessments
Existing LLM-based evaluation methods lack medical expertise or open accessibility
Innovation

Methods, ideas, or system contributions that make the work stand out.

Open-source 13B-parameter model for medical LLM evaluation
Hierarchical training with curriculum instruction tuning (see the sketch after this list)
Iterative knowledge introspection for medical assessment
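A minimal sketch of the curriculum idea referenced above, assuming examples are ordered easy to hard by a simple difficulty proxy (the length-based heuristic and staged slicing are illustrative assumptions; the paper's actual curriculum criterion is not given here):

```python
# Minimal sketch of curriculum-ordered instruction tuning. The
# difficulty heuristic (reference-answer length) is an assumption
# for illustration; the paper's actual criterion may differ.
from dataclasses import dataclass

@dataclass
class Example:
    question: str
    reference_answer: str

def difficulty(ex: Example) -> int:
    # Proxy: longer reference answers are treated as harder.
    return len(ex.reference_answer.split())

def curriculum_stages(examples: list[Example], stages: int = 3):
    # Sort easy -> hard, then yield progressively larger slices so
    # each stage adds harder items while revisiting easier ones.
    ordered = sorted(examples, key=difficulty)
    for s in range(1, stages + 1):
        yield ordered[: len(ordered) * s // stages]
```

Each stage's slice would then drive one instruction-tuning pass, so the model sees easy cases first and revisits them as harder ones are mixed in.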
Authors

Xiechi Zhang (East China Normal University)
Zetian Ouyang (East China Normal University)
Linlin Wang (East China Normal University)
Gerard de Melo (Professor at Hasso Plattner Institute / University of Potsdam; Artificial Intelligence, Natural Language Processing, Web Mining)
Zhu Cao (Tsinghua University)
Xiaoling Wang (East China Normal University)
Ya Zhang (Shanghai Jiao Tong University; Machine Learning, Computer Vision, Medical Imaging)
Yanfeng Wang (Shanghai Jiao Tong University)
Liang He (East China Normal University)