🤖 AI Summary
To address the challenge of efficiently and cost-effectively evaluating answer quality in open-ended health-domain LLM question answering, this paper proposes a retrieval-ranking-based automatic assessment method. It is the first to adapt a supervised retrieval ranking model—trained on the CLEF 2021 eHealth dataset—to the task of LLM answer quality discrimination, implicitly modeling answer credibility via document relevance annotations and thereby eliminating reliance on expensive expert judgments. The method aligns strongly with the preferences of a health-domain expert (Kendall’s τ = 0.64), significantly outperforming baseline approaches. Empirical analysis further confirms that both larger model scale and more sophisticated prompting improve answer quality. This work establishes a scalable, reliable, and generalizable automated evaluation paradigm for domain-specific LLMs.
📝 Abstract
Evaluating the output of generative large language models (LLMs) is challenging and difficult to scale. Many evaluations of LLMs focus on tasks such as single-choice question-answering or text classification. These tasks are not suitable for assessing open-ended question-answering capabilities, which are critical in domains where expertise is required. One such domain is health, where misleading or incorrect answers can have a negative impact on a user's well-being. Using human experts to evaluate the quality of LLM answers is generally considered the gold standard, but expert annotation is costly and slow. We present a method for evaluating LLM answers that uses ranking models trained on annotated document collections as a substitute for explicit relevance judgements and apply it to the CLEF 2021 eHealth dataset. In a user study, our method correlates with the preferences of a human expert (Kendall's $\tau = 0.64$). It is also consistent with previous findings in that the quality of generated answers improves with the size of the model and more sophisticated prompting strategies.
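To make the reported agreement measure concrete: Kendall's τ compares two score lists pairwise, counting concordant minus discordant pairs over all pairs. The sketch below is illustrative only, not the paper's implementation; the answer lists and scores are hypothetical, and ties are ignored (tau-a).

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / total pairs.

    x and y are score lists for the same items (no tie correction).
    """
    n = len(x)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical relevance scores a ranking model assigns to five LLM answers
model_scores = [0.91, 0.72, 0.55, 0.40, 0.18]
# Hypothetical quality ratings a human expert gives the same five answers
expert_scores = [5, 3, 4, 2, 1]

print(kendall_tau(model_scores, expert_scores))  # one swapped pair -> 0.8
```

A τ of 1.0 means the ranking model orders the answers exactly as the expert does, and −1.0 means the orderings are reversed; the paper's τ = 0.64 indicates substantially more concordant than discordant answer pairs.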