Ranking Generated Answers: On the Agreement of Retrieval Models with Humans on Consumer Health Questions

📅 2024-08-19
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of efficiently and cost-effectively evaluating answer quality in open-ended health-domain LLM question answering, this paper proposes a retrieval-ranking-based automatic assessment method. It is the first to adapt a supervised retrieval ranking model, trained on the CLEF 2021 eHealth dataset, to the task of discriminating LLM answer quality: document relevance annotations serve as an implicit model of answer credibility, removing the reliance on expensive expert judgments. The method achieves strong agreement with the preferences of a health-domain expert (Kendall's τ = 0.64), outperforming baseline approaches. Empirical analysis further confirms that both larger model size and more sophisticated prompting improve answer quality. This work establishes a scalable, reliable, and generalizable automated evaluation paradigm for domain-specific LLMs.

📝 Abstract
Evaluating the output of generative large language models (LLMs) is challenging and difficult to scale. Many evaluations of LLMs focus on tasks such as single-choice question-answering or text classification. These tasks are not suitable for assessing open-ended question-answering capabilities, which are critical in domains where expertise is required. One such domain is health, where misleading or incorrect answers can have a negative impact on a user's well-being. Using human experts to evaluate the quality of LLM answers is generally considered the gold standard, but expert annotation is costly and slow. We present a method for evaluating LLM answers that uses ranking models trained on annotated document collections as a substitute for explicit relevance judgements and apply it to the CLEF 2021 eHealth dataset. In a user study, our method correlates with the preferences of a human expert (Kendall's $\tau = 0.64$). It is also consistent with previous findings in that the quality of generated answers improves with the size of the model and more sophisticated prompting strategies.
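The abstract reports agreement with an expert as Kendall's $\tau = 0.64$. As a minimal illustration of what that statistic measures (not the paper's own code), the sketch below computes Kendall's tau-a between two score lists over the same answers: the fraction of concordant item pairs minus discordant pairs, normalized by the total number of pairs.

```python
from itertools import combinations

def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score lists over the same items.

    Returns 1.0 for identical orderings, -1.0 for fully reversed ones.
    Assumes no ties; equal lengths; at least two items.
    """
    n = len(scores_a)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        # A pair is concordant if both rankings order items i and j the same way.
        s = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical example: ranking-model scores vs. expert scores for four answers.
model_scores = [0.9, 0.7, 0.4, 0.1]
expert_scores = [4, 2, 3, 1]
print(kendall_tau(model_scores, expert_scores))
```

In practice, `scipy.stats.kendalltau` handles ties and significance testing; the hand-rolled version above only shows the pair-counting idea behind the agreement figure.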
Problem

Research questions and friction points this paper is trying to address.

Language Models
Healthcare
Quality Assessment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-assessment
Large Language Model
Healthcare Question Answering
Sebastian Heineking
Leipzig University
Jonas Probst
Leipzig University
Daniel Steinbach
University of Leipzig Medical Center
Martin Potthast
University of Kassel, hessian.AI, and ScaDS.AI
Information Retrieval, Natural Language Processing
Harrisen Scells
Leipzig University, University of Kassel and hessian.AI