A Scalable Framework for Evaluating Health Language Models

📅 2025-03-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Evaluating open-ended textual responses generated by large language models (LLMs) in healthcare—particularly for metabolic health conditions (e.g., diabetes, cardiovascular disease, obesity)—relies heavily on human experts, resulting in high costs and poor scalability. Method: This paper proposes an automated evaluation framework centered on Adaptive Precise Boolean rubrics, which replace Likert-scale assessments with a small set of targeted yes/no questions. The framework integrates human-in-the-loop evaluation, Boolean criteria modeling, and structured domain knowledge. Contribution/Results: Adaptive Precise Boolean rubrics significantly improve inter-rater reliability between expert and non-expert evaluators, cut evaluation time roughly in half relative to Likert-based assessment, and outperform conventional Likert approaches across multiple quality dimensions, including accuracy, personalization, and safety. The approach enables scalable, cost-effective deployment and meaningful participation by non-expert annotators.
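The adaptive flow the summary describes (ask a small set of precise yes/no questions, then drill into more granular follow-ups only where a gap is found) can be sketched as below. This is an illustrative sketch, not the paper's implementation; the `judge` interface, item names, and the keyword-matching stand-in are all assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class BooleanRubricItem:
    """One precise yes/no rubric question (names are illustrative)."""
    question: str
    # Follow-ups asked only when this item fails: the "adaptive" step.
    followups: list = field(default_factory=list)

def evaluate(response: str, rubric, judge) -> dict:
    """Answer each base question; on a 'no', ask the granular follow-ups.

    `judge(response, question)` is any callable returning True/False, e.g.
    a human annotator's click or an automated LLM call (assumed interface).
    """
    answers = {}
    for item in rubric:
        ok = judge(response, item.question)
        answers[item.question] = ok
        if not ok:  # gap identified: drill into the more precise questions
            for f in item.followups:
                answers[f] = judge(response, f)
    return answers

# Toy usage with a trivial keyword-matching stand-in for the judge.
rubric = [
    BooleanRubricItem(
        "Does the response mention the patient's biomarkers?",
        followups=["Does it mention HbA1c?", "Does it mention LDL cholesterol?"],
    ),
    BooleanRubricItem("Is the advice free of unsafe dosing recommendations?"),
]

def keyword_judge(response, question):
    # Checks only whether the question's last word appears in the response.
    keyword = question.split()[-1].rstrip("?").lower()
    return keyword in response.lower()

result = evaluate("Your HbA1c of 6.1% suggests prediabetes; ...", rubric, keyword_judge)
```

Because the base question about biomarkers fails under the toy judge, both follow-ups are asked; a passing base question would keep the rubric small, which is what makes the scheme cheap to administer.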

📝 Abstract
Large language models (LLMs) have emerged as powerful tools for analyzing complex datasets. Recent studies demonstrate their potential to generate useful, personalized responses when provided with patient-specific health information that encompasses lifestyle, biomarkers, and context. As LLM-driven health applications are increasingly adopted, rigorous and efficient evaluation methodologies are crucial to ensure response quality across multiple dimensions, including accuracy, personalization, and safety. Current evaluation practices for open-ended text responses heavily rely on human experts. This approach is often cost-prohibitive and labor-intensive, and it hinders scalability, especially in complex domains like healthcare where response assessment necessitates domain expertise and must consider multifaceted patient data. In this work, we introduce Adaptive Precise Boolean rubrics: an evaluation framework that streamlines human and automated evaluation of open-ended questions by identifying gaps in model responses using a minimal set of targeted rubric questions. Our approach is based on recent work in more general evaluation settings that contrasts a smaller set of complex evaluation targets with a larger set of more precise, granular targets answerable with simple boolean responses. We validate this approach in metabolic health, a domain encompassing diabetes, cardiovascular disease, and obesity. Our results demonstrate that Adaptive Precise Boolean rubrics yield higher inter-rater agreement among expert and non-expert human evaluators, and in automated assessments, compared to traditional Likert scales, while requiring approximately half the evaluation time of Likert-based methods. This enhanced efficiency, particularly in automated evaluation and non-expert contributions, paves the way for more extensive and cost-effective evaluation of LLMs in health.
Problem

Research questions and friction points this paper is trying to address.

Evaluating health LLM responses efficiently and rigorously
Reducing human expert reliance in health response assessment
Improving scalability of health LLM evaluation methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive Precise Boolean rubrics framework
Streamlines human and automated evaluation
Higher agreement, half evaluation time
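The "higher agreement" claim above is about inter-rater reliability on rubric answers. A standard way to quantify it for two raters is Cohen's kappa, sketched here from scratch on toy boolean answers; the expert/non-expert data are invented for illustration and carry no claim about the paper's actual numbers.

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two raters' labels over the same items.

    Corrects raw percent agreement for the agreement expected by chance.
    (Undefined when chance agreement is exactly 1, e.g. both raters constant.)
    """
    assert len(a) == len(b) and a
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[l] * cb[l] for l in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)

# Toy yes/no rubric answers from an expert and a non-expert (illustrative only).
expert     = [True, True, False, True, False, True, True, False]
non_expert = [True, True, False, True, True,  True, True, False]
kappa = cohens_kappa(expert, non_expert)  # 7/8 raw agreement, chance-corrected
```

Boolean rubric items make this comparison direct (each item is a binary label), whereas Likert scales require choosing between exact-match agreement and weighted variants, which is one plausible reason granular boolean questions yield cleaner reliability numbers.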
Neil Mallinar
UC San Diego
deep learning · learning theory · kernel methods · graph theory · natural language processing
A. Ali Heydari
Google Research
Xin Liu
Google Research
Anthony Z. Faranesh
Google Research
Brent Winslow
Google Research
Nova Hammerquist
Google Research
Benjamin Graef
Google Research (work done at Google via Vituity)
Cathy Speed
Google Research
Mark Malhotra
Google Research
Shwetak Patel
University of Washington (Washington Research Foundation Endowed Professor, Computer Science)
Ubiquitous Computing · Human-Computer Interaction · Sensors · Embedded Systems
Javier L. Prieto
Google Research
Daniel McDuff
Google and University of Washington
Affective Computing · Deep Learning · Human-Computer Interaction · Human-Centered AI · Computer Vision
Ahmed A. Metwally
Google Research