A Scalable Framework for Evaluating Health Language Models

📅 2025-03-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Evaluating open-ended textual responses generated by large language models (LLMs) in healthcare—particularly for metabolic health conditions (e.g., diabetes, cardiovascular disease, obesity)—relies heavily on human experts, resulting in high costs and poor scalability. Method: This paper proposes an automated evaluation framework centered on Adaptive Precise Boolean rubrics, which replace Likert-scale assessments with a small set of targeted yes/no questions. The framework integrates human-in-the-loop evaluation, Boolean criteria modeling, and structured domain knowledge. Contribution/Results: Adaptive Precise Boolean rubrics significantly improve inter-rater reliability between expert and non-expert evaluators, cut evaluation time roughly in half relative to Likert-based assessment, and outperform conventional Likert approaches across multiple quality dimensions, including accuracy, personalization, and safety. The approach enables scalable, cost-effective deployment and meaningful participation by non-expert annotators.
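The adaptive flow the summary describes (ask a small set of precise yes/no questions, then drill into more granular follow-ups only where a gap is found) can be sketched as below. This is an illustrative sketch, not the paper's implementation; the `judge` interface, item names, and the keyword-matching stand-in are all assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class BooleanRubricItem:
    """One precise yes/no rubric question (names are illustrative)."""
    question: str
    # Follow-ups asked only when this item fails: the "adaptive" step.
    followups: list = field(default_factory=list)

def evaluate(response: str, rubric, judge) -> dict:
    """Answer each base question; on a 'no', ask the granular follow-ups.

    `judge(response, question)` is any callable returning True/False, e.g.
    a human annotator's click or an automated LLM call (assumed interface).
    """
    answers = {}
    for item in rubric:
        ok = judge(response, item.question)
        answers[item.question] = ok
        if not ok:  # gap identified: drill into the more precise questions
            for f in item.followups:
                answers[f] = judge(response, f)
    return answers

# Toy usage with a trivial keyword-matching stand-in for the judge.
rubric = [
    BooleanRubricItem(
        "Does the response mention the patient's biomarkers?",
        followups=["Does it mention HbA1c?", "Does it mention LDL cholesterol?"],
    ),
    BooleanRubricItem("Is the advice free of unsafe dosing recommendations?"),
]

def keyword_judge(response, question):
    # Checks only whether the question's last word appears in the response.
    keyword = question.split()[-1].rstrip("?").lower()
    return keyword in response.lower()

result = evaluate("Your HbA1c of 6.1% suggests prediabetes; ...", rubric, keyword_judge)
```

Because the base question about biomarkers fails under the toy judge, both follow-ups are asked; a passing base question would keep the rubric small, which is what makes the scheme cheap to administer.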

📝 Abstract
Large language models (LLMs) have emerged as powerful tools for analyzing complex datasets. Recent studies demonstrate their potential to generate useful, personalized responses when provided with patient-specific health information that encompasses lifestyle, biomarkers, and context. As LLM-driven health applications are increasingly adopted, rigorous and efficient evaluation methodologies are crucial to ensure response quality across multiple dimensions, including accuracy, personalization, and safety. Current evaluation practices for open-ended text responses heavily rely on human experts. This approach is often cost-prohibitive and labor-intensive, and it hinders scalability, especially in complex domains like healthcare where response assessment necessitates domain expertise and must consider multifaceted patient data. In this work, we introduce Adaptive Precise Boolean rubrics: an evaluation framework that streamlines human and automated evaluation of open-ended questions by identifying gaps in model responses using a minimal set of targeted rubric questions. Our approach is based on recent work in more general evaluation settings that contrasts a smaller set of complex evaluation targets with a larger set of more precise, granular targets answerable with simple boolean responses. We validate this approach in metabolic health, a domain encompassing diabetes, cardiovascular disease, and obesity. Our results demonstrate that Adaptive Precise Boolean rubrics yield higher inter-rater agreement among expert and non-expert human evaluators, and in automated assessments, compared to traditional Likert scales, while requiring approximately half the evaluation time of Likert-based methods. This enhanced efficiency, particularly in automated evaluation and non-expert contributions, paves the way for more extensive and cost-effective evaluation of LLMs in health.
Problem

Research questions and friction points this paper is trying to address.

Evaluating health LLM responses efficiently and rigorously
Reducing human expert reliance in health response assessment
Improving scalability of health LLM evaluation methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive Precise Boolean rubrics framework
Streamlines human and automated evaluation
Higher agreement, half evaluation time
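The "higher agreement" claim above is about inter-rater reliability on rubric answers. A standard way to quantify it for two raters is Cohen's kappa, sketched here from scratch on toy boolean answers; the expert/non-expert data are invented for illustration and carry no claim about the paper's actual numbers.

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two raters' labels over the same items.

    Corrects raw percent agreement for the agreement expected by chance.
    (Undefined when chance agreement is exactly 1, e.g. both raters constant.)
    """
    assert len(a) == len(b) and a
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[l] * cb[l] for l in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)

# Toy yes/no rubric answers from an expert and a non-expert (illustrative only).
expert     = [True, True, False, True, False, True, True, False]
non_expert = [True, True, False, True, True,  True, True, False]
kappa = cohens_kappa(expert, non_expert)  # 7/8 raw agreement, chance-corrected
```

Boolean rubric items make this comparison direct (each item is a binary label), whereas Likert scales require choosing between exact-match agreement and weighted variants, which is one plausible reason granular boolean questions yield cleaner reliability numbers.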
Neil Mallinar
UC San Diego
deep learning · learning theory · kernel methods · graph theory · natural language processing
A. Ali Heydari
Google Research
Xin Liu
Google Research
Anthony Z. Faranesh
Google Research
Brent Winslow
Google Research
Nova Hammerquist
Google Research
Benjamin Graef
Google Research (work done at Google via Vituity)
Cathy Speed
Google Research
Mark Malhotra
Google Research
Shwetak Patel
University of Washington (Washington Research Foundation Endowed Professor, Computer Science)
Ubiquitous Computing · Human-Computer Interaction · Sensors · Embedded Systems
Javier L. Prieto
Google Research
Daniel McDuff
Google and University of Washington
Affective Computing · Deep Learning · Human-Computer Interaction · Human-Centered AI · Computer Vision
Ahmed A. Metwally
Google Research