Automatic Evaluation of Healthcare LLMs Beyond Question-Answering

📅 2025-02-10
🤖 AI Summary
Medical large language models (LLMs) have long suffered from a fragmented evaluation paradigm—open-ended assessment emphasizes paraphrasing and lacks verifiability, while closed-ended assessment prioritizes factual accuracy but neglects generative fluency; their correlation and coverage gaps remain poorly characterized. To address this, we propose a multidimensional medical LLM evaluation framework: (1) a novel open-generation metric, Relaxed Perplexity, which relaxes strict token-order constraints of conventional perplexity; (2) CareQA, a hybrid benchmark jointly optimizing factual correctness and expressive richness; and (3) the first empirical characterization of coupling relationships and coverage deficiencies across multi-axis evaluation metrics in medical LLM assessment. Experiments demonstrate that Relaxed Perplexity achieves a 23.6% improvement in inter-physician rating consistency over baseline metrics, and CareQA significantly enhances the joint evaluation of factual grounding and generative quality.

📝 Abstract
Current Large Language Model (LLM) benchmarks are often based on open-ended or close-ended QA evaluations, avoiding the need for human labor. Close-ended measurements evaluate the factuality of responses but lack expressiveness. Open-ended measurements capture the model's capacity to produce discourse responses but are harder to assess for correctness. These two approaches are commonly used, either independently or together, though their relationship remains poorly understood. This work focuses on the healthcare domain, where both factuality and discourse matter greatly. It introduces a comprehensive, multi-axis suite for healthcare LLM evaluation, exploring correlations between open-ended and close-ended benchmarks and metrics. Findings include blind spots and overlaps in current methodologies. As an updated sanity check, we release a new medical benchmark, CareQA, with both open and closed variants. Finally, we propose a novel metric for open-ended evaluations, Relaxed Perplexity, to mitigate the identified limitations.
Problem

Research questions and friction points this paper is trying to address.

Evaluate healthcare LLMs beyond QA benchmarks
Identify gaps in current LLM evaluation methods
Propose new metric for open-ended assessment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-axis evaluation suite
CareQA medical benchmark
Relaxed Perplexity metric
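
For context on the last item, conventional perplexity is the exponential of the mean negative log-likelihood per token. The sketch below shows only that conventional quantity. It is not the paper's Relaxed Perplexity, whose definition is not given on this page. The function name `perplexity` and the sample log-probabilities are illustrative assumptions.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Conventional perplexity: exp of the mean negative log-likelihood.

    `token_logprobs` holds natural-log probabilities that a model assigned
    to each token of a reference text (hypothetical input for illustration).
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# Made-up example: a model that gives every token probability 0.25
# is as uncertain as a uniform choice among 4 options.
uniform_logprobs = [math.log(0.25)] * 4
print(perplexity(uniform_logprobs))  # close to 4
```

Lower values mean the model found the reference text less surprising; a model that assigns probability 1 to every token reaches the minimum of 1.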
Anna Arias-Duart
Barcelona Supercomputing Center (BSC)
Artificial Intelligence
Pablo Agustin Martin-Torres
Barcelona Supercomputing Center (BSC)
Daniel Hinjos
Research Engineer, Barcelona Supercomputing Center
Artificial Intelligence, Deep Learning, Interpretability, Bioinformatics
Pablo Bernabeu-Perez
Barcelona Supercomputing Center (BSC)
Lucia Urcelay Ganzabal
ML Research Scientist, Center for Genomic Regulation
Protein Design, Generative Models, Deep Learning, Foundation Models
Marta Gonzalez Mallo
Barcelona Supercomputing Center (BSC)
Ashwin Kumar Gururajan
Barcelona Supercomputing Center (BSC)
Enrique Lopez-Cuena
Barcelona Supercomputing Center (BSC)
Sergio Alvarez-Napagao
Barcelona Supercomputing Center (BSC), Universitat Politècnica de Catalunya (UPC)–BarcelonaTech
Dario Garcia-Gasulla
Barcelona Supercomputing Center (BSC)