🤖 AI Summary
Evaluating long-form clinical question-answering (QA) systems under resource constraints is challenging due to high domain specificity, low inter-annotator agreement (IAA) on manual long-answer annotation, and prohibitive annotation costs.
Method: We propose LongQAEval, a dual-granularity evaluation framework grounded in physician annotations of 300 real-world patient questions, featuring sentence-level and answer-level protocols to systematically assess correctness, relevance, and safety.
Contribution/Results: Sentence-level granularity significantly improves IAA for correctness, while answer-level granularity better supports relevance assessment. Crucially, annotating only a sparse subset of sentences achieves reliability comparable to labeling every sentence. Cross-dimension IAA analysis and quantification of sampling efficiency show that aligning evaluation granularity with each task dimension substantially reduces expert effort while preserving evaluation quality.
📝 Abstract
Evaluating long-form clinical question answering (QA) systems is resource-intensive and challenging: accurate judgments require medical expertise, and achieving consistent human judgments over long-form text is difficult. We introduce LongQAEval, an evaluation framework and a set of evaluation recommendations for limited-resource, high-expertise settings. Based on physician annotations of 300 real patient questions answered by physicians and LLMs, we compare coarse answer-level and fine-grained sentence-level evaluation along the dimensions of correctness, relevance, and safety. We find that inter-annotator agreement (IAA) varies by dimension: fine-grained annotation improves agreement on correctness, coarse annotation improves agreement on relevance, and judgments of safety remain inconsistent. Additionally, annotating only a small subset of sentences provides reliability comparable to coarse annotation, reducing cost and effort.
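To make the granularity comparison concrete, the sketch below computes Cohen's kappa (one common IAA statistic; the paper's exact metric is not specified here) for two annotators at answer level versus sentence level. The labels are entirely hypothetical and serve only to illustrate how agreement can be compared across the two protocols.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' categorical labels on the same items."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical data: answer-level gives one verdict per answer,
# sentence-level gives one verdict per sentence of the same answers.
answer_a = ["correct", "incorrect", "correct", "correct"]
answer_b = ["incorrect", "incorrect", "correct", "incorrect"]

sentence_a = ["correct"] * 8 + ["incorrect"] * 4
sentence_b = ["correct"] * 7 + ["incorrect"] * 5

print(f"answer-level kappa:   {cohens_kappa(answer_a, answer_b):.2f}")    # 0.20
print(f"sentence-level kappa: {cohens_kappa(sentence_a, sentence_b):.2f}")  # 0.82
```

With more, finer-grained units per answer, sentence-level labeling can yield a more stable agreement estimate for a dimension like correctness, which is the pattern the paper reports; the sparse-annotation result suggests kappa computed on a sampled subset of sentences can approximate the full-sentence estimate.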