🤖 AI Summary
Evaluating long-form clinical question-answering (QA) systems under resource constraints is challenging due to high domain specificity, low inter-annotator agreement (IAA) on manual long-answer annotation, and prohibitive annotation costs.
Method: We propose LongQAEval, a dual-granularity evaluation framework grounded in physician annotations of 300 real-world patient questions, featuring sentence-level and answer-level protocols to systematically assess correctness, relevance, and safety.
Contribution/Results: Sentence-level granularity significantly improves IAA for correctness, while answer-level granularity better supports relevance assessment. Crucially, annotating only a sparse subset of sentences achieves reliability comparable to labeling every sentence. Cross-dimension IAA analysis and quantification of sampling efficiency show that aligning evaluation granularity with each task dimension substantially reduces expert effort while preserving evaluation quality.
📝 Abstract
Evaluating long-form clinical question answering (QA) systems is resource-intensive and challenging: accurate judgments require medical expertise, and achieving consistent human judgments over long-form text is difficult. We introduce LongQAEval, an evaluation framework and a set of evaluation recommendations for limited-resource, high-expertise settings. Based on physician annotations of 300 real patient questions answered by physicians and LLMs, we compare coarse answer-level and fine-grained sentence-level evaluation along the dimensions of correctness, relevance, and safety. We find that inter-annotator agreement (IAA) varies by dimension: fine-grained annotation improves agreement on correctness, coarse annotation improves agreement on relevance, and judgments of safety remain inconsistent. Additionally, annotating only a small subset of sentences provides reliability comparable to coarse annotation, reducing cost and effort.
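To make the granularity comparison concrete, the sketch below computes Cohen's kappa (one common IAA statistic; the paper's exact metric is not specified here) for two annotators at answer level versus sentence level. The labels are entirely hypothetical and serve only to illustrate how agreement can be compared across the two protocols.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' categorical labels on the same items."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical data: answer-level gives one verdict per answer,
# sentence-level gives one verdict per sentence of the same answers.
answer_a = ["correct", "incorrect", "correct", "correct"]
answer_b = ["incorrect", "incorrect", "correct", "incorrect"]

sentence_a = ["correct"] * 8 + ["incorrect"] * 4
sentence_b = ["correct"] * 7 + ["incorrect"] * 5

print(f"answer-level kappa:   {cohens_kappa(answer_a, answer_b):.2f}")    # 0.20
print(f"sentence-level kappa: {cohens_kappa(sentence_a, sentence_b):.2f}")  # 0.82
```

With more, finer-grained units per answer, sentence-level labeling can yield a more stable agreement estimate for a dimension like correctness, which is the pattern the paper reports; the sparse-annotation result suggests kappa computed on a sampled subset of sentences can approximate the full-sentence estimate.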