🤖 AI Summary
This study addresses the challenge of improving answer accuracy in clinical question answering over electronic health records (EHRs), where precise retrieval of contextually relevant sentences and generation of concise, traceable answers are critical. We propose a two-stage LLM pipeline: (1) fine-grained sentence retrieval via few-shot prompting and self-consistent sampling, augmented by a dynamic-threshold sentence classification mechanism; and (2) answer generation conditioned on retrieved sentences. A key empirical finding is that an 8B-parameter model substantially outperforms a 70B-parameter model on EHR sentence-level retrieval—highlighting that retrieval precision, rather than model scale, governs downstream answer quality. Our approach achieves state-of-the-art performance on the ArchEHR-QA 2025 benchmark, demonstrating that lightweight models combined with controllable, reasoning-aware retrieval strategies yield superior efficacy and practicality in medical-domain QA.
📝 Abstract
We describe our system for the ArchEHR-QA Shared Task on answering clinical questions using electronic health records (EHRs). Our approach uses large language models in two steps: first, to find the sentences in the EHR that are relevant to a clinician's question, and second, to generate a short, citation-supported response grounded in those sentences. To improve the sentence classification step, which decides whether each sentence is essential, we use few-shot prompting, self-consistency, and thresholding. We compare several models and find that a smaller 8B model outperforms a larger 70B model at identifying relevant information. Our results show that accurate sentence selection is critical for generating high-quality responses, and that self-consistency with thresholding makes these decisions more reliable.
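The self-consistency-with-thresholding idea can be sketched as follows: the model classifies each sentence several times under sampling, and a sentence is kept as essential only if the fraction of "essential" votes clears a threshold. This is a minimal illustration under stated assumptions; the vote labels, the `keep_sentence` helper, and the fixed threshold value are all hypothetical, and the paper's dynamic-threshold mechanism is not reproduced here.

```python
def keep_sentence(votes, threshold=0.6):
    """Aggregate repeated sampled classifications of one EHR sentence.

    votes: labels sampled from an LLM classifier for the same sentence,
    e.g. "essential", "supplementary", "not-relevant" (illustrative labels).
    Returns (keep, share): keep is True only when the share of
    "essential" votes reaches the threshold.
    """
    share = votes.count("essential") / len(votes)
    return share >= threshold, share

# Hypothetical sampled votes for two sentences (5 samples each).
votes_a = ["essential"] * 4 + ["supplementary"]
votes_b = ["essential", "essential", "supplementary",
           "not-relevant", "supplementary"]

print(keep_sentence(votes_a))  # (True, 0.8): consistent votes, kept
print(keep_sentence(votes_b))  # (False, 0.4): too inconsistent, dropped
```

Raising the threshold trades recall for precision: only sentences the model labels consistently across samples survive, which is the reliability effect the abstract attributes to self-consistency with thresholding.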