Neural at ArchEHR-QA 2026: One Method Fits All: Unified Prompt Optimization for Clinical QA over EHRs

📅 2026-05-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses key challenges in automated question answering over electronic health records (EHRs), including evidence retrieval, answer faithfulness, and clinical grounding, by proposing a modular, fine-tuning-free prompt optimization framework. The approach decomposes clinical QA into four stages that mirror the shared-task subtasks: question interpretation, evidence identification, answer generation, and evidence alignment. Using DSPy's MIPROv2 optimizer, the framework automatically discovers high-performing prompts for each stage, and it adds self-consistency voting and stage-specific verification mechanisms to make the outputs more reliable. On the four subtasks of the ArchEHR-QA 2026 shared task, the method places 4th, 1st, 4th, and 7th, for a mean rank of 4.00, which is second overall among the teams that participated in all four subtasks. The authors take this as evidence that systematic per-stage prompt optimization is a cost-effective alternative to model fine-tuning for complex clinical QA.
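As a quick check of the reported figure, the mean rank follows directly from the four per-subtask placements. The snippet below is only an arithmetic illustration; the variable names are ours, not the authors'.

```python
# Per-subtask placements reported for Subtasks 1-4.
subtask_ranks = {1: 4, 2: 1, 3: 4, 4: 7}

# Mean rank is the plain arithmetic average over the four subtasks.
mean_rank = sum(subtask_ranks.values()) / len(subtask_ranks)

print(f"mean rank = {mean_rank:.2f}")  # mean rank = 4.00
```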

📝 Abstract

Automated question answering (QA) over electronic health records (EHRs) demands precise evidence retrieval, faithful answer generation, and explicit grounding of answers in clinical notes. In this work, we present Neural1.5, our method for the ArchEHR-QA 2026 shared task at CL4Health@LREC 2026, which comprises four subtasks: question interpretation, evidence identification, answer generation, and evidence alignment. Our approach decouples the task into independent, modular stages and employs DSPy's MIPROv2 optimizer to automatically discover high-performing prompts, jointly tuning instructions and few-shot demonstrations for each stage. Within every stage, self-consistency voting over multiple stochastic inference runs suppresses spurious errors and improves reliability, while stage-specific verification mechanisms (e.g., self-reflection and chain-of-verification for alignment) further refine output quality. Among all teams that participated in all four subtasks, our method ranks second overall (mean rank 4.00), placing 4th, 1st, 4th, and 7th on Subtasks 1-4, respectively. These results demonstrate that systematic, per-stage prompt optimization combined with self-consistency mechanisms is a cost-effective alternative to model fine-tuning for multifaceted clinical QA.
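To make the self-consistency idea concrete, here is a minimal sketch of voting over several stochastic runs. It is not the authors' implementation: the function names, the strict-majority threshold, and the representation of evidence as sets of sentence IDs are all assumptions made for illustration, and the runs are hard-coded rather than produced by a language model.

```python
from collections import Counter
from typing import Hashable, List, Set, Tuple


def majority_vote(answers: List[Hashable]) -> Tuple[Hashable, float]:
    """Return the most frequent answer and the share of runs that agreed with it."""
    if not answers:
        raise ValueError("need at least one run to vote over")
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / len(answers)


def vote_evidence(runs: List[Set[int]], threshold: float = 0.5) -> List[int]:
    """Keep sentence IDs selected in more than `threshold` of the runs.

    Each element of `runs` is the set of sentence IDs that one stochastic
    run marked as evidence (a hypothetical output format).
    """
    if not runs:
        raise ValueError("need at least one run to vote over")
    counts = Counter(sid for run in runs for sid in run)
    return sorted(sid for sid, c in counts.items() if c / len(runs) > threshold)


# Five illustrative runs of a hypothetical evidence-identification stage.
runs = [{1, 3, 5}, {1, 3}, {1, 4, 5}, {1, 3, 5}, {2, 3}]
print(vote_evidence(runs))  # [1, 3, 5]

# Label-style voting, e.g. for a single categorical decision.
print(majority_vote(["relevant", "relevant", "not-relevant"])[0])  # relevant
```

In this sketch, sentences 2 and 4 each appear in only one of five runs and are dropped, which is the sense in which voting suppresses spurious single-run errors. The threshold is a tunable assumption; a lower value would trade precision for recall.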

Problem

Research questions and friction points this paper is trying to address.

clinical QA

electronic health records

evidence retrieval

answer generation

evidence grounding

Innovation

Methods, ideas, or system contributions that make the work stand out.

prompt optimization

modular QA pipeline

self-consistency voting