🤖 AI Summary
This work addresses a key limitation of existing biomedical question-answering approaches: they treat retrieved literature as flat text and therefore leave the reliability and uncertainty of supporting evidence implicit. The authors propose BELIEF, a framework that first converts documents into structured evidence objects and then reasons over them along two complementary paths: a symbolic path grounded in Dempster–Shafer evidence theory and a neural path based on large language models. A reliability-aware arbitration module then reconciles the outputs of both paths. By making evidence attributes, path disagreement, and decision uncertainty explicit, the design improves the utilization of retrieved evidence. Evaluated on PubMedQA, MedQA, and MedMCQA with five general-purpose LLM backbones, BELIEF obtains the best result in 25 of 30 backbone–dataset–metric settings. It is competitive with specialized biomedical models on MedQA and MedMCQA, while specialized biomedical pretraining remains advantageous on PubMedQA.
📄 Abstract
Biomedical question answering often requires decisions from retrieved literature whose relevance, quality, and support for candidate answers are uneven. Most retrieval-augmented large language model (LLM) methods feed this literature to the model as flat text, leaving evidence reliability and remaining uncertainty largely implicit. We propose BELIEF, a structured evidence modeling and uncertainty-aware fusion framework for closed-set biomedical question answering. Rather than treating retrieved documents as undifferentiated context, BELIEF converts them into evidence objects that record clinical attributes, source quality, question relevance, support strength, and the associated candidate hypothesis. These evidence objects provide a shared basis for two complementary reasoning paths. The symbolic path constructs reliability-weighted basic probability assignments based on Dempster–Shafer (D-S) theory over a finite answer space and performs uncertainty-aware symbolic evidence fusion to estimate belief and residual uncertainty. The neural path uses the same structured evidence for LLM-based semantic inference, while a reliability-aware arbitration module reconciles the symbolic and neural outputs according to belief strength, uncertainty, evidence reliability, and semantic consistency. Experiments on PubMedQA, MedQA, and MedMCQA with five general-purpose LLM backbones show that BELIEF obtains the best result in 25 of 30 backbone–dataset–metric settings. Comparisons with biomedical-domain models indicate that BELIEF is competitive on MedQA and MedMCQA, while specialized biomedical pretraining remains advantageous on PubMedQA. Ablation, complementarity, uncertainty-stratified, and cost analyses further show that BELIEF improves retrieved-evidence utilization by making evidence structure, path disagreement, and decision uncertainty explicit.
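To make the symbolic path more concrete, the following is a minimal sketch of reliability-weighted basic probability assignments (BPAs) fused with Dempster's rule of combination over a finite answer space. It is not the authors' implementation: the abstract does not specify how BELIEF builds its BPAs, so the `Evidence` fields, the choice to multiply support strength by reliability (in the spirit of classical Shafer discounting), and the helper names `to_bpa`, `combine`, `fuse`, and `belief` are all illustrative assumptions.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Evidence:
    """Hypothetical evidence object; the field names are assumptions."""
    hypothesis: str     # candidate answer this evidence supports
    support: float      # support strength in [0, 1]
    reliability: float  # source quality / relevance weight in [0, 1]


def to_bpa(ev, frame):
    """Reliability-weighted BPA (assumed form): the discounted mass goes to
    the supported hypothesis, the remainder to the whole frame (ignorance).
    Assumes the frame contains more than one candidate answer."""
    mass = ev.support * ev.reliability
    return {frozenset([ev.hypothesis]): mass, frame: 1.0 - mass}


def combine(m1, m2):
    """Dempster's rule: multiply masses, intersect focal sets, and
    renormalize by the mass that did not fall on the empty set."""
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            conflict += wa * wb
    if conflict >= 1.0:
        raise ValueError("Total conflict: Dempster's rule is undefined")
    norm = 1.0 - conflict
    return {s: w / norm for s, w in combined.items()}


def fuse(evidence, frame):
    """Fold all evidence into one BPA, starting from total ignorance."""
    fused = {frame: 1.0}
    for ev in evidence:
        fused = combine(fused, to_bpa(ev, frame))
    return fused


def belief(m, hypotheses):
    """Belief in a set of answers: total mass of focal sets contained in it."""
    target = frozenset(hypotheses)
    return sum(w for s, w in m.items() if s <= target)


if __name__ == "__main__":
    # Answer space of a yes/no/maybe task such as PubMedQA.
    frame = frozenset({"yes", "no", "maybe"})
    evidence = [
        Evidence("yes", support=0.8, reliability=0.9),
        Evidence("yes", support=0.6, reliability=0.5),
        Evidence("no", support=0.7, reliability=0.4),
    ]
    fused = fuse(evidence, frame)
    for answer in sorted(frame):
        print(answer, round(belief(fused, {answer}), 4))
    # Mass left on the whole frame is the residual uncertainty.
    print("residual uncertainty", round(fused[frame], 4))
```

In this sketch, low-reliability evidence moves most of its mass to the full answer set rather than to a specific answer, so it shifts the fused belief only slightly. The mass remaining on the full frame after fusion is one plausible reading of the "residual uncertainty" that the arbitration module could consult.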