Understanding the Impact of Confidence in Retrieval Augmented Generation: A Case Study in the Medical Domain

📅 2024-12-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the calibration problem—i.e., the alignment between output confidence scores and actual accuracy—in Retrieval-Augmented Generation (RAG) systems for healthcare applications. Motivated by the lack of quantitative analysis on confidence reliability in medical settings, we conduct the first systematic evaluation of confidence calibration across multiple large language models (LLaMA-3, Qwen2), retrieval configurations (BM25, DPR), and prompt templates, using Expected Calibration Error (ECE) and Adaptive Calibration Error (ACE) as primary metrics. Results reveal pervasive overconfidence, with calibration performance critically dependent on model architecture, retrieval ranking quality, and prompt design. Notably, we demonstrate that the ordering of retrieved documents can actively modulate generation confidence. Our work establishes a reproducible calibration assessment framework and provides actionable configuration guidelines—spanning model selection, retrieval strategy, and prompting—for trustworthy deployment of high-stakes medical RAG systems.

📝 Abstract
Retrieval Augmented Generation (RAG) complements the knowledge of Large Language Models (LLMs) by leveraging external information to enhance response accuracy. Because it can inject the most up-to-date information, this approach is widely applied across fields, and researchers are working to understand and improve it to unlock the full potential of RAG in high-stakes applications. However, the mechanisms behind the confidence levels of RAG outputs remain underexplored, even though output confidence is critical in domains such as finance, healthcare, and medicine. Our study examines the impact of RAG on confidence in the medical domain under various configurations and models. We evaluate confidence by treating the model's predicted probability as its output and computing Expected Calibration Error (ECE) and Adaptive Calibration Error (ACE) scores from the probabilities and accuracy. In addition, we analyze whether the order of retrieved documents within prompts affects confidence calibration. Our findings reveal large variation in confidence and accuracy depending on the model, settings, and format of input prompts. These results underscore the necessity of optimizing configurations for the specific model and conditions.
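The paper's own evaluation code is not shown here, but the two metrics named in the abstract are standard. As a minimal illustrative sketch (function names and binning choices are my own, not the authors'), ECE bins predictions by equal-width confidence intervals and averages the per-bin gap between accuracy and mean confidence, weighted by bin size; ACE does the same with equal-mass bins:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: weighted average over equal-width confidence bins of
    |accuracy - mean confidence| in each bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(confidences)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        acc = correct[mask].mean()
        conf = confidences[mask].mean()
        ece += (mask.sum() / n) * abs(acc - conf)
    return ece

def adaptive_calibration_error(confidences, correct, n_bins=10):
    """ACE: same gap, but computed over equal-mass bins (each bin
    holds roughly the same number of predictions), averaged uniformly."""
    order = np.argsort(confidences)
    conf_sorted = np.asarray(confidences, dtype=float)[order]
    corr_sorted = np.asarray(correct, dtype=float)[order]
    ace = 0.0
    for idx in np.array_split(np.arange(len(conf_sorted)), n_bins):
        if len(idx) == 0:
            continue
        ace += abs(corr_sorted[idx].mean() - conf_sorted[idx].mean()) / n_bins
    return ace
```

A perfectly calibrated batch (mean confidence equals accuracy in every bin) yields ECE of 0; a uniformly overconfident batch (confidence 1.0 but only half correct) yields ECE of 0.5.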
Problem

Research questions and friction points this paper is trying to address.

RAG Technology
Medical Domain
Accuracy Determination
Innovation

Methods, ideas, or system contributions that make the work stand out.

Retrieval-Augmented Generation
Medical Information Retrieval
Certainty Optimization
Shintaro Ozaki
Nara Institute of Science and Technology
Yuta Kato
The University of Tokyo
Siyuan Feng
The University of Tokyo
Masayo Tomita
The University of Tokyo
Kazuki Hayashi
Nara Institute of Science and Technology
Ryoma Obara
NEC Corporation
Masafumi Oyamada
Chief Scientist, NEC Corporation
Self-Improving AIs, Large Language Models, Knowledge Management
Katsuhiko Hayashi
The University of Tokyo
Hidetaka Kamigaito
Nara Institute of Science and Technology (NAIST)
Natural Language Processing
Taro Watanabe
Nara Institute of Science and Technology
Machine Translation, Machine Learning