Understanding the Impact of Confidence in Retrieval Augmented Generation: A Case Study in the Medical Domain

📅 2024-12-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the calibration problem—i.e., the alignment between output confidence scores and actual accuracy—in Retrieval-Augmented Generation (RAG) systems for healthcare applications. Motivated by the lack of quantitative analysis on confidence reliability in medical settings, we conduct the first systematic evaluation of confidence calibration across multiple large language models (LLaMA-3, Qwen2), retrieval configurations (BM25, DPR), and prompt templates, using Expected Calibration Error (ECE) and Adaptive Calibration Error (ACE) as primary metrics. Results reveal pervasive overconfidence, with calibration performance critically dependent on model architecture, retrieval ranking quality, and prompt design. Notably, we demonstrate that the ordering of retrieved documents can actively modulate generation confidence. Our work establishes a reproducible calibration assessment framework and provides actionable configuration guidelines—spanning model selection, retrieval strategy, and prompting—for trustworthy deployment of high-stakes medical RAG systems.

📝 Abstract
Retrieval Augmented Generation (RAG) complements the knowledge of Large Language Models (LLMs) by leveraging external information to enhance response accuracy. Because it can inject the most up-to-date information, this approach is widely applied across fields, and researchers are working to understand and improve it to unlock the full potential of RAG in high-stakes applications. However, the mechanisms behind the confidence levels of RAG outputs remain underexplored, even though output confidence is critical in domains such as finance, healthcare, and medicine. Our study examines the impact of RAG on confidence in the medical domain under various configurations and models. We evaluate confidence by treating the model's predicted probability as its output and computing Expected Calibration Error (ECE) and Adaptive Calibration Error (ACE) scores from the probabilities and accuracy. In addition, we analyze whether the order of retrieved documents within prompts affects confidence calibration. Our findings reveal large variation in confidence and accuracy depending on the model, settings, and format of input prompts. These results underscore the necessity of optimizing configurations for the specific model and conditions.
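The paper's own evaluation code is not shown here, but the two metrics named in the abstract are standard. As a minimal illustrative sketch (function names and binning choices are my own, not the authors'), ECE bins predictions by equal-width confidence intervals and averages the per-bin gap between accuracy and mean confidence, weighted by bin size; ACE does the same with equal-mass bins:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: weighted average over equal-width confidence bins of
    |accuracy - mean confidence| in each bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(confidences)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        acc = correct[mask].mean()
        conf = confidences[mask].mean()
        ece += (mask.sum() / n) * abs(acc - conf)
    return ece

def adaptive_calibration_error(confidences, correct, n_bins=10):
    """ACE: same gap, but computed over equal-mass bins (each bin
    holds roughly the same number of predictions), averaged uniformly."""
    order = np.argsort(confidences)
    conf_sorted = np.asarray(confidences, dtype=float)[order]
    corr_sorted = np.asarray(correct, dtype=float)[order]
    ace = 0.0
    for idx in np.array_split(np.arange(len(conf_sorted)), n_bins):
        if len(idx) == 0:
            continue
        ace += abs(corr_sorted[idx].mean() - conf_sorted[idx].mean()) / n_bins
    return ace
```

A perfectly calibrated batch (mean confidence equals accuracy in every bin) yields ECE of 0; a uniformly overconfident batch (confidence 1.0 but only half correct) yields ECE of 0.5.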
Problem

Research questions and friction points this paper is trying to address.

RAG Technology
Medical Domain
Accuracy Determination
Innovation

Methods, ideas, or system contributions that make the work stand out.

Retrieval-Augmented Generation
Medical Information Retrieval
Certainty Optimization
Shintaro Ozaki
Nara Institute of Science and Technology
Yuta Kato
The University of Tokyo
Siyuan Feng
The University of Tokyo
Masayo Tomita
The University of Tokyo
Kazuki Hayashi
Nara Institute of Science and Technology
Ryoma Obara
NEC Corporation
Masafumi Oyamada
Chief Scientist, NEC Corporation
Self-Improving AIs, Large Language Models, Knowledge Management
Katsuhiko Hayashi
The University of Tokyo
Hidetaka Kamigaito
Nara Institute of Science and Technology (NAIST)
Natural Language Processing
Taro Watanabe
Nara Institute of Science and Technology
Machine Translation, Machine Learning