🤖 AI Summary
In retrieval-augmented generation (RAG) systems, unreliable confidence estimates from large language models (LLMs) hinder high-stakes decision-making in domains such as finance and healthcare. To address this, we propose a lightweight uncertainty modeling method grounded in feed-forward network (FFN) activations: we directly use raw FFN activations from layer 16 of Llama 3.1 8B as autoregressive confidence signals, bypassing the output projection and softmax to preserve information fidelity. Confidence prediction is formulated as a sequence classification task and optimized with a Huber loss to enhance robustness against noisy human annotations. Evaluated on a real-world financial customer-service benchmark under strict latency constraints, our approach significantly outperforms strong baselines. Crucially, it achieves high accuracy and low inference latency using activations from only a single layer, with no auxiliary modules or fine-tuning of the base model. This work establishes a deployable, architecture-aware confidence estimation paradigm for trustworthy RAG.
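For context on the Huber loss mentioned above, here is a minimal sketch (not from the paper; the threshold `delta` is an assumed default): the loss is quadratic for small residuals and linear for large ones, so a mislabeled example contributes a bounded penalty instead of dominating training the way a squared error would.

```python
import numpy as np

def huber_loss(residual: np.ndarray, delta: float = 1.0) -> np.ndarray:
    """Quadratic for |r| <= delta, linear beyond, so outliers
    (e.g. noisy human correctness labels) are penalized only linearly."""
    abs_r = np.abs(residual)
    quadratic = 0.5 * residual ** 2
    linear = delta * (abs_r - 0.5 * delta)
    return np.where(abs_r <= delta, quadratic, linear)

residuals = np.array([0.1, 0.5, 4.0])
print(huber_loss(residuals))  # the 4.0 outlier costs 3.5 rather than 8.0
```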
📝 Abstract
We propose a method for confidence estimation in retrieval-augmented generation (RAG) systems that aligns closely with the correctness of large language model (LLM) outputs. Confidence estimation is especially critical in high-stakes domains such as finance and healthcare, where the cost of an incorrect answer far outweighs the cost of abstaining. Our approach extends prior uncertainty quantification methods by leveraging raw feed-forward network (FFN) activations as autoregressive signals, avoiding the information loss that token logits and probabilities incur after projection and softmax normalization. We model confidence prediction as a sequence classification task and regularize training with a Huber loss term to improve robustness against noisy supervision. Applied in a real-world financial customer-support setting with complex knowledge bases, our method outperforms strong baselines and maintains high accuracy under strict latency constraints. Experiments with the Llama 3.1 8B model show that using activations from only the 16th layer preserves accuracy while reducing response latency. Our results demonstrate that activation-based confidence modeling offers a scalable, architecture-aware path toward trustworthy RAG deployment.