🤖 AI Summary
Neural NLP models often suffer from poor calibration, exhibiting overconfidence in incorrect predictions and thereby limiting their deployment in high-stakes applications. This work proposes a lightweight, inference-time uncertainty-aware attention mechanism that leverages Monte Carlo dropout to approximate Bayesian inference, estimating token-level epistemic uncertainty and dynamically modulating the self-attention weights of pretrained transformers, without altering the model architecture or training objective. Additionally, the authors introduce an inter-layer variance decomposition method to analyze how uncertainty accumulates across transformer layers. Experimental results demonstrate that the approach reduces expected calibration error by approximately 20% on average across SQuAD 2.0, MNLI, and SST-2, while preserving task accuracy and significantly enhancing selective prediction performance and robustness under distributional shift.
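The inter-layer variance decomposition can be illustrated with a minimal sketch. This is an assumption-laden toy, not the authors' exact formulation: it supposes a scalar readout (e.g., the max softmax probability) is collected at each layer across MC-dropout passes, and attributes to each layer the increment in predictive variance over the previous one. The helper name `layerwise_variance` is hypothetical.

```python
def layerwise_variance(samples_per_layer):
    """Decompose predictive variance across transformer depth (toy sketch).

    samples_per_layer: list over layers; each entry is a list of stochastic
    scalar predictions obtained by reading out at that layer across
    MC-dropout forward passes.
    Returns (per-layer variance, per-layer increment over previous layer).
    """
    def var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    v = [var(xs) for xs in samples_per_layer]
    # Increment attributed to each layer: how much variance it adds on top
    # of the accumulated variance at the layer below.
    inc = [v[0]] + [v[i] - v[i - 1] for i in range(1, len(v))]
    return v, inc
```

A rising increment profile toward the top layers would indicate that uncertainty accumulates late in the network, which is the kind of diagnostic the summary attributes to the decomposition.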
📝 Abstract
Neural NLP models are often miscalibrated, assigning high confidence to incorrect predictions, which undermines selective prediction and high-stakes deployment. Post-hoc calibration methods adjust output probabilities but leave internal computation unchanged, while ensemble and Bayesian approaches improve uncertainty at substantial training or storage cost. We propose UAT-LITE, an inference-time framework that makes self-attention uncertainty-aware using approximate Bayesian inference via Monte Carlo dropout in pretrained transformer classifiers. Token-level epistemic uncertainty is estimated from stochastic forward passes and used to modulate self-attention during contextualization, without modifying pretrained weights or training objectives. We additionally introduce a layerwise variance decomposition to diagnose how predictive uncertainty accumulates across transformer depth. Across SQuAD 2.0 answerability, MNLI, and SST-2, UAT-LITE reduces Expected Calibration Error by approximately 20% on average relative to a fine-tuned BERT-base baseline while preserving task accuracy, and improves selective prediction and robustness under distribution shift.
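The core mechanism described above, estimating per-token uncertainty from stochastic dropout passes and using it to reweight attention, can be sketched in plain Python. This is a minimal illustration under stated assumptions, not UAT-LITE itself: the inverted-dropout scaling, the sample-variance estimator, and the linear penalty `beta * u` on attention logits are all illustrative choices, and both helper names are hypothetical.

```python
import math
import random

def mc_dropout_uncertainty(token_scores, p=0.1, passes=20, seed=0):
    """Estimate per-token epistemic uncertainty via Monte Carlo dropout.

    token_scores: per-token scalar scores from a (hypothetical) encoder.
    Dropout stays active at inference; each pass zeroes a score with
    probability p (with inverted-dropout rescaling), and uncertainty is
    the variance of each token's score across passes.
    """
    rng = random.Random(seed)
    keep = 1.0 - p
    samples = [
        [s * (0.0 if rng.random() < p else 1.0 / keep) for s in token_scores]
        for _ in range(passes)
    ]
    n = len(token_scores)
    means = [sum(run[i] for run in samples) / passes for i in range(n)]
    return [
        sum((run[i] - means[i]) ** 2 for run in samples) / passes
        for i in range(n)
    ]

def uncertainty_aware_attention(attn_logits, uncertainty, beta=1.0):
    """Penalize attention logits of uncertain tokens, then renormalize."""
    adjusted = [a - beta * u for a, u in zip(attn_logits, uncertainty)]
    # Numerically stable softmax over the adjusted logits.
    m = max(adjusted)
    exps = [math.exp(a - m) for a in adjusted]
    z = sum(exps)
    return [e / z for e in exps]
```

Because only the attention weights are modulated at inference time, a mechanism of this shape leaves pretrained weights and the training objective untouched, matching the deployment constraint the abstract emphasizes.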