🤖 AI Summary
Estimating epistemic uncertainty in large language models is typically computationally expensive, limiting its applicability in safety-critical scenarios. This work proposes an efficient paradigm that leverages a small draft model to estimate token-level epistemic uncertainty through knowledge distillation combined with a bias-variance decomposition: Jensen–Shannon divergence among drafts approximates the variance, while KL divergence between the draft mixture and the target captures the bias. To enhance accuracy, the approach integrates Online Stochastic Distillation (OSD) and a Data-Diverse Drafts (DDD) strategy. Notably, it avoids the need for full ensembles, achieving up to a 37% reduction in RMSE on GSM8K while matching the hallucination detection performance of high-overhead methods like TokUR, all with negligible inference overhead.
📝 Abstract
Quantifying uncertainty in Large Language Models (LLMs) is essential for mitigating hallucinations and enabling risk-aware deployment in safety-critical tasks. However, estimating Epistemic Uncertainty (EU) via Deep Ensembles is computationally prohibitive at the scale of modern models. We propose a framework that leverages small draft models to efficiently estimate token-level EU, bypassing the need for full-scale ensembling. Theoretically grounded in a bias-variance decomposition, our approach approximates EU via the Jensen-Shannon divergence among drafts (variance proxy) and the KL divergence between the draft mixture and the target (bias proxy). To further ensure accuracy without significant overhead, we introduce Online Stochastic Distillation (OSD) to efficiently approximate target aggregation and the Data-Diverse Drafts (DDD) strategy to enhance draft diversity for better target approximation. Extensive experiments on GSM8K demonstrate that our method reduces the estimation error (RMSE) by up to 37% compared to baselines. Crucially, our approach achieves hallucination detection performance competitive with heavy perturbation-based methods like TokUR while incurring negligible inference costs, offering a practical solution for uncertainty-aware LLM deployment.
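To make the decomposition concrete, the token-level EU proxy described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the function names, the uniform mixture weights, and the toy distributions are all hypothetical. The variance term is the generalized Jensen-Shannon divergence among the drafts' next-token distributions, and the bias term is the KL divergence from the draft mixture to the target model's distribution.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL divergence KL(p || q) between two discrete distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    return float(np.sum(p * np.log(p / q)))

def epistemic_uncertainty(draft_dists, target_dist):
    """Token-level EU proxy = variance term + bias term.

    variance: generalized Jensen-Shannon divergence among the drafts,
              i.e. the mean KL from each draft to their (uniform) mixture.
    bias:     KL from the draft mixture to the target distribution.
    """
    drafts = np.asarray(draft_dists, dtype=float)
    mixture = drafts.mean(axis=0)                          # uniform draft mixture
    variance = np.mean([kl(d, mixture) for d in drafts])   # JSD (variance proxy)
    bias = kl(mixture, target_dist)                        # bias proxy
    return variance + bias

# Toy next-token distributions over a 4-token vocabulary (hypothetical values).
drafts = [
    [0.70, 0.10, 0.10, 0.10],
    [0.60, 0.20, 0.10, 0.10],
    [0.50, 0.30, 0.10, 0.10],
]
target = [0.65, 0.15, 0.10, 0.10]
eu = epistemic_uncertainty(drafts, target)
```

When all drafts agree with each other and with the target, both terms vanish and the EU estimate is zero; disagreement among drafts inflates the variance term, while a systematic gap between the draft mixture and the target inflates the bias term.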