🤖 AI Summary
Large language models (LLMs) suffer from hallucination and miscalibrated uncertainty under distributional shift, leading to under-coverage and unreliable prediction sets. To address this, we propose Domain-Shift-Aware Conformal Prediction (DS-CP), a conformal prediction framework that explicitly incorporates awareness of distribution shift. DS-CP dynamically reweights calibration examples using a context-aware distance metric that quantifies the similarity between the test prompt and each calibration sample, enabling adaptive uncertainty calibration of LLM outputs. We theoretically establish a finite-sample coverage guarantee under distributional shift. Experiments on benchmarks including MMLU demonstrate that DS-CP delivers significantly more reliable coverage than standard conformal prediction under substantial domain shifts while remaining computationally efficient. Our approach thus enhances both the reliability and the practical applicability of uncertainty quantification for LLMs.
📝 Abstract
Large language models have achieved impressive performance across diverse tasks. However, their tendency to produce overconfident and factually incorrect outputs, known as hallucinations, poses risks in real-world applications. Conformal prediction provides finite-sample, distribution-free coverage guarantees, but standard conformal prediction breaks down under domain shift, often leading to under-coverage and unreliable prediction sets. We propose Domain-Shift-Aware Conformal Prediction (DS-CP), a framework that adapts conformal prediction to large language models under domain shift by systematically reweighting calibration samples according to their proximity to the test prompt, thereby preserving validity while enhancing adaptivity. Theoretical analysis and experiments on the MMLU benchmark demonstrate that the proposed method delivers more reliable coverage than standard conformal prediction, especially under substantial distribution shifts, while maintaining efficiency. This provides a practical step toward trustworthy uncertainty quantification for large language models in real-world deployment.
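To make the reweighting idea concrete, the sketch below shows a generic weighted split-conformal procedure: calibration examples receive weights from a similarity kernel between the test prompt's embedding and each calibration embedding, and the prediction-set threshold is a weighted quantile of calibration nonconformity scores. This is a minimal illustration of the general technique, not the paper's implementation; the function names, the softmax-of-negative-distance kernel, and the temperature parameter `tau` are assumptions for the sake of the example.

```python
import numpy as np

def distance_weights(test_emb, calib_embs, tau=1.0):
    """Softmax weights from negative Euclidean distance between the test
    prompt embedding and each calibration embedding (an illustrative
    kernel; DS-CP's actual context-aware metric may differ)."""
    d = np.linalg.norm(calib_embs - test_emb, axis=1)
    w = np.exp(-d / tau)
    return w / w.sum()

def weighted_conformal_threshold(calib_scores, weights, alpha=0.1):
    """Smallest calibration score q at which the cumulative weight of
    scores <= q reaches 1 - alpha (a weighted empirical quantile)."""
    order = np.argsort(calib_scores)
    s, w = calib_scores[order], weights[order]
    cum = np.cumsum(w)
    idx = min(np.searchsorted(cum, 1 - alpha), len(s) - 1)
    return s[idx]

def prediction_set(candidate_scores, q):
    """Keep every candidate answer whose nonconformity score is <= q."""
    return [i for i, sc in enumerate(candidate_scores) if sc <= q]
```

Under no shift the weights are roughly uniform and the procedure reduces to standard split conformal prediction; under shift, calibration samples far from the test prompt are down-weighted, which is what restores coverage.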