🤖 AI Summary
This work addresses the inconsistency and instability between the uncertainty that large language models express verbally (for example, with phrases like “very likely”) and their intrinsic uncertainty. It introduces the concept of “marker internal confidence” (MIC), the estimated intrinsic confidence a model associates with a specific epistemic marker in a given task domain, and proposes an evaluation framework of seven metrics that assess the stability of MICs within and across distributions from a model-centric perspective. Applying the framework to diverse models and tasks, the study finds that large language models remain miscalibrated even under a model-centric interpretation of marker meanings: they struggle to differentiate markers by internal confidence across distributions, although they preserve a somewhat consistent ranking order across tasks.
📝 Abstract
LLMs' linguistically expressed confidence should faithfully reflect their intrinsic uncertainty. While recent work shows LLMs struggle to use epistemic markers (e.g., "it is likely...") in a human-aligned fashion, it remains unclear whether models can apply their own linguistic confidence framework to associate markers with specific confidence levels in a stable and generalizable way, and how contextual features impact this ability. We conduct the first systematic study of this question, formalizing _marker internal confidence_ (MIC) as the estimated intrinsic confidence a model associates with a specific epistemic marker in a given task domain. We present 7 metrics to evaluate the stability of MICs within and across distributions. Applying our analysis framework to diverse models and tasks, we find that LLMs remain faithfully miscalibrated even under a model-centric interpretation of marker meanings, struggling to differentiate markers by internal confidence across distributions despite preserving a somewhat consistent ranking order across tasks. This supplies critical, complementary evidence to existing work toward a holistic understanding of faithful calibration in LLMs, emphasizing the need for more aligned and stable marker use to improve trustworthiness and reliability.
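To make the MIC definition concrete, the following is a minimal sketch of one possible reading, not the paper's estimator. It assumes that each model response has already been tagged with the epistemic marker it used and with some intrinsic confidence score in [0, 1]; how that score is obtained is not specified in the abstract. The function names `estimate_mic` and `marker_ranking` and the toy data are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def estimate_mic(records):
    """Average the intrinsic confidence observed for each marker.

    `records` is an iterable of (marker, intrinsic_confidence) pairs
    collected within one task domain. Averaging is an assumption made
    for illustration; the source does not state the estimator.
    """
    by_marker = defaultdict(list)
    for marker, confidence in records:
        by_marker[marker].append(confidence)
    return {marker: mean(values) for marker, values in by_marker.items()}


def marker_ranking(mic):
    """Order markers from highest to lowest estimated MIC."""
    return sorted(mic, key=mic.get, reverse=True)


# Toy, made-up observations for two hypothetical task domains.
task_a = [("certainly", 1.0), ("certainly", 0.5), ("possibly", 0.25)]
task_b = [("certainly", 0.75), ("possibly", 0.5), ("possibly", 0.5)]

mic_a = estimate_mic(task_a)
mic_b = estimate_mic(task_b)

# The ranking can agree across tasks even when the MIC values differ.
same_ranking = marker_ranking(mic_a) == marker_ranking(mic_b)
```

In this toy data the two tasks order the markers the same way while assigning them different confidence levels, which is the kind of pattern the abstract describes: a somewhat consistent ranking without stable marker-to-confidence values.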