Can LLMs Use Linguistic Uncertainty Markers to Reliably Reflect Intrinsic Confidence?

📅 2026-05-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the inconsistency and instability between large language models’ verbal expressions of uncertainty (phrases such as “very likely”) and their intrinsic uncertainty. It introduces the concept of “marker internal confidence” (MIC), the estimated intrinsic confidence a model associates with a specific epistemic marker in a given task domain, and proposes an evaluation framework of seven metrics to assess the stability of MICs within and across distributions from a model-centric perspective. Through quantitative analyses, cross-task and cross-distribution experiments, and methods aligning linguistic markers with intrinsic confidence estimates, the study finds that large language models remain miscalibrated even under a model-centric interpretation of marker meanings. They struggle to differentiate markers by internal confidence across distributions, although they preserve a somewhat consistent ranking order across tasks.

📝 Abstract

LLMs' linguistically expressed confidence should faithfully reflect their intrinsic uncertainty. While recent work shows LLMs struggle to use epistemic markers (e.g., "it is likely...") in a human-aligned fashion, it remains unclear whether models can apply their own linguistic confidence framework to associate markers with specific confidence levels in a stable and generalizable way, and how contextual features impact this ability. We conduct the first systematic study of this question, formalizing _marker internal confidence_ (MIC) as the estimated intrinsic confidence a model associates with a specific epistemic marker in a given task domain. We present 7 metrics to evaluate the stability of MICs within and across distributions. Applying our analysis framework to diverse models and tasks, we find that LLMs remain faithfully miscalibrated even under model-centric interpretation of marker meanings, struggling to differentiate markers by internal confidence across distributions despite preserving a somewhat consistent ranking order across tasks. This supplies critical, complementary evidence to existing work toward a holistic understanding of faithful calibration in LLMs, emphasizing the need for more aligned and stable marker use to improve trustworthiness and reliability.
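
To make the MIC definition above concrete, here is a minimal Python sketch. It assumes, purely for illustration, that a marker's internal confidence in one task domain can be approximated by the empirical accuracy of the responses that used that marker, and that cross-task ranking consistency can be checked by counting marker pairs ordered the same way in two domains. The abstract does not specify the paper's estimator or its seven metrics, so the function names `estimate_mic` and `rank_agreement` and the toy records are hypothetical.

```python
from collections import defaultdict


def estimate_mic(records):
    """Approximate per-marker confidence as empirical accuracy.

    `records` is an iterable of (marker, is_correct) pairs from ONE task
    domain. Assumption: accuracy stands in for intrinsic confidence; the
    paper may use a different estimator.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for marker, is_correct in records:
        totals[marker] += 1
        hits[marker] += int(is_correct)
    return {marker: hits[marker] / totals[marker] for marker in totals}


def rank_agreement(mic_a, mic_b):
    """Fraction of shared marker pairs ordered the same way in both domains.

    Ties in either domain count as disagreement. Returns None when fewer
    than two markers are shared. Hypothetical metric, not one of the
    paper's seven.
    """
    shared = sorted(set(mic_a) & set(mic_b))
    pairs = [(x, y) for i, x in enumerate(shared) for y in shared[i + 1:]]
    if not pairs:
        return None
    same = sum(
        (mic_a[x] - mic_a[y]) * (mic_b[x] - mic_b[y]) > 0 for x, y in pairs
    )
    return same / len(pairs)


# Toy, made-up records for two task domains.
qa_records = [
    ("likely", True), ("likely", False),
    ("certainly", True), ("certainly", True),
]
math_records = [
    ("likely", True), ("likely", False), ("likely", False), ("likely", False),
    ("certainly", True), ("certainly", False),
]

mic_qa = estimate_mic(qa_records)
mic_math = estimate_mic(math_records)
agreement = rank_agreement(mic_qa, mic_math)
```

In this toy data the two markers keep the same order in both domains while their estimated confidence values differ, which mirrors the distinction the abstract draws between a somewhat consistent ranking across tasks and unstable marker-level confidence across distributions.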

Problem

Research questions and friction points this paper is trying to address.

linguistic uncertainty markers

intrinsic confidence

faithful calibration

epistemic markers

large language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

marker internal confidence

epistemic markers

LLM calibration