🤖 AI Summary
This paper addresses the semantic ambiguity and calibration difficulty of linguistic certainty expressions (e.g., “possible”, “very likely”) by modeling their semantics as probability distributions over the probability simplex, replacing conventional scalar confidence scores. Methodologically, it (i) is the first to formally represent certainty expressions as distributions over the simplex; (ii) generalizes the notion of calibration error to this distributional setting; (iii) proposes a post-hoc calibration algorithm based on distribution mapping; and (iv) establishes a framework for analyzing calibration in human–AI collaboration. Experiments with radiologists and large language models enable quantitative, interpretable calibration assessment across both kinds of subjects and yield actionable suggestions for improving calibration. The results show significant gains in the semantic consistency of uncertainty expressions and in inter-subject reliability, pointing toward a new paradigm for trustworthy human–AI collaborative decision-making.
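To make the first two contributions concrete, here is a minimal sketch of the idea: a certainty phrase is represented not by one number but by a distribution on the simplex (for binary outcomes, a Beta distribution over the probability of being correct), and miscalibration is measured as a distance between that distribution and the empirically observed outcomes. The phrase names, Beta parameters, and the use of a 1-Wasserstein distance are illustrative assumptions, not the paper's actual choices.

```python
import numpy as np
from scipy import stats

# Illustrative semantics (assumed, not from the paper): each certainty
# phrase is a Beta distribution over the probability that a claim hedged
# with it is true -- i.e., a distribution on the 1-simplex.
PHRASE_SEMANTICS = {
    "maybe":  stats.beta(5.0, 5.0),   # mass concentrated near 0.5
    "likely": stats.beta(8.0, 2.0),   # mass concentrated near 0.8
}

def distributional_miscalibration(phrase, outcomes, n_grid=1001):
    """1-Wasserstein distance between the phrase's semantic distribution
    and a point mass at the empirically observed success rate.

    `outcomes`: 0/1 correctness labels for claims hedged with `phrase`.
    """
    grid = np.linspace(0.0, 1.0, n_grid)
    model_cdf = PHRASE_SEMANTICS[phrase].cdf(grid)
    # Empirical distribution here is a point mass at the success rate,
    # whose CDF steps from 0 to 1 at that rate.
    empirical_cdf = (grid >= np.mean(outcomes)).astype(float)
    # W1 distance = integral over [0, 1] of |F_model - F_empirical|;
    # the grid mean approximates that integral on the unit interval.
    return float(np.mean(np.abs(model_cdf - empirical_cdf)))
```

A scalar confidence score would collapse `PHRASE_SEMANTICS["maybe"]` to 0.5 and lose the spread that distinguishes a vague hedge from a precise one; the distributional distance retains it.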
📝 Abstract
We present a novel approach to calibrating linguistic expressions of certainty, e.g., "Maybe" and "Likely". Unlike prior work that assigns a single score to each certainty phrase, we model uncertainty as distributions over the simplex to capture their semantics more accurately. To accommodate this new representation of certainty, we generalize existing measures of miscalibration and introduce a novel post-hoc calibration method. Leveraging these tools, we analyze the calibration of both humans (e.g., radiologists) and computational models (e.g., language models) and provide interpretable suggestions to improve their calibration.
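The post-hoc calibration step the abstract mentions could, under one simple reading (an assumption for illustration, not the paper's actual algorithm), map each phrase's prior distribution to a posterior that incorporates the outcomes observed for that phrase:

```python
from scipy import stats

def recalibrate_phrase(prior_a, prior_b, outcomes):
    """Map a phrase's Beta(prior_a, prior_b) semantics to the posterior
    Beta(prior_a + successes, prior_b + failures).

    A minimal Bayesian stand-in for a distribution-mapping post-hoc
    calibration step; `outcomes` are 0/1 correctness labels observed
    for claims hedged with the phrase.
    """
    successes = sum(outcomes)
    failures = len(outcomes) - successes
    return stats.beta(prior_a + successes, prior_b + failures)
```

For example, if "Likely" starts at Beta(8, 2) (mean 0.8) but the hedged claims turn out true only 20% of the time, the remapped distribution shifts toward the observed rate, which is the kind of correction a post-hoc calibrator should produce.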