🤖 AI Summary
This study addresses the problem that users of large language models (LLMs) often overtrust and accept incorrect responses, while existing uncertainty displays reduce uncertainty to a single score or phrase and so miss variation in reliability within a response. It investigates how three levels of uncertainty granularity (output-level, relation-level, and token-level) affect users’ agreement with the AI, confidence in their own answers, and verification behavior. Combining uncertainty quantification, multi-granular uncertainty displays, and a large-scale between-subjects experiment (N=192) on medical question answering, the research shows that each granularity affects users differently. Token-level uncertainty increased users’ agreement with the AI, whereas output- and relation-level uncertainty instead reduced users’ confidence in their own answers. Notably, relation-level uncertainty also reduced external verification, such as internet searches and checking provided URLs. These findings offer empirical grounding and design guidance for expressing uncertainty in responsible LLM systems.
📝 Abstract
Despite warnings that LLMs can make mistakes, users often develop inappropriate trust and accept incorrect answers without critical evaluation. Uncertainty quantification (UQ), which surfaces an LLM's confidence to the user, has emerged as a promising approach to calibrate user trust. However, prior empirical studies on uncertainty communication have treated uncertainty as a single numerical score or a simple natural-language expression. This simplification fails to capture a key property of LLM outputs: a single response often comprises multiple claims and reasoning steps, each with distinct levels of uncertainty. To address this gap, this study investigates uncertainty granularity (i.e., the extent to which uncertainty is expressed at different levels within an LLM response) and examines its impact on LLM-assisted decision-making. We conducted a large-scale, between-subjects study (N=192) in which participants answered medical questions using LLMs that displayed uncertainty at three different granularities: output-level (entire response), relation-level (individual reasoning steps), and token-level (specific words). Our findings reveal distinct behavioral effects as a function of uncertainty granularity. Token-level uncertainty increased users' agreement with the AI, whereas output- and relation-level uncertainty did not increase agreement but instead reduced users' confidence in their own answers. Notably, relation-level uncertainty also reduced external verification (i.e., internet searches and checking provided URLs), steering users away from independent fact-checking and toward reliance on the LLM and its accompanying uncertainty cues. Our findings demonstrate that uncertainty granularity significantly shapes how users interact with and verify LLM outputs, providing concrete design guidance for building responsible LLM applications that encourage appropriate skepticism and verification behaviors.
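To make the three granularities concrete, the sketch below shows one way a single response could carry uncertainty at the token, relation, and output levels. It is an illustration only: the abstract does not say how the study computed its scores, so the use of `1 - p` per token, mean aggregation, and every name here (`token_uncertainty`, `relation_uncertainty`, `output_uncertainty`, `step_spans`) are assumptions, not the authors' method.

```python
from statistics import mean

# Hypothetical sketch: NOT the study's method. Assumes per-token
# probabilities are available and that uncertainty = 1 - probability.

def token_uncertainty(token_probs):
    """Token-level: one score per token (specific words)."""
    return [1.0 - p for p in token_probs]

def relation_uncertainty(token_scores, step_spans):
    """Relation-level: one score per reasoning step.

    step_spans is a list of (start, end) token index pairs, end exclusive,
    marking which tokens belong to each step (an assumed segmentation).
    """
    return [mean(token_scores[start:end]) for start, end in step_spans]

def output_uncertainty(token_scores):
    """Output-level: a single score for the entire response."""
    return mean(token_scores)

# Toy response with four tokens split into two reasoning steps.
probs = [0.9, 0.8, 0.5, 0.7]
spans = [(0, 2), (2, 4)]

tok = token_uncertainty(probs)
rel = relation_uncertainty(tok, spans)
out = output_uncertainty(tok)
print(tok, rel, out)
```

In this toy case the single output-level score hides the fact that the second step is markedly less certain than the first, which is the kind of within-response variation the abstract argues a single score cannot convey.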