AI Summary
Existing LLM uncertainty quantification (UQ) methods focus on single-turn question answering and fail to characterize how uncertainty propagates through multi-step autonomous decision-making. This work addresses that gap by first disentangling the uncertainty of a multi-step agent decision into *aleatoric* (intrinsic) and *epistemic* (extrinsic) components, and then introducing UProp, the first UQ framework explicitly designed for multi-step decision processes. UProp models Trajectory-Dependent Decision Processes (TDPs) and estimates extrinsic uncertainty via Pointwise Mutual Information (PMI), overcoming fundamental limitations of single-turn UQ. Evaluated on multi-step benchmarks including AgentBench and HotpotQA, UProp consistently outperforms single-turn UQ baselines while remaining sample-efficient and yielding interpretable uncertainty propagation across intermediate steps. Crucially, UProp is model-agnostic and integrates seamlessly with state-of-the-art LLMs such as GPT-4.1 and DeepSeek-V3.
Abstract
As Large Language Models (LLMs) are integrated into safety-critical applications involving sequential decision-making in the real world, it is essential to know when to trust LLM decisions. Existing LLM Uncertainty Quantification (UQ) methods are primarily designed for single-turn question-answering formats, leaving multi-step decision-making scenarios, e.g., LLM agentic systems, underexplored. In this paper, we introduce a principled, information-theoretic framework that decomposes LLM sequential decision uncertainty into two parts: (i) intrinsic uncertainty internal to the current decision, which existing UQ methods focus on, and (ii) extrinsic uncertainty, a Mutual-Information (MI) quantity describing how much uncertainty should be inherited from preceding decisions. We then propose UProp, an efficient and effective extrinsic uncertainty estimator that converts the direct estimation of MI into the estimation of Pointwise Mutual Information (PMI) over multiple Trajectory-Dependent Decision Processes (TDPs). UProp is evaluated over extensive multi-step decision-making benchmarks, e.g., AgentBench and HotpotQA, with state-of-the-art LLMs, e.g., GPT-4.1 and DeepSeek-V3. Experimental results demonstrate that UProp significantly outperforms existing single-turn UQ baselines equipped with thoughtful aggregation strategies. Moreover, we provide a comprehensive analysis of UProp, including sampling efficiency, potential applications, and intermediate uncertainty propagation, to demonstrate its effectiveness. Code will be available at https://github.com/jinhaoduan/UProp.
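The intrinsic/extrinsic decomposition above can be sketched numerically: intrinsic uncertainty is the entropy of the current-step action distribution conditioned on the trajectory so far, while the extrinsic (MI) term can be Monte Carlo estimated as an average of PMI values over sampled trajectory-action pairs. The function names, probability values, and sampling setup below are illustrative assumptions, not the paper's actual implementation.

```python
import math

def pmi(p_action_given_traj: float, p_action: float) -> float:
    """Pointwise mutual information: log p(a | tau) - log p(a).
    Positive when the preceding trajectory makes the action more likely
    than its marginal probability."""
    return math.log(p_action_given_traj) - math.log(p_action)

def extrinsic_uncertainty(samples):
    """Monte Carlo estimate of the MI term I(a_t; tau_{<t}) as the mean
    PMI over sampled pairs. Each sample is a (p(a|tau), p(a)) tuple,
    e.g. obtained by scoring the same action under different sampled
    trajectories versus the trajectory-marginalized distribution."""
    return sum(pmi(pc, pm) for pc, pm in samples) / len(samples)

def intrinsic_uncertainty(action_dist):
    """Entropy H(a_t | tau_{<t}) of the current-step action distribution."""
    return -sum(p * math.log(p) for p in action_dist if p > 0)

# Toy numbers (assumed): the trajectory shifts action probabilities
# away from their marginals, so the extrinsic term is positive.
samples = [(0.9, 0.5), (0.8, 0.5), (0.7, 0.5)]
total = intrinsic_uncertainty([0.9, 0.1]) + extrinsic_uncertainty(samples)
```

In this toy setup the total step uncertainty is the sum of the two terms; when the trajectory carries no information about the current action, every PMI term is zero and the estimate reduces to the single-turn (intrinsic) entropy alone.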