AI Summary
This study systematically investigates, for the first time, the ability of large language model (LLM)-based search agents to verbalize confidence during complex, multi-turn interactions, a significantly more challenging task than single-step confidence calibration. To address this, we propose Confidence-Guided Test-Time Expansion (CG-TTE), a training-free method that dynamically triggers retries and reasoning-path expansion based on generated verbalized confidence scores. CG-TTE is compatible with open-source agent frameworks and requires no fine-tuning. Experiments demonstrate substantial improvements: high-confidence predictions achieve markedly higher accuracy, while low-confidence predictions exhibit near-zero accuracy; moreover, average token consumption decreases by 37.2% without compromising task performance. Our core contributions are threefold: (1) formalizing and empirically validating the expressibility of confidence in multi-turn interactive settings; (2) establishing the first dynamic reasoning-control paradigm grounded in verbalized confidence; and (3) demonstrating its efficacy in enhancing both the reliability and efficiency of LLM search agents.
Abstract
Confidence in LLMs is a useful indicator of model uncertainty and answer reliability. Existing work has mainly focused on single-turn scenarios, while research on confidence in complex multi-turn interactions remains limited. In this paper, we investigate whether LLM-based search agents can communicate their own confidence through verbalized confidence scores after long sequences of actions, a significantly more challenging task than outputting confidence in a single interaction. Experimenting with open-source agentic models, we first find that models exhibit much higher task accuracy at high confidence and near-zero accuracy when confidence is low. Based on this observation, we propose Test-Time Scaling (TTS) methods that use confidence scores to judge answer quality, encouraging the model to try again until it reaches a satisfactory confidence level. Results show that our proposed methods significantly reduce token consumption while achieving competitive performance compared to baseline fixed-budget TTS methods.
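The retry mechanism described above can be sketched as a simple control loop. This is a minimal illustration, not the paper's implementation: `run_agent` is a hypothetical callable returning an `(answer, verbalized_confidence)` pair, and the threshold and retry budget are placeholder values.

```python
def confidence_guided_retry(run_agent, question, threshold=0.8, max_retries=3):
    """Re-run the agent until its verbalized confidence is satisfactory.

    Sketch only: `run_agent(question) -> (answer, confidence)` is assumed;
    `threshold` and `max_retries` are illustrative, not from the paper.
    """
    best_answer, best_conf = None, -1.0
    for _ in range(max_retries):
        answer, conf = run_agent(question)
        # Keep the highest-confidence attempt seen so far.
        if conf > best_conf:
            best_answer, best_conf = answer, conf
        # Stop early once confidence is satisfactory, saving tokens
        # relative to a fixed-budget test-time-scaling baseline.
        if conf >= threshold:
            break
    return best_answer, best_conf
```

Because the loop exits as soon as confidence clears the threshold, easy questions consume a single attempt while hard ones draw on the remaining retry budget, which is the intuition behind the reported token savings.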