BrowseConf: Confidence-Guided Test-Time Scaling for Web Agents

📅 2025-10-27
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This study presents the first systematic investigation of whether large language model (LLM)-based search agents can verbalize their own confidence during complex, multi-turn interactions, a substantially harder task than single-step confidence calibration. To address it, the authors propose Confidence-Guided Test-Time Expansion (CG-TTE), a training-free method that uses generated verbalized confidence scores to dynamically trigger retries and reasoning-path expansion. CG-TTE is compatible with open-source agent frameworks and requires no fine-tuning. Experiments demonstrate substantial improvements: high-confidence predictions achieve markedly higher accuracy while low-confidence predictions show near-zero accuracy, and average token consumption decreases by 37.2% without compromising task performance. The core contributions are threefold: (1) formalizing and empirically validating the expressibility of confidence in multi-turn interactive settings; (2) establishing the first dynamic reasoning-control paradigm grounded in verbalized confidence; and (3) demonstrating its efficacy in enhancing both the reliability and efficiency of LLM search agents.

๐Ÿ“ Abstract
Confidence in LLMs is a useful indicator of model uncertainty and answer reliability. Existing work has mainly focused on single-turn scenarios, while research on confidence in complex multi-turn interactions remains limited. In this paper, we investigate whether LLM-based search agents can communicate their own confidence through verbalized confidence scores after long sequences of actions, a significantly more challenging task than outputting confidence in a single interaction. Experimenting on open-source agentic models, we first find that models exhibit much higher task accuracy at high confidence and near-zero accuracy when confidence is low. Based on this observation, we propose Test-Time Scaling (TTS) methods that use confidence scores to judge answer quality, encouraging the model to try again until it reaches a satisfactory confidence level. Results show that our proposed methods significantly reduce token consumption while achieving competitive performance compared to baseline fixed-budget TTS methods.
Problem

Research questions and friction points this paper is trying to address.

Investigating LLM confidence communication in multi-turn web interactions
Proposing test-time scaling methods using confidence-guided retry mechanisms
Reducing token consumption while maintaining competitive agent performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Test-Time Scaling uses verbalized confidence to judge answer quality
Models retry until reaching a satisfactory confidence level
The method reduces token consumption while maintaining competitive performance
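The confidence-guided retry loop behind these contributions can be sketched in a few lines. Note that `run_agent`, the 0.8 acceptance threshold, and the 3-attempt budget are illustrative assumptions for this sketch, not the paper's actual implementation:

```python
import random

CONF_THRESHOLD = 0.8  # hypothetical acceptance threshold (not from the paper)
MAX_ATTEMPTS = 3      # hypothetical retry budget (not from the paper)

def run_agent(task: str, seed: int) -> tuple[str, float]:
    """Stand-in for one multi-turn search-agent rollout.

    A real implementation would run the full tool-use loop and prompt the
    model to verbalize a confidence score (here simulated in [0, 1])
    alongside its final answer.
    """
    rng = random.Random(seed)
    answer = f"answer-{seed}"
    confidence = rng.random()  # simulated verbalized confidence
    return answer, confidence

def confidence_guided_tts(task: str) -> tuple[str, float]:
    """Retry until verbalized confidence clears the threshold, keeping the
    highest-confidence answer seen within the attempt budget."""
    best_answer, best_conf = None, -1.0
    for attempt in range(MAX_ATTEMPTS):
        answer, conf = run_agent(task, seed=attempt)
        if conf > best_conf:
            best_answer, best_conf = answer, conf
        if conf >= CONF_THRESHOLD:
            break  # confident enough: stop early and save tokens
    return best_answer, best_conf
```

Because high-confidence rollouts terminate after the first attempt, the expected token cost falls relative to a fixed-budget TTS baseline that always spends all attempts.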