Agentic Test-Time Scaling for WebAgents

📅 2026-02-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the performance degradation of multi-step web agents on long-horizon tasks, where small per-step errors accumulate and uniformly increasing inference compute yields diminishing returns. The authors propose CATTS (Confidence-Aware Test-Time Scaling), which introduces a dynamic compute allocation mechanism driven by voting uncertainty into multi-step web agents. Using uncertainty statistics derived from the agent's own vote distribution, namely entropy and the top-1/top-2 margin, together with an LLM-based Arbiter aggregation strategy, the method allocates additional samples only when decision disagreement is high. Evaluated on WebArena-Lite and GoBrowse, CATTS achieves up to a 9.1% higher success rate than ReAct while using up to 2.3× fewer tokens than uniform scaling, enabling efficient and interpretable test-time scaling.

📝 Abstract
Test-time scaling has become a standard way to improve the performance and reliability of neural network models. However, its behavior on agentic, multi-step tasks remains less well understood: small per-step errors can compound over long horizons, and we find that naive policies that uniformly increase sampling show diminishing returns. In this work, we present CATTS, a simple technique for dynamically allocating compute for multi-step agents. We first conduct an empirical study of inference-time scaling for web agents and find that uniformly increasing per-step compute quickly saturates in long-horizon environments. We then investigate stronger aggregation strategies, including an LLM-based Arbiter that can outperform naive voting but can also overrule high-consensus decisions. We show that uncertainty statistics derived from the agent's own vote distribution (entropy and top-1/top-2 margin) correlate with downstream success and provide a practical signal for dynamic compute allocation. Based on these findings, we introduce Confidence-Aware Test-Time Scaling (CATTS), which uses vote-derived uncertainty to allocate compute only when decisions are genuinely contentious. CATTS improves performance on WebArena-Lite and GoBrowse by up to 9.1% over ReAct while using up to 2.3x fewer tokens than uniform scaling, providing both efficiency gains and an interpretable decision rule.
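The abstract's core mechanism, deriving entropy and a top-1/top-2 margin from the agent's per-step vote distribution and escalating sampling only when those signals indicate a contentious decision, can be sketched as below. This is a minimal illustration, not the paper's implementation; the function names and the thresholds `entropy_thresh` and `margin_thresh` are assumptions for the example.

```python
# Illustrative sketch of a confidence-aware test-time scaling rule.
# All names and threshold values are hypothetical, not taken from the paper.
import math
from collections import Counter

def vote_uncertainty(votes):
    """Entropy and top-1/top-2 margin of a vote distribution over candidate actions."""
    counts = Counter(votes)
    total = len(votes)
    probs = [c / total for c in counts.values()]
    entropy = -sum(p * math.log(p) for p in probs)
    ranked = sorted(probs, reverse=True)
    margin = ranked[0] - (ranked[1] if len(ranked) > 1 else 0.0)
    return entropy, margin

def extra_samples(votes, extra_k=5, entropy_thresh=0.9, margin_thresh=0.3):
    """Allocate extra_k additional samples only when the vote is contentious:
    high entropy or a small gap between the top two candidates."""
    entropy, margin = vote_uncertainty(votes)
    contentious = entropy > entropy_thresh or margin < margin_thresh
    return extra_k if contentious else 0
```

A unanimous vote such as `["click_submit"] * 3` yields zero entropy and a margin of 1.0, so no extra compute is spent, while a three-way split triggers the additional samples; this is what makes the decision rule both cheap and interpretable.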
Problem

Research questions and friction points this paper is trying to address.

test-time scaling
web agents
multi-step tasks
compute allocation
decision uncertainty
Innovation

Methods, ideas, or system contributions that make the work stand out.

Test-Time Scaling
Agentic Reasoning
Dynamic Compute Allocation
Uncertainty Estimation
Web Agents