🤖 AI Summary
This work addresses the performance degradation of multi-step agents on long-horizon tasks, driven by error accumulation and the diminishing returns of uniformly increasing inference compute. To this end, the authors propose CATTS, which introduces a dynamic compute-allocation mechanism based on voting uncertainty into multi-step web agents. By combining a voting aggregation strategy and an LLM arbitrator with uncertainty metrics derived from the vote distribution, such as entropy and the top-1/top-2 vote margin, the method allocates additional computational resources only when decision disagreement is high. Evaluated on WebArena-Lite and GoBrowse, CATTS achieves up to a 9.1% higher success rate than ReAct while reducing token consumption by up to 2.3x compared to uniform scaling, enabling efficient and interpretable test-time scaling.
📝 Abstract
Test-time scaling has become a standard way to improve the performance and reliability of neural network models. However, its behavior on agentic, multi-step tasks remains less well understood: small per-step errors can compound over long horizons, and we find that naive policies that uniformly increase sampling show diminishing returns. In this work, we present CATTS, a simple technique for dynamically allocating compute for multi-step agents. We first conduct an empirical study of inference-time scaling for web agents and find that uniformly increasing per-step compute quickly saturates in long-horizon environments. We then investigate stronger aggregation strategies, including an LLM-based Arbiter, which can outperform naive voting but may overrule high-consensus decisions. We show that uncertainty statistics derived from the agent's own vote distribution (entropy and top-1/top-2 margin) correlate with downstream success and provide a practical signal for dynamic compute allocation. Based on these findings, we introduce Confidence-Aware Test-Time Scaling (CATTS), which uses vote-derived uncertainty to allocate compute only when decisions are genuinely contentious. CATTS improves performance on WebArena-Lite and GoBrowse by up to 9.1% over ReAct while using up to 2.3x fewer tokens than uniform scaling, providing both efficiency gains and an interpretable decision rule.
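The core decision rule described above (measure entropy and the top-1/top-2 margin of the per-step vote distribution, and spend extra samples only when the vote is contentious) can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function names and the threshold and budget values are hypothetical.

```python
import math
from collections import Counter

def vote_uncertainty(votes):
    """Entropy (bits) and top-1/top-2 margin of a vote distribution.

    `votes` is a list of candidate actions sampled from the agent at one step.
    """
    counts = Counter(votes)
    total = len(votes)
    # Vote shares, sorted from most to least popular action.
    probs = sorted((c / total for c in counts.values()), reverse=True)
    entropy = -sum(p * math.log2(p) for p in probs)
    margin = probs[0] - (probs[1] if len(probs) > 1 else 0.0)
    return entropy, margin

def extra_budget(votes, entropy_thresh=1.0, margin_thresh=0.3, extra=8):
    """Confidence-aware allocation: 0 extra samples when the vote is decisive,
    `extra` when it is contentious (high entropy or a small top-1/top-2 margin).
    Thresholds here are illustrative, not values reported by the paper."""
    entropy, margin = vote_uncertainty(votes)
    if entropy > entropy_thresh or margin < margin_thresh:
        return extra
    return 0
```

For example, a unanimous vote like `["click"] * 5` has zero entropy and a full margin, so no extra compute is spent, while a split vote such as `["click", "scroll", "type", "back", "click"]` triggers the additional sampling budget.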