Thinking vs. Doing: Agents that Reason by Scaling Test-Time Interaction

📅 2025-06-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing test-time scaling paradigms rely on static "think-then-act" reasoning, without real-time environmental feedback or dynamic strategy adaptation during inference. This work introduces test-time interaction scaling, a paradigm that extends the number of environment interaction steps within a single inference trajectory, enabling exploration, backtracking, and online re-planning. The authors formally define and empirically validate interaction length as a test-time scaling dimension orthogonal to model size and per-step compute. They first show that a training-free, prompt-based mechanism for extending interaction already improves task success, then develop TTI (Test-Time Interaction), a curriculum-based online reinforcement learning framework that adaptively adjusts rollout lengths to balance exploration and exploitation. Built on a Gemma 3 12B model, TTI achieves state-of-the-art performance among open-source, open-data web agents on WebVoyager and WebArena, significantly improving task success rates and establishing interaction scaling as a complementary axis to conventional compute scaling.

📝 Abstract
The current paradigm of test-time scaling relies on generating long reasoning traces ("thinking" more) before producing a response. In agent problems that require interaction, this can be done by generating thinking traces before acting in the world. However, this process does not allow agents to acquire new information from the environment or adapt their behavior over time. In this work, we propose to scale test-time interaction, an untapped dimension of test-time scaling that increases the agent's interaction horizon to enable rich behaviors such as exploration, backtracking, and dynamic re-planning within a single rollout. To demonstrate the promise of this scaling dimension, we study the domain of web agents. We first show that even prompting-based interaction scaling without any training can improve task success on web benchmarks non-trivially. Building on this, we introduce TTI (Test-Time Interaction), a curriculum-based online reinforcement learning (RL) approach that trains agents by adaptively adjusting their rollout lengths. Using a Gemma 3 12B model, TTI produces state-of-the-art open-source, open-data web agents on the WebVoyager and WebArena benchmarks. We further show that TTI enables agents to balance exploration and exploitation adaptively. Our results establish interaction scaling as a powerful, complementary axis to scaling per-step compute, offering new avenues for training adaptive agents.
Problem

Research questions and friction points this paper is trying to address.

Enhancing agent adaptability through test-time interaction scaling
Improving web agent performance via dynamic interaction strategies
Balancing exploration and exploitation in adaptive agent training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Scaling test-time interaction for adaptive behaviors
Curriculum-based online RL with adjustable rollouts
Balancing exploration and exploitation adaptively
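The curriculum-based rollout adjustment above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the linear schedule shape, the function names, and the toy rollout loop are all hypothetical, not the paper's actual TTI implementation.

```python
# Illustrative sketch: grow the allowed interaction horizon over training
# (curriculum), then roll out up to that many environment steps.
# Schedule shape and names are assumptions, not TTI's real algorithm.

def horizon_schedule(step: int, h_min: int = 4, h_max: int = 32,
                     warmup_steps: int = 100) -> int:
    """Linearly increase the permitted rollout length from h_min to h_max."""
    frac = min(step / warmup_steps, 1.0)
    return int(h_min + frac * (h_max - h_min))

def rollout(env_step, policy, max_steps: int):
    """Run one trajectory, stopping early if the policy signals 'done'."""
    obs, trajectory = "start", []
    for _ in range(max_steps):
        action = policy(obs)
        trajectory.append(action)
        if action == "done":
            break
        obs = env_step(obs, action)
    return trajectory
```

Longer horizons late in training give the agent room to explore and backtrack, while short early horizons keep initial rollouts cheap and exploitative.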
Junhong Shen
Ph.D. student in Machine Learning, Carnegie Mellon University
Hao Bai
University of Illinois Urbana-Champaign
Lunjun Zhang
University of Toronto
Artificial Intelligence, Robotics
Yifei Zhou
University of California, Berkeley
Amrith Rajagopal Setlur
Carnegie Mellon University
Shengbang Tong
NYU Courant
AI, Computer Vision, Deep Learning, Representation Learning
Diego Caples
The AGI Company
Nan Jiang
University of Illinois Urbana-Champaign
Tong Zhang
University of Illinois Urbana-Champaign
Ameet Talwalkar
CMU, Datadog
Machine Learning
Aviral Kumar
Carnegie Mellon University
AI, Reinforcement Learning