Recon-Act: A Self-Evolving Multi-Agent Browser-Use System via Web Reconnaissance, Tool Generation, and Task Execution

📅 2025-09-25
🤖 AI Summary
Problem: Multimodal browser agents struggle with long-horizon, multi-step tasks on real-world web pages, exhibiting action inconsistency and excessive trial and error. Method: We propose a self-evolving multi-agent framework grounded in a “Recon–Act” paradigm. It employs two collaborating teams (reconnaissance and action), intent decomposition, tool orchestration, and rule-based code generation, leveraging vision-language models for web understanding and decision-making. A closed-loop training mechanism, driven by contrastive trajectory analysis, enables real-time tool abstraction, archival, and generalization. Contribution/Results: The framework significantly improves robustness to unseen websites and extended-duration tasks. On VisualWebArena, it achieves state-of-the-art performance, with higher task completion rates and substantially fewer trial-and-error attempts. The system currently realizes Level 3 of a six-level evolutionary roadmap, supporting autonomous evolution under limited human supervision.

📝 Abstract
In recent years, multimodal models have made remarkable strides, paving the way for intelligent browser-use agents. However, when solving tasks on real-world webpages over multi-turn, long-horizon trajectories, current agents still suffer from disordered action sequencing and excessive trial and error during execution. This paper introduces Recon-Act, a self-evolving multi-agent framework grounded in the Reconnaissance-Action behavioral paradigm. The system comprises a Reconnaissance Team and an Action Team: the former conducts comparative analysis and tool generation, while the latter handles intent decomposition, tool orchestration, and execution. By contrasting erroneous trajectories with successful ones, the Reconnaissance Team infers remedies, abstracts them into a unified notion of generalized tools, expressed either as hints or as rule-based code, and registers them in the tool archive in real time. The Action Team then re-runs inference equipped with these targeted tools, establishing a closed-loop training pipeline of data-tools-action-feedback. Following the six-level implementation roadmap proposed in this work, we have currently reached Level 3 (with limited human-in-the-loop intervention). Leveraging generalized tools obtained through reconnaissance, Recon-Act substantially improves adaptability to unseen websites and solvability on long-horizon tasks, and achieves state-of-the-art performance on the challenging VisualWebArena dataset.
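To make the "generalized tool" notion more concrete, here is a minimal Python sketch of an archive that holds both kinds of tools the abstract mentions: hints and rule-based code. It is an illustration under stated assumptions, not the paper's implementation; every name in it (`HintTool`, `CodeTool`, `ToolArchive`, `first_price`) is hypothetical.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Union


@dataclass
class HintTool:
    """Hypothetical generalized tool expressed as a natural-language hint."""
    name: str
    hint: str

    def apply(self, observation: str) -> str:
        # A hint does not inspect the page; it is advice added to the agent's context.
        return self.hint


@dataclass
class CodeTool:
    """Hypothetical generalized tool expressed as rule-based code."""
    name: str
    rule: Callable[[str], str]

    def apply(self, observation: str) -> str:
        # A code tool computes its result from the current observation.
        return self.rule(observation)


@dataclass
class ToolArchive:
    """Hypothetical archive; a registered tool is visible to the next lookup."""
    tools: Dict[str, Union[HintTool, CodeTool]] = field(default_factory=dict)

    def register(self, tool: Union[HintTool, CodeTool]) -> None:
        self.tools[tool.name] = tool

    def names(self) -> List[str]:
        return sorted(self.tools)


def first_price(observation: str) -> str:
    """Toy rule: return the first dollar amount found in the text, or ''."""
    match = re.search(r"\$(\d+(?:\.\d+)?)", observation)
    return match.group(1) if match else ""
```

Both tool kinds share one `apply` method, so in this sketch the caller does not need to know whether a given tool is a hint or code.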
Problem

Research questions and friction points this paper is trying to address.

Addresses disordered action sequencing in multi-turn web tasks
Reduces excessive trial and error during browser automation
Improves adaptability to unseen websites for long-horizon tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-evolving multi-agent framework with Reconnaissance-Action paradigm
Reconnaissance Team generates generalized tools from error analysis
Action Team executes tasks using real-time updated tool archive
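
The three points above can be read as a closed loop: a failed run is compared with a successful one, the difference becomes a tool, and the agent re-runs with that tool. The Python sketch below is a toy illustration under simplifying assumptions: trajectories are plain lists of action strings, the only tool kind is a text hint, and the "Action Team" is a hard-coded stand-in. All names (`first_divergence`, `infer_hint`, `closed_loop`, `toy_agent`, `REFERENCE`) are hypothetical and not taken from the paper.

```python
from typing import Callable, List, Optional, Tuple

Trajectory = List[str]  # assumption: a trajectory is an ordered list of action names

# Hypothetical successful trajectory used as the comparison target.
REFERENCE: Trajectory = ["open_listing", "sort_by_price", "click_first_item"]


def first_divergence(failed: Trajectory, successful: Trajectory) -> Optional[int]:
    """Return the index of the first step where the trajectories differ, if any."""
    for i, (a, b) in enumerate(zip(failed, successful)):
        if a != b:
            return i
    if len(failed) != len(successful):
        return min(len(failed), len(successful))
    return None


def infer_hint(failed: Trajectory, successful: Trajectory) -> Optional[str]:
    """Toy reconnaissance step: turn the first divergence into a text hint."""
    i = first_divergence(failed, successful)
    if i is None or i >= len(successful):
        return None
    return f"At step {i}, prefer action: {successful[i]}"


def toy_agent(archive: List[str]) -> Tuple[Trajectory, bool]:
    """Stand-in for the Action Team: it skips sorting unless a hint mentions it."""
    if any("sort_by_price" in hint for hint in archive):
        trajectory = ["open_listing", "sort_by_price", "click_first_item"]
    else:
        trajectory = ["open_listing", "click_first_item"]
    return trajectory, trajectory == REFERENCE


def closed_loop(
    run_agent: Callable[[List[str]], Tuple[Trajectory, bool]],
    reference: Trajectory,
    archive: List[str],
    max_rounds: int = 3,
) -> Tuple[Trajectory, bool]:
    """Re-run the agent, registering one new hint after each failed attempt."""
    trajectory: Trajectory = []
    for _ in range(max_rounds):
        trajectory, success = run_agent(archive)
        if success:
            return trajectory, True
        hint = infer_hint(trajectory, reference)
        if hint is None or hint in archive:
            break  # nothing new to learn from this comparison
        archive.append(hint)
    return trajectory, False
```

Calling `closed_loop(toy_agent, REFERENCE, [])` fails on the first round, registers one hint about `sort_by_price`, and succeeds on the second. The real system presumably uses vision-language models where this sketch uses string comparison.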
Authors
Kaiwen He
AWorld Team, Inclusion AI
Zhiwei Wang
AWorld Team, Inclusion AI
Chenyi Zhuang
AIST, AIRC
machine learning
Jinjie Gu
Ant Group
machine learning, recommendation