🤖 AI Summary
Current web agents generalize poorly to unseen environments, rely heavily on website-specific fine-tuning, and struggle to model the structural and dynamic properties of their environment, which leads to inefficient planning. This paper proposes a fine-tuning-free, memory-augmented modular architecture that constructs a lightweight cognitive map via exploratory interaction and integrates hierarchical planning, a world model, and forward-looking re-planning to enable action simulation and policy optimization within a learned cognitive space. Its core innovation lies in unifying the Actor-Critic framework with an executable action simulator and a critic module, jointly supporting plan execution, mental rehearsal, and real-time policy correction. Evaluated on WebArena-Lite, the approach achieves a 63.0% task success rate, substantially surpassing the prior state of the art (53.9%). Ablation studies confirm the significant contribution of each component.
📝 Abstract
We observe that current state-of-the-art web agents are unable to adapt effectively to new environments without neural network fine-tuning; without it, they produce inefficient execution plans because they lack awareness of the structure and dynamics of the new environment. To address this limitation, we introduce ATLAS (Actor-Critic Task-completion with Look-ahead Action Simulation), a memory-augmented agent that makes plans grounded in a model of the environment by simulating the consequences of candidate actions in cognitive space. Our agent starts by building a "cognitive map" through lightweight curiosity-driven exploration of the environment. The planner proposes candidate actions; the simulator predicts their consequences in cognitive space; a critic compares the options to select the best rollout and update the original plan; and a browser executor performs the chosen action. On the WebArena-Lite benchmark, we achieve a 63.0% success rate, compared with 53.9% for the previously published state of the art. Unlike previous systems, our modular architecture requires no website-specific LLM fine-tuning. Ablations show sizable drops without the world model, the hierarchical planner, and the look-ahead-based replanner, confirming their complementary roles within the design of our system.
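The planner–simulator–critic–executor loop described above can be sketched as a single decision step. This is a minimal illustrative sketch, not the paper's implementation: all names (`propose_actions`, `simulate`, `critique`, `Rollout`, the toy string-based cognitive map) are hypothetical stand-ins for the corresponding modules.

```python
from dataclasses import dataclass

# Hypothetical sketch of one ATLAS-style decision step:
# propose candidate actions, simulate each in cognitive space,
# score the rollouts with a critic, and return the best action
# for the browser executor. All names are illustrative.

@dataclass
class Rollout:
    action: str
    predicted_state: str
    score: float

def propose_actions(state: str, plan: list[str]) -> list[str]:
    # Planner: candidate next actions toward the current subgoal.
    return [f"click:{plan[0]}", f"type:{plan[0]}"]

def simulate(state: str, action: str, cognitive_map: dict) -> str:
    # World model: predict the action's consequence in cognitive space;
    # unknown actions are assumed to leave the state unchanged.
    return cognitive_map.get(action, state)

def critique(predicted_state: str, goal: str) -> float:
    # Critic: score a rollout by whether it reaches the goal (toy metric).
    return 1.0 if goal in predicted_state else 0.0

def step(state: str, plan: list[str], goal: str, cognitive_map: dict) -> Rollout:
    rollouts = []
    for action in propose_actions(state, plan):
        predicted = simulate(state, action, cognitive_map)
        rollouts.append(Rollout(action, predicted, critique(predicted, goal)))
    # Select the best-scoring rollout; the executor would then
    # perform rollout.action in the real browser.
    return max(rollouts, key=lambda r: r.score)

cognitive_map = {"click:search": "page:results for goal"}
best = step("page:home", ["search"], "goal", cognitive_map)
print(best.action)  # → click:search
```

In the full system each stub would be an LLM-backed module, and the look-ahead replanner would repeat this simulate-and-critique cycle before committing to execution.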