🤖 AI Summary
In real-world web automation, irreversible actions undermine the backtracking that search methods (e.g., tree search) depend on, and heavy reliance on test-time search also hurts efficiency.
Method: We propose a model-based planning paradigm. The WebDreamer framework uses LLMs as both world model and value function to simulate multi-step actions and evaluate their consequences before execution. We also introduce a scalable data synthesis and distillation pipeline, which eliminates reliance on sandboxed environments, and use it to train Dreamer-7B, a lightweight specialized world model.
Contributions/Results: On VisualWebArena, WebDreamer is competitive with tree search while being 4–5× more efficient. It substantially outperforms reactive baselines and works effectively on the real-world benchmarks Online-Mind2Web and Mind2Web-Live. Notably, Dreamer-7B performs comparably to GPT-4o, demonstrating the viability of efficient small-scale models for complex web navigation planning.
📝 Abstract
Language agents based on large language models (LLMs) have demonstrated great promise in automating web-based tasks. Recent work has shown that incorporating advanced planning algorithms, e.g., tree search, is advantageous over reactive planning for web agents. However, unlike simulated sandbox environments, real-world environments such as the web are rife with irreversible actions. This undermines the feasibility of backtracking, a cornerstone of (tree) search. Over-reliance on test-time search also hurts efficiency. We advocate model-based planning for web agents, which employs a world model to simulate and deliberate over the outcome of each candidate action before committing to one. We systematically explore this paradigm by (1) proposing a model-based planning framework, WebDreamer, which employs LLMs to serve as both world models and value functions; and (2) training specialized LLMs as world models with a scalable data synthesis pipeline. Empirical results demonstrate that WebDreamer achieves substantial performance improvements over reactive baselines. It is competitive with tree search in sandbox environments (VisualWebArena) while being 4-5 times more efficient, and it also works effectively on real-world websites (Online-Mind2Web and Mind2Web-Live). Furthermore, our trained world model, Dreamer-7B, performs comparably to GPT-4o, highlighting the potential of specialized world models for efficient and effective planning in complex web environments.
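
To make the simulate-then-commit idea concrete, here is a minimal sketch of one planning step. It is an illustration under stated assumptions, not the WebDreamer implementation: `choose_action`, `toy_world_model`, and `toy_value_fn` are hypothetical names, observations and actions are plain strings, and the two toy functions stand in for the LLM calls that would act as world model and value function.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical interfaces (assumptions for this sketch, not the paper's API):
# a world model predicts the next observation, a value function scores it.
WorldModel = Callable[[str, str], str]   # (observation, action) -> predicted observation
ValueFn = Callable[[str, str], float]    # (goal, predicted observation) -> score


def choose_action(
    goal: str,
    observation: str,
    candidates: Sequence[str],
    world_model: WorldModel,
    value_fn: ValueFn,
) -> Tuple[str, float]:
    """Simulate every candidate action and return the best one with its score."""
    if not candidates:
        raise ValueError("no candidate actions to evaluate")
    scored = []
    for action in candidates:
        # Simulation only: nothing is executed on the real website here,
        # so an irreversible action cannot be triggered by planning.
        predicted = world_model(observation, action)
        scored.append((value_fn(goal, predicted), action))
    best_score, best_action = max(scored, key=lambda pair: pair[0])
    return best_action, best_score


# Toy stand-ins for the LLM-backed components (assumptions for the demo).
def toy_world_model(observation: str, action: str) -> str:
    return f"{observation} -> after {action}"


def toy_value_fn(goal: str, predicted: str) -> float:
    return 1.0 if goal in predicted else 0.0


action, score = choose_action(
    goal="checkout",
    observation="cart page",
    candidates=["click 'continue shopping'", "click 'checkout'", "click 'remove item'"],
    world_model=toy_world_model,
    value_fn=toy_value_fn,
)
print(action, score)
```

Only the selected action would then be executed in the real environment, which is why this style of planning does not depend on being able to undo a step the way backtracking search does.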