AI Summary
This work addresses limitations in reinforcement learning (RL) for visual web agents: existing training data is small in scale and low in diversity, consisting of offline trajectories or a handful of simulated environments that fail to capture the variety of the real web. The authors propose Weblica (Web Replica), a framework for constructing reproducible and scalable web environments grounded in real websites. It uses HTTP-level caching to capture and replay stable visual states while preserving interactive behavior, and LLM-based synthesis to generate diverse environments built around core web navigation skills, which allows RL training to scale to thousands of environments and tasks. The resulting Weblica-8B model outperforms open-weight models of similar size on multiple web navigation benchmarks while using fewer inference steps, scales favorably with additional test-time compute, and is competitive with API models.
Abstract
The web is complex, open-ended, and constantly changing, making it challenging to scale training data for visual web agents. Existing data collection attempts remain limited to offline trajectories for supervised fine-tuning or a handful of simulated environments for RL training, thus failing to capture web diversity. We propose Weblica (Web Replica), a framework for constructing reproducible and scalable web environments. Our framework leverages 1) HTTP-level caching to capture and replay stable visual states while preserving interactive behavior and 2) LLM-based environment synthesis grounded in real-world websites and core web navigation skills. Using this framework, we scale RL training to thousands of diverse environments and tasks. Our best model, Weblica-8B, outperforms open-weight baselines of similar size across multiple web navigation benchmarks while using fewer inference steps, scales favorably with additional test-time compute, and is competitive with API models.
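To make the idea of HTTP-level record and replay concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not describe how requests are keyed or how cache misses are handled, so the choices here are assumptions: responses are keyed by a hash of method, URL, and request body, and a miss in replay mode returns a 504 rather than reaching the live site. The names `ReplayCache`, `CachedResponse`, `request_key`, and the `fetch` callable are hypothetical.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class CachedResponse:
    """A stored HTTP response (hypothetical structure for illustration)."""
    status: int
    headers: Tuple[Tuple[str, str], ...]
    body: bytes


def request_key(method: str, url: str, body: bytes = b"") -> str:
    """Derive a stable cache key from the request (assumed keying scheme)."""
    digest = hashlib.sha256()
    digest.update(method.upper().encode("utf-8"))
    digest.update(b"\n")
    digest.update(url.encode("utf-8"))
    digest.update(b"\n")
    digest.update(body)
    return digest.hexdigest()


class ReplayCache:
    """Record responses once, then serve identical bytes on every replay.

    `fetch` is any callable that performs the real request; it is only
    invoked in "record" mode and only for requests not yet stored.
    """

    def __init__(self, fetch: Callable[[str, str, bytes], CachedResponse]):
        self._fetch = fetch
        self._store: Dict[str, CachedResponse] = {}
        self.mode = "record"  # switch to "replay" to freeze the environment

    def handle(self, method: str, url: str, body: bytes = b"") -> CachedResponse:
        key = request_key(method, url, body)
        if key in self._store:
            # Cache hit: the page state is reproduced exactly as recorded.
            return self._store[key]
        if self.mode == "replay":
            # Assumption: never touch the live web during replay.
            return CachedResponse(504, (), b"request not found in cache")
        response = self._fetch(method, url, body)
        self._store[key] = response
        return response
```

In a sketch like this, the browser still executes the page's scripts and handles clicks normally, and only the network layer is frozen. That is one plausible way to keep visual states stable across RL rollouts while preserving interactive behavior.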