🤖 AI Summary
This work addresses the limited reproducibility and community collaboration around high-performance web agents, which often rely on closed-source models and opaque training data. We propose MolmoWeb, a fully open multimodal web agent that predicts browser actions solely from webpage screenshots and task instructions, without requiring HTML parsing or proprietary APIs. Our contributions include MolmoWebMix, a large-scale open-source dataset combining synthetic and human web-task demonstrations with GUI perception data, and a family of open-weight agents that build on a vision-language model to learn instruction-conditioned action policies from visual inputs alone. The agents support parallel rollouts with best-of-N selection at test time. Evaluated on WebVoyager and Online-Mind2Web, MolmoWeb-8B achieves pass@4 success rates of 94.7% and 60.5%, respectively, significantly outperforming existing open-source systems and even some closed-source counterparts.
📝 Abstract
Web agents--autonomous systems that navigate and execute tasks on the web on behalf of users--have the potential to transform how people interact with the digital world. However, the most capable web agents today rely on proprietary models with undisclosed training data and recipes, limiting scientific understanding, reproducibility, and community-driven progress.
We believe agents for the open web should be built in the open. To this end, we introduce (1) MolmoWebMix, a large and diverse mixture of browser task demonstrations and web-GUI perception data and (2) MolmoWeb, a family of fully open multimodal web agents. Specifically, MolmoWebMix combines over 100K synthetic task trajectories from multiple complementary generation pipelines with 30K+ human demonstrations, atomic web-skill trajectories, and GUI perception data, including referring expression grounding and screenshot question answering. MolmoWeb agents operate as instruction-conditioned vision-language action policies: given a task instruction and a webpage screenshot, they predict the next browser action, requiring no access to HTML, accessibility trees, or specialized APIs.
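To make this interface concrete, the sketch below shows what such a screenshot-only action loop could look like. The `Action` schema and the `policy`/`browser` interfaces are illustrative assumptions for this sketch, not the released MolmoWeb API.

```python
# Minimal sketch of an instruction-conditioned, screenshot-only agent loop.
# The Action schema and the policy/browser interfaces are assumptions made
# for illustration; they are not the released MolmoWeb API.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Action:
    kind: str                                  # e.g. "click", "type", "scroll", "stop"
    target: Optional[Tuple[int, int]] = None   # pixel coordinates in the screenshot
    text: Optional[str] = None                 # text to type, if any

def run_episode(policy, browser, instruction: str, max_steps: int = 30) -> bool:
    """Roll out one task: at every step the policy sees only the instruction
    and the current screenshot (no HTML, accessibility tree, or custom APIs)."""
    for _ in range(max_steps):
        screenshot = browser.screenshot()      # raw pixels of the current page
        action = policy.predict(instruction, screenshot)
        if action.kind == "stop":
            return True                        # the policy declares the task complete
        browser.execute(action)                # apply the predicted browser action
    return False                               # step budget exhausted without finishing
```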
Available in 4B and 8B sizes, MolmoWeb agents achieve state-of-the-art results on browser-use benchmarks such as WebVoyager, Online-Mind2Web, and DeepShop, outperforming similar-scale open-weight-only models such as Fara-7B, UI-Tars-1.5-7B, and Holo1-7B. MolmoWeb-8B also surpasses set-of-marks (SoM) agents built on much larger closed frontier models like GPT-4o. We further demonstrate consistent gains through test-time scaling via parallel rollouts with best-of-N selection, achieving 94.7% and 60.5% pass@4 (compared to 78.2% and 35.3% pass@1) on WebVoyager and Online-Mind2Web, respectively. We will release model checkpoints, training data, code, and a unified evaluation harness to enable reproducibility and accelerate open research on web agents.
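As a rough illustration of the test-time scaling scheme, the sketch below runs N independent rollouts of the same task in parallel and keeps the trajectory a scorer ranks highest. The `run_rollout` and `score` callables are assumptions (e.g. `score` could be a verifier or judge model); the abstract does not specify the selector.

```python
# Hedged sketch of parallel rollouts with best-of-N selection.
# run_rollout(instruction) is assumed to return one full trajectory, and
# score(trajectory) to return a float from some selector (e.g. a judge
# model); neither interface is specified by the abstract.
from concurrent.futures import ThreadPoolExecutor

def best_of_n(run_rollout, score, instruction: str, n: int = 4):
    """Run n independent attempts at the same task in parallel and
    return the trajectory the selector scores highest."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        trajectories = list(pool.map(lambda _: run_rollout(instruction), range(n)))
    return max(trajectories, key=score)
```

Under this setup, pass@n counts a task as solved if any of the n rollouts succeeds, so the reported pass@4 upper-bounds what a perfect selector could recover from four attempts.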