🤖 AI Summary
This work addresses the lack of systematic evaluation benchmarks for end-to-end, vision-driven web development, particularly in complex full-stack scenarios. To bridge this gap, we propose Vision2Web—the first hierarchical benchmark for visual website generation—structured across three tiers: static UI generation, interactive multi-page frontend replication, and long-horizon full-stack development. Built from real-world website data, Vision2Web comprises 193 tasks, 918 prototype images, and 1,255 test cases. It introduces a dual-component evaluation paradigm that combines a GUI agent verifier with a vision-language model (VLM) judge to reliably assess both the functional correctness and the semantic fidelity of generated outputs. Experimental results reveal significant performance bottlenecks in current models on full-stack development tasks.
📝 Abstract
Recent advances in large language models have improved the capabilities of coding agents, yet systematic evaluation of complex, end-to-end website development remains limited. To address this gap, we introduce Vision2Web, a hierarchical benchmark for visual website development spanning static UI-to-code generation, interactive multi-page frontend reproduction, and long-horizon full-stack website development. The benchmark is constructed from real-world websites and comprises 193 tasks across 16 categories, with 918 prototype images and 1,255 test cases. To support flexible, thorough, and reliable evaluation, we propose a workflow-based agent-verification paradigm built on two complementary components: a GUI agent verifier and a VLM-based judge. We evaluate multiple vision-language models instantiated within different coding-agent frameworks, revealing substantial performance gaps at all task levels, with state-of-the-art models still struggling on full-stack development.