WebGen-V Bench: Structured Representation for Enhancing Visual Design in LLM-based Web Generation and Evaluation

📅 2025-10-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Current LLM-driven web page generation suffers from coarse-grained evaluation and low data quality because it lacks structured representations of visual design. Method: This paper introduces an instruction-to-HTML generation benchmark tailored to real-world scenarios, which the authors describe as the first of its kind. It proposes a scalable agent-based crawling framework, adopts a structured, segmented HTML representation (enriched with JSON metadata and spatially aligned local UI screenshots), and designs a multimodal, section-wise evaluation protocol that checks image-text consistency. Evaluation uses multimodal large models to assess layout fidelity, content correctness, and visual alignment at fine granularity. Contribution/Results: The work establishes what the authors describe as the first end-to-end, high-granularity closed loop ("generation → structured representation → section-wise evaluation") for web generation. Experiments demonstrate significant improvements in photorealism, structural coherence, and cross-modal alignment of generated web pages.

📝 Abstract

Building on recent advances in leveraging LLMs for coding and multimodal understanding, we present WebGen-V, a new benchmark and framework for instruction-to-HTML generation that enhances both data quality and evaluation granularity. WebGen-V contributes three key innovations: (1) an unbounded and extensible agentic crawling framework that continuously collects real-world webpages and can be leveraged to augment existing benchmarks; (2) a structured, section-wise data representation that integrates metadata, localized UI screenshots, and JSON-formatted text and image assets, providing explicit alignment between content, layout, and visual components for detailed multimodal supervision; and (3) a section-level multimodal evaluation protocol aligning text, layout, and visuals for high-granularity assessment. Experiments with state-of-the-art LLMs and ablation studies validate the effectiveness of our structured data and section-wise evaluation, as well as the contribution of each component. To the best of our knowledge, WebGen-V is the first work to enable high-granularity agentic crawling and evaluation for instruction-to-HTML generation, providing a unified pipeline from real-world data acquisition and webpage generation to structured multimodal assessment.
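To make the section-wise data representation more concrete, the following is a minimal Python sketch of what one crawled page record might look like. The abstract only states that each section combines metadata, a localized UI screenshot, and JSON-formatted text and image assets; the class names (`Section`, `PageRecord`) and every field name below are assumptions for illustration, not the schema used by WebGen-V.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class Section:
    """One page section. All field names are hypothetical."""
    section_id: str
    html: str                 # HTML fragment covering only this section
    screenshot_path: str      # localized UI screenshot aligned with the fragment
    texts: List[str] = field(default_factory=list)    # text assets
    images: List[str] = field(default_factory=list)   # image asset references


@dataclass
class PageRecord:
    """A crawled page split into sections, plus page-level metadata."""
    url: str
    metadata: Dict[str, str]
    sections: List[Section] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() recurses into the nested Section dataclasses.
        return json.dumps(asdict(self), indent=2)


record = PageRecord(
    url="https://example.com",
    metadata={"title": "Example landing page"},
    sections=[
        Section(
            section_id="hero",
            html="<section><h1>Welcome</h1></section>",
            screenshot_path="shots/hero.png",
            texts=["Welcome"],
        )
    ],
)
serialized = record.to_json()
```

Keeping the HTML fragment, screenshot, and assets in the same record is what would let a model or an evaluator look at one section at a time instead of the whole page.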

Problem

Research questions and friction points this paper is trying to address.

Enhancing data quality for instruction-to-HTML webpage generation

Providing granular multimodal evaluation of layout and visuals

Enabling agentic crawling of real-world webpages for benchmarks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Agentic crawling framework collects real-world webpages continuously

Structured section-wise data representation integrates multimodal components

Section-level multimodal evaluation aligns text, layout, and visuals
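The section-level evaluation idea can be sketched as scoring each section separately and then aggregating per dimension. This is only an illustration under stated assumptions: the three dimension names, the plain averaging rule, and the `score_section` callback (standing in for a multimodal model judge) are hypothetical and are not taken from the paper.

```python
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical dimension names, mirroring "text, layout, and visuals".
DIMENSIONS = ("text", "layout", "visual")


def evaluate_page(
    sections: List[dict],
    score_section: Callable[[dict], Dict[str, float]],
) -> Dict[str, float]:
    """Score every section, then average each dimension across sections.

    `score_section` is a placeholder for a multimodal judge that returns
    one score per dimension for a single section.
    """
    if not sections:
        raise ValueError("at least one section is required")
    per_section = [score_section(section) for section in sections]
    return {dim: mean(scores[dim] for scores in per_section) for dim in DIMENSIONS}


def dummy_scorer(section: dict) -> Dict[str, float]:
    # Stand-in scorer: reads precomputed scores stored on the section.
    return section["scores"]


page_scores = evaluate_page(
    [
        {"id": "hero", "scores": {"text": 1.0, "layout": 0.5, "visual": 0.0}},
        {"id": "footer", "scores": {"text": 0.0, "layout": 0.5, "visual": 1.0}},
    ],
    dummy_scorer,
)
```

Scoring per section before aggregating keeps a weak section visible in `per_section` style intermediate results, which a single page-level score would hide.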

🔎 Similar Papers

Design2Code: Benchmarking Multimodal Code Generation for Automated Front-End Engineering

2024-03-05 · Citations: 0