WebVR: Benchmarking Multimodal LLMs for WebPage Recreation from Videos via Human-Aligned Visual Rubrics

📅 2026-03-11
🤖 AI Summary
Existing evaluations for webpage generation predominantly rely on textual prompts or static screenshots, which are inadequate for assessing a model's ability to reconstruct interactive workflows, temporal transitions, and motion continuity from videos. To address this gap, this work introduces the first dedicated benchmark for video-to-webpage generation, leveraging a controllable synthesis pipeline to produce 175 diverse web demonstration videos. A fine-grained visual scoring framework aligned with human preferences is also proposed. Experiments across 19 models reveal significant shortcomings in current approaches regarding the faithful reconstruction of detailed styles and dynamic effects. The proposed automatic scoring achieves 96% agreement with human judgments. The dataset, toolkit, and baseline results are publicly released.

📝 Abstract
Existing web-generation benchmarks rely on text prompts or static screenshots as input. However, videos naturally convey richer signals such as interaction flow, transition timing, and motion continuity, which are essential for faithful webpage recreation. Despite this potential, video-conditioned webpage generation remains largely unexplored, with no dedicated benchmark for this task. To fill this gap, we introduce WebVR, a benchmark that evaluates whether MLLMs can faithfully recreate webpages from demonstration videos. WebVR contains 175 webpages across diverse categories, all constructed through a controlled synthesis pipeline rather than web crawling, ensuring varied and realistic demonstrations without overlap with existing online pages. We also design a fine-grained, human-aligned visual rubric that evaluates the generated webpages across multiple dimensions. Experiments on 19 models reveal substantial gaps in recreating fine-grained style and motion quality, while the rubric-based automatic evaluation achieves 96% agreement with human preferences. We release the dataset, evaluation toolkit, and baseline results to support future research on video-to-webpage generation.
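The abstract reports 96% agreement between the rubric-based automatic evaluation and human preferences but does not say here how agreement is computed. One common way to measure it is pairwise: for each pair of generated pages that a human ranked, check whether the automatic score orders them the same way. The sketch below illustrates that idea only. The function name `pairwise_agreement`, the page identifiers, and the scores are hypothetical and are not taken from the WebVR toolkit.

```python
from typing import Dict, List, Tuple


def pairwise_agreement(
    scores: Dict[str, float],
    human_prefs: List[Tuple[str, str]],
) -> float:
    """Return the fraction of human preference pairs that the scores reproduce.

    Each entry of `human_prefs` is (preferred_page, other_page). A pair counts
    as agreement only when the preferred page has a strictly higher score.
    This is an assumed protocol for illustration, not the paper's definition.
    """
    if not human_prefs:
        raise ValueError("human_prefs must contain at least one pair")
    agreeing = sum(
        1 for preferred, other in human_prefs if scores[preferred] > scores[other]
    )
    return agreeing / len(human_prefs)


# Hypothetical rubric scores for three generated pages.
scores = {"page_a": 0.82, "page_b": 0.64, "page_c": 0.71}

# Hypothetical human judgments: (preferred, other).
prefs = [("page_a", "page_b"), ("page_a", "page_c"), ("page_b", "page_c")]

rate = pairwise_agreement(scores, prefs)
print(f"agreement: {rate:.2%}")
```

In this toy data the scores match the human ranking on two of the three pairs. Treating ties as disagreement is a deliberate, conservative choice in this sketch; a real protocol might handle ties differently.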
Problem

Research questions and friction points this paper is trying to address.

Webpage Generation
Video Conditioning
Multimodal LLMs
Benchmarking
Human-Aligned Evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

WebVR
video-conditioned webpage generation
multimodal LLMs
human-aligned visual rubric
controlled synthesis pipeline