WebMMU: A Benchmark for Multimodal Multilingual Website Understanding and Code Generation

📅 2025-08-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current multimodal large language models (MLLMs) face significant challenges in multilingual web understanding and code generation: weak multi-step reasoning, imprecise UI element localization, insufficient comprehension of functional interfaces, difficulty preserving functionality during code editing, and structural degradation (particularly in hierarchy integrity and multilingual support) during design-to-code translation. To address these gaps, we propose WebMMU, the first multimodal evaluation benchmark tailored to real-world multilingual web pages, unifying three core tasks: visual question answering, code editing, and design-to-code generation. Its key innovation is the integration of multilingual webpage screenshots, the corresponding HTML/CSS/JS source code, and expert annotations into a single end-to-end evaluation framework. Empirical analysis reveals that while state-of-the-art models perform reasonably well on basic information extraction, they exhibit substantial deficiencies in functionality-preserving edits, cross-lingual layout generation, and multi-step semantic reasoning, establishing clear, actionable directions for future research.

📝 Abstract
We present WebMMU, a multilingual benchmark that evaluates three core web tasks: (1) website visual question answering, (2) code editing involving HTML/CSS/JavaScript, and (3) mockup-to-code generation. Unlike prior benchmarks that treat these tasks separately, WebMMU unifies them using expert-annotated, real-world web data to assess models' abilities in complex multi-step reasoning, precise element grounding, and functional UI comprehension and coding. Our evaluation shows that while multimodal large language models (MLLMs) perform well on basic information extraction, they struggle with reasoning and grounding, editing code to preserve functionality, and generating design-to-code that maintains hierarchy and supports multilingual content. These findings reveal key limitations in current MLLMs and underscore the need for improved multimodal and cross-lingual reasoning to build future web agents capable of automating diverse web development tasks.
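The abstract notes that design-to-code generation is judged partly on whether the produced page preserves the reference layout hierarchy. The paper's exact metric is not given in this summary, but a minimal, hypothetical sketch of such a structural check can be written with Python's standard library, comparing the sets of root-to-node tag paths in the reference and generated HTML:

```python
from html.parser import HTMLParser


class TagTreeParser(HTMLParser):
    """Collect the nesting structure of an HTML document as root-to-node tag paths."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags
        self.paths = []   # one "a/b/c" path per opened tag

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        self.paths.append("/".join(self.stack))

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()


def hierarchy_f1(reference_html, generated_html):
    """F1 overlap of tag paths: a crude proxy for hierarchy preservation.

    This is an illustrative stand-in, not the benchmark's published metric.
    """
    def paths(html):
        parser = TagTreeParser()
        parser.feed(html)
        return set(parser.paths)

    ref, gen = paths(reference_html), paths(generated_html)
    if not ref or not gen:
        return 0.0
    common = len(ref & gen)
    if common == 0:
        return 0.0
    precision = common / len(gen)
    recall = common / len(ref)
    return 2 * precision * recall / (precision + recall)
```

A generated page with the same nesting as the reference scores 1.0, while one that keeps the content but flattens or swaps the structure scores lower; real hierarchy metrics would additionally weight node order and multiplicity.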
Problem

Research questions and friction points this paper is trying to address.

Evaluating multimodal multilingual website understanding and code generation capabilities
Assessing complex multi-step reasoning and precise element grounding in web tasks
Testing functional UI comprehension and multilingual code generation abilities
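Precise element grounding of the kind listed above is typically scored by comparing a predicted bounding box against a reference box with intersection-over-union (IoU); the summary does not state WebMMU's scoring rule, so the following is only an illustrative sketch using the common detection-style convention:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) pixel boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap rectangle (may be empty).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    if inter == 0:
        return 0.0
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    return inter / (area_a + area_b - inter)


def grounding_correct(pred_box, gold_box, threshold=0.5):
    """Count a localization as correct above an IoU threshold (0.5 is a
    common detection convention; the benchmark's own threshold may differ)."""
    return iou(pred_box, gold_box) >= threshold
```

For example, a prediction shifted by half the element's width against a 10x10 reference box yields an IoU of 1/3 and would be marked incorrect at the 0.5 threshold.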
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified benchmark integrating three web tasks with expert annotations
Uses real-world multilingual web data for multimodal evaluation
Assesses multi-step reasoning and functional UI coding abilities