MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation

📅 2026-04-16
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
Existing AIGC tools for automated webpage generation often produce inconsistent styles and visual fragmentation because elements are generated in isolation, without global coordination. To address this, the work proposes MM-WebAgent, a hierarchical multimodal agent framework for webpage generation that jointly optimizes global layout, local multimodal content, and their integration through hierarchical planning, iterative self-reflection, and coordinated use of AIGC tools. The study also introduces a benchmark and a multi-level evaluation protocol for multimodal webpage generation. Experiments show that the approach outperforms both code-generation and agent-based baselines, especially in generating and integrating multimodal elements.

πŸ“ Abstract
The rapid progress of Artificial Intelligence Generated Content (AIGC) tools enables images, videos, and visualizations to be created on demand for webpage design, offering a flexible and increasingly adopted paradigm for modern UI/UX. However, directly integrating such tools into automated webpage generation often leads to style inconsistency and poor global coherence, as elements are generated in isolation. We propose MM-WebAgent, a hierarchical agentic framework for multimodal webpage generation that coordinates AIGC-based element generation through hierarchical planning and iterative self-reflection. MM-WebAgent jointly optimizes global layout, local multimodal content, and their integration, producing coherent and visually consistent webpages. We further introduce a benchmark for multimodal webpage generation and a multi-level evaluation protocol for systematic assessment. Experiments demonstrate that MM-WebAgent outperforms code-generation and agent-based baselines, especially on multimodal element generation and integration. Code & Data: https://aka.ms/mm-webagent.
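
The control flow the abstract describes (plan globally, generate elements locally, then reflect and revise) can be illustrated with a toy sketch. This is not the authors' implementation: every name here (`PagePlan`, `ElementSpec`, `plan_page`, `generate_element`, `reflect`, `build_page`) is hypothetical, the "AIGC tool" is a stub that returns a dictionary instead of real media, and the reflection step only checks style agreement.

```python
from dataclasses import dataclass, field


@dataclass
class ElementSpec:
    """Hypothetical description of one multimodal element to generate."""
    kind: str          # e.g. "image", "video", "chart"
    prompt: str
    style: str = ""    # empty means "inherit the page-wide style"


@dataclass
class PagePlan:
    """Hypothetical global plan: one layout and one style for the whole page."""
    layout: str
    style: str
    elements: list = field(default_factory=list)


def plan_page(request: str) -> PagePlan:
    # Global planning stub. The second element deliberately carries a
    # conflicting style so the reflection loop has something to repair.
    return PagePlan(
        layout="hero+gallery",
        style="flat pastel",
        elements=[
            ElementSpec("image", f"hero banner for {request}"),
            ElementSpec("chart", f"key stats for {request}", style="neon"),
        ],
    )


def generate_element(spec: ElementSpec, page_style: str) -> dict:
    # Stand-in for an AIGC tool call: returns a description, not real media.
    return {"kind": spec.kind, "prompt": spec.prompt,
            "style": spec.style or page_style}


def reflect(plan: PagePlan, assets: list) -> list:
    # Return the indices of assets whose style deviates from the global plan.
    return [i for i, asset in enumerate(assets) if asset["style"] != plan.style]


def build_page(request: str, max_rounds: int = 3):
    plan = plan_page(request)
    assets = [generate_element(spec, plan.style) for spec in plan.elements]
    for _ in range(max_rounds):
        inconsistent = reflect(plan, assets)
        if not inconsistent:
            break
        for i in inconsistent:
            # Revise the local spec to match the global style, then regenerate.
            plan.elements[i].style = plan.style
            assets[i] = generate_element(plan.elements[i], plan.style)
    return plan, assets


plan, assets = build_page("a coffee shop")
```

In this sketch the global plan is the single source of truth for style, so an element generated in isolation with a conflicting style is caught by `reflect` and regenerated; a real system would presumably replace the equality check with a learned or model-based critique.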

Problem

Research questions and friction points this paper is trying to address.

AIGC
webpage generation
style inconsistency
global coherence
multimodal content

Innovation

Methods, ideas, or system contributions that make the work stand out.

hierarchical planning
multimodal webpage generation
AIGC integration
iterative self-reflection
visual consistency