🤖 AI Summary
Real-world web pages exhibit complex, verbose HTML/CSS structures, which hinders multimodal large language models (MLLMs) from effectively modeling long-range UI topological relationships. To address this, we propose UICopilot, an end-to-end, hierarchical frontend code generation framework. Its core contributions are: (1) a hierarchical generation paradigm that decouples output into three sequential stages (layout, component instantiation, and styling); (2) a structure-aware, multi-granularity prompting mechanism that explicitly encodes UI hierarchy and spatial relations; and (3) an MLLM architecture integrating a vision encoder with hierarchical instruction fine-tuning. Evaluated on WebCode2M, a large-scale real-world webpage dataset, UICopilot achieves a 32% improvement over baselines (e.g., GPT-4V) in automated metrics and is preferred by 87% of human evaluators. These results demonstrate substantial advances in both practical usability and structural fidelity for multimodal UI code generation.
📝 Abstract
Automating the synthesis of User Interfaces (UIs) plays a crucial role in enhancing productivity and accelerating the development lifecycle by reducing both development time and manual effort. The rapid development of Multimodal Large Language Models (MLLMs) has recently made it possible to generate front-end Hypertext Markup Language (HTML) code directly from webpage designs. However, real-world webpages combine a diverse array of HTML tags with complex stylesheets, which makes their code very long. Such long code poses challenges for both the performance and the efficiency of MLLMs, especially in capturing the structural information of UI designs. To address these challenges, this paper proposes UICopilot, a novel approach that automates UI synthesis through hierarchical code generation from webpage designs. To validate its effectiveness, we conduct experiments on a real-world dataset, WebCode2M. Experimental results demonstrate that UICopilot significantly outperforms existing baselines in both automatic evaluation metrics and human evaluations. In particular, statistical analysis shows that the majority of human annotators prefer the webpages generated by UICopilot over those produced by GPT-4V.
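To make the hierarchical idea concrete, the following is a minimal Python sketch of the three-stage order described in the summary above (layout first, then component instantiation, then styling). It is not UICopilot's implementation: the `Node` class, the function names, and the hard-coded skeleton are all hypothetical, and each stage is a deterministic stub standing in for what would be a model call in a real system.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """One element of the page tree (hypothetical structure, for illustration)."""
    tag: str
    bbox: tuple[int, int, int, int]  # (x, y, width, height) in pixels
    children: list[Node] = field(default_factory=list)
    text: str = ""
    style: dict[str, str] = field(default_factory=dict)


def generate_layout() -> Node:
    """Stage 1 stub: return a coarse skeleton holding only tags and boxes."""
    return Node("body", (0, 0, 800, 600), [
        Node("header", (0, 0, 800, 80)),
        Node("main", (0, 80, 800, 520)),
    ])


def instantiate_components(node: Node) -> None:
    """Stage 2 stub: fill leaf nodes with placeholder content, one at a time."""
    if not node.children:
        node.text = f"{node.tag} content"
    for child in node.children:
        instantiate_components(child)


def apply_styles(node: Node) -> None:
    """Stage 3 stub: derive per-node CSS from the node's own bounding box."""
    _, _, width, height = node.bbox
    node.style = {"width": f"{width}px", "height": f"{height}px"}
    for child in node.children:
        apply_styles(child)


def render(node: Node) -> str:
    """Serialize the finished tree to an HTML string."""
    css = ";".join(f"{key}:{value}" for key, value in node.style.items())
    attr = f' style="{css}"' if css else ""
    inner = node.text + "".join(render(child) for child in node.children)
    return f"<{node.tag}{attr}>{inner}</{node.tag}>"


def generate_page() -> str:
    """Run the three stages in order, then serialize the result."""
    root = generate_layout()
    instantiate_components(root)
    apply_styles(root)
    return render(root)
```

Calling `generate_page()` returns a single HTML string. The point of the sketch is that each later stage only has to handle one small node at a time, rather than emitting the whole page's markup and stylesheet in a single long sequence.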