WAFFLE: Multi-Modal Model for Automated Front-End Development

📅 2024-10-24
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
To address two key challenges in automated HTML generation from UI design mockups, (1) modeling HTML's hierarchical structure and (2) aligning visual and textual representations across modalities, this paper proposes Waffle, a fine-tuning strategy that combines a structure-aware attention mechanism with cross-modal contrastive learning. Within a large language model (LLM) framework, the approach jointly targets syntactic correctness, visual fidelity, and semantic consistency, and the authors introduce a new benchmark, WebSight-Test. On WebSight-Test and the existing Design2Code benchmark, models fine-tuned with Waffle achieve gains of up to +9.00 percentage points in HTML match rate, +0.0982 in CW-SSIM, +32.99 in CLIP similarity, and +27.12 percentage points in LLEM, outperforming current fine-tuning methods.

📝 Abstract
Web development involves turning UI designs into functional webpages, which can be difficult for both beginners and experienced developers due to the complexity of HTML's hierarchical structures and styles. While Large Language Models (LLMs) have shown promise in generating source code, two major challenges persist in UI-to-HTML code generation: (1) effectively representing HTML's hierarchical structure for LLMs, and (2) bridging the gap between the visual nature of UI designs and the text-based format of HTML code. To tackle these challenges, we introduce Waffle, a new fine-tuning strategy that uses a structure-aware attention mechanism to improve LLMs' understanding of HTML's structure and a contrastive fine-tuning approach to align LLMs' understanding of UI images and HTML code. Models fine-tuned with Waffle show up to 9.00 pp (percentage point) higher HTML match, 0.0982 higher CW-SSIM, 32.99 higher CLIP, and 27.12 pp higher LLEM on our new benchmark WebSight-Test and an existing benchmark Design2Code, outperforming current fine-tuning methods.
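The contrastive fine-tuning idea from the abstract, aligning UI-image embeddings with HTML-code embeddings, can be illustrated with a CLIP-style symmetric InfoNCE loss. This is a minimal sketch of the general technique, not the paper's exact formulation; the function name, temperature value, and the use of NumPy in place of a deep-learning framework are all illustrative assumptions.

```python
import numpy as np

def info_nce(img_emb, code_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired UI-image and
    HTML-code embeddings. Row i of each matrix is a matching pair;
    all other rows serve as in-batch negatives. (Toy sketch, not
    Waffle's actual implementation.)"""
    # L2-normalize so dot products are cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    code = code_emb / np.linalg.norm(code_emb, axis=1, keepdims=True)
    logits = img @ code.T / temperature   # (B, B); matches on the diagonal
    labels = np.arange(len(logits))

    def xent(l):
        # cross-entropy with the diagonal as the target class
        l = l - l.max(axis=1, keepdims=True)          # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average the image->code and code->image directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

Minimizing this loss pulls each UI screenshot's embedding toward the embedding of its own HTML source and pushes it away from the other pages in the batch, which is the alignment objective the abstract describes.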
Problem

Research questions and friction points this paper is trying to address.

Improving LLMs' understanding of HTML hierarchical structures
Bridging visual UI designs and text-based HTML code
Enhancing UI-to-HTML generation accuracy and performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Structure-aware attention mechanism for HTML hierarchy
Contrastive fine-tuning for UI-image-to-HTML alignment
Improved HTML generation via multimodal model finetuning
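One plausible way to realize a structure-aware attention mechanism over an HTML tree is to mask attention so each node attends only to itself and its ancestors. The sketch below is a hypothetical illustration of that idea, assuming nodes are given as parent-pointer indices; the paper's actual mechanism may differ.

```python
def ancestor_attention_mask(parents):
    """Boolean attention mask over HTML nodes: node i may attend to
    itself and to every ancestor on its path to the root.
    `parents[i]` is the index of node i's parent, or -1 for the root.
    (Illustrative sketch of structure-aware masking, not Waffle's
    exact formulation.)"""
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:          # walk up the tree to the root
            mask[i][j] = True
            j = parents[j]
    return mask
```

For example, with `parents = [-1, 0, 0, 1]` (a root with two children, the first of which has its own child), node 3 may attend to nodes 0, 1, and itself, but not to its "uncle" node 2, so the attention pattern mirrors the document hierarchy.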