🤖 AI Summary
Existing design-to-code approaches built on multimodal large language models (MLLMs) struggle to preserve webpage layout fidelity during code generation.
Method: We propose LaTCoder, a novel framework built on Layout-as-Thought (LaT). It explicitly models layout structure by dividing the webpage design into image blocks, employs chain-of-thought prompting to guide MLLMs through stepwise spatial reasoning, and applies two assembly strategies (absolute positioning and MLLM-based assembly) with dynamic selection to achieve high-fidelity UI reconstruction.
Contribution/Results: LaTCoder is compatible with mainstream MLLMs, including DeepSeek-VL2, Gemini, and GPT-4o. With DeepSeek-VL2, it improves TreeBLEU by 66.67% and reduces mean absolute error (MAE) by 38% compared to direct prompting. Human evaluation shows that annotators prefer its outputs in over 60% of cases, especially for complex layouts. Crucially, LaTCoder is the first method to explicitly encode layout structure as an interpretable "thought path", significantly enhancing both the layout consistency and the explainability of the generated code.
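The block-division step might look like the following heuristic sketch: split a grayscale screenshot into horizontal blocks wherever a sufficiently tall run of near-white rows forms a gutter. The function `split_rows` and its thresholds are illustrative assumptions, not the paper's actual division algorithm.

```python
from typing import List, Optional, Tuple

def split_rows(
    gray_rows: List[List[int]], min_gap: int = 12, white: int = 245
) -> List[Tuple[int, int]]:
    """Split a grayscale screenshot (rows of 0-255 pixel values) into
    horizontal blocks separated by whitespace gutters.

    A row is 'blank' when every pixel is near-white; a run of at least
    `min_gap` blank rows closes the current block. Illustrative heuristic
    only -- the paper's exact block-division algorithm may differ.
    """
    blocks: List[Tuple[int, int]] = []
    start: Optional[int] = None  # top row of the block being scanned
    run = 0                      # length of the current run of blank rows
    for y, row in enumerate(gray_rows):
        if all(p >= white for p in row):
            run += 1
            if start is not None and run >= min_gap:
                # Close the block at the first blank row of the gutter.
                blocks.append((start, y - run + 1))
                start = None
        else:
            run = 0
            if start is None:
                start = y
    if start is not None:  # trailing block reaching the bottom edge
        blocks.append((start, len(gray_rows)))
    return blocks
```

Each `(top, bottom)` range can then be cropped from the design image and sent to the MLLM as an independent block.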
📝 Abstract
Converting webpage designs into code (design-to-code) plays a vital role in User Interface (UI) development for front-end developers, bridging the gap between visual design and functional implementation. While recent Multimodal Large Language Models (MLLMs) have shown significant potential in design-to-code tasks, they often fail to accurately preserve the layout during code generation. To this end, we draw inspiration from the Chain-of-Thought (CoT) reasoning in human cognition and propose LaTCoder, a novel approach that enhances layout preservation in webpage design during code generation with Layout-as-Thought (LaT). Specifically, we first introduce a simple yet efficient algorithm to divide the webpage design into image blocks. Next, we prompt MLLMs using a CoT-based approach to generate code for each block. Finally, we apply two assembly strategies (absolute positioning and an MLLM-based method), followed by dynamic selection to determine the optimal output. We evaluate the effectiveness of LaTCoder using multiple backbone MLLMs (i.e., DeepSeek-VL2, Gemini, and GPT-4o) on both a public benchmark and a newly introduced, more challenging benchmark (CC-HARD) that features complex layouts. The experimental results on automatic metrics demonstrate significant improvements. Specifically, TreeBLEU scores increased by 66.67% and MAE decreased by 38% when using DeepSeek-VL2, compared to direct prompting. Moreover, the human preference evaluation results indicate that annotators favor the webpages generated by LaTCoder in over 60% of cases, providing strong evidence of the effectiveness of our method.
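The assembly and selection steps above could be sketched as follows. `Block`, `assemble_absolute`, and `select_best` are hypothetical names for illustration; in practice the MLLM-based assembly would be a second candidate-producing path, and `layout_error` would compare a rendered screenshot against the original design (e.g. via MAE) rather than inspect the HTML string.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Block:
    bbox: Tuple[int, int, int, int]  # (left, top, width, height) in the design
    html: str                        # code the MLLM generated for this block

def assemble_absolute(blocks: List[Block], page_w: int, page_h: int) -> str:
    """Absolute-positioning assembly: pin each block's generated code
    at its bounding box inside a relatively positioned page container."""
    divs = []
    for b in blocks:
        x, y, w, h = b.bbox
        divs.append(
            f'<div style="position:absolute;left:{x}px;top:{y}px;'
            f'width:{w}px;height:{h}px;">{b.html}</div>'
        )
    return (
        f'<html><body style="position:relative;width:{page_w}px;'
        f'height:{page_h}px;margin:0;">' + "".join(divs) + "</body></html>"
    )

def select_best(candidates: List[str], layout_error: Callable[[str], float]) -> str:
    """Dynamic selection: keep the candidate page with the lowest
    layout error against the original design."""
    return min(candidates, key=layout_error)
```

Absolute positioning guarantees that every block lands exactly where it appeared in the design, while the MLLM-based assembly can produce more idiomatic flow layout; dynamic selection picks whichever candidate reproduces the design more faithfully.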