UI2Code^N: A Visual Language Model for Test-Time Scalable Interactive UI-to-Code Generation

📅 2025-11-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current vision-language models (VLMs) face two key bottlenecks in UI-to-code generation: limited multimodal encoding capacity and insufficient modeling of iterative visual feedback inherent in real-world development workflows. To address these, we propose an interactive, multi-turn UI-to-code generation paradigm—the first to enable test-time scalable, visual-feedback-driven decoding. Our approach introduces an end-to-end VLM architecture optimized through three sequential stages: large-scale multimodal pretraining, instruction tuning, and reinforcement learning—unifying support for code generation, UI editing, and interface refinement. Experiments demonstrate state-of-the-art performance among open-source models on multiple UI-to-code and interface refinement benchmarks, matching or exceeding the capabilities of top proprietary models—including Claude-4-Sonnet and GPT-5—while substantially raising the upper bound on generation quality.

📝 Abstract
User interface (UI) programming is a core yet highly complex part of modern software development. Recent advances in visual language models (VLMs) highlight the potential of automatic UI coding, but current approaches face two key limitations: multimodal coding capabilities remain underdeveloped, and single-turn paradigms make little use of iterative visual feedback. We address these challenges with an interactive UI-to-code paradigm that better reflects real-world workflows and raises the upper bound of achievable performance. Under this paradigm, we present UI2Code^N, a visual language model trained through staged pretraining, fine-tuning, and reinforcement learning to achieve foundational improvements in multimodal coding. The model unifies three key capabilities: UI-to-code generation, UI editing, and UI polishing. We further explore test-time scaling for interactive generation, enabling systematic use of multi-turn feedback. Experiments on UI-to-code and UI polishing benchmarks show that UI2Code^N establishes a new state of the art among open-source models and achieves performance comparable to leading closed-source models such as Claude-4-Sonnet and GPT-5. Our code and models are available at https://github.com/zai-org/UI2Code_N.
Problem

Research questions and friction points this paper is trying to address.

Developing multimodal coding capabilities for automating UI programming
Enabling iterative visual feedback through interactive UI-to-code generation
Improving UI coding performance by unifying generation, editing, and polishing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Interactive UI-to-code paradigm using visual language model
Staged training with pretraining, fine-tuning, reinforcement learning
Test-time scaling for multi-turn feedback utilization
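The interactive paradigm above can be pictured as a render-and-refine loop: generate code, compare the rendered result against the target UI, and feed the visual discrepancy back into the next turn, with more turns buying more refinement at test time. The sketch below is purely illustrative, not the paper's implementation; `generate_code` and `visual_feedback` are hypothetical stand-ins for the VLM and the renderer-plus-comparison step.

```python
# Illustrative sketch of a multi-turn, visual-feedback-driven UI-to-code loop.
# All functions are hypothetical stand-ins, not the paper's actual components.

def generate_code(target_desc, prev_code=None, feedback=None):
    """Stand-in for the VLM: drafts UI code, or revises it given feedback."""
    if prev_code is None:
        return f"<div>{target_desc}</div>"  # initial draft
    # On later turns, revise the previous code using the feedback.
    return prev_code.replace("<div>", "<div class='polished'>", 1)

def visual_feedback(code, target_desc):
    """Stand-in for rendering the code and comparing it to the target UI.
    Returns (score, hint); score 1.0 means the render matches the target."""
    if "polished" in code:
        return 1.0, None
    return 0.5, "layout differs from target"

def interactive_ui_to_code(target_desc, max_turns=4):
    """Test-time scaling knob: more turns allow more feedback-driven fixes."""
    code, feedback = None, None
    for _ in range(max_turns):
        code = generate_code(target_desc, prev_code=code, feedback=feedback)
        score, feedback = visual_feedback(code, target_desc)
        if score >= 1.0:  # render matches the target; stop early
            break
    return code

print(interactive_ui_to_code("login form"))
```

Raising `max_turns` is the test-time scaling axis: each extra turn gives the model another chance to act on visual feedback, which is why multi-turn decoding can exceed the single-turn quality ceiling.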
Zhen Yang
Department of Computer Science and Technology, Tsinghua University
Wenyi Hong
Tsinghua University
Mingde Xu
Zhipu AI
Xinyue Fan
Zhipu AI
Weihan Wang
Zhipu AI
Jiele Cheng
Department of Computer Science and Technology, Tsinghua University
Xiaotao Gu
Zhipu AI
Language Modeling · Generative Models · Data Mining
Jie Tang
Tsinghua University