Widget2Code: From Visual Widgets to UI Code via Multimodal LLMs

📅 2025-12-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the long-overlooked widget-level UI2Code problem—generating executable frontend code from a single, unannotated widget image under severe constraints: no semantic markup, extreme spatial limitations, and minimal contextual cues. To this end, the authors formally introduce the Widget2Code task and propose the first end-to-end framework that produces visually faithful, layout-compact, and executable code. They contribute (1) WidgetBench, the first purely image-based benchmark dataset for widgets; (2) WidgetDSL, a lightweight domain-specific language with a cross-framework compiler supporting React and HTML/CSS; and (3) a novel architecture integrating icon retrieval, visual module reuse, and adaptive spatial rendering. Extensive experiments demonstrate significant improvements over state-of-the-art baselines across fine-grained, multi-dimensional metrics. This is the first method to achieve high-fidelity, executable widget-level UI-to-code generation, bridging a critical gap in UI2Code research for micro-interfaces.
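The paper does not reproduce WidgetDSL's grammar here, but its core idea, a single framework-agnostic widget tree with one emitter per target framework, can be sketched. Everything below (the `WidgetNode` shape, `toHtml`, `toReact`) is a hypothetical TypeScript illustration of that concept, not the authors' implementation.

```typescript
// Hypothetical sketch of the WidgetDSL idea. Node shape, field names, and the
// emitters below are illustrative assumptions, not the paper's actual grammar.

type WidgetNode =
  | { kind: "stack"; direction: "row" | "column"; gap: number; children: WidgetNode[] }
  | { kind: "text"; content: string; size: number }
  | { kind: "icon"; name: string };

// Emit plain HTML/CSS from the framework-agnostic tree.
function toHtml(node: WidgetNode): string {
  switch (node.kind) {
    case "stack":
      return `<div style="display:flex;flex-direction:${node.direction};gap:${node.gap}px">` +
        node.children.map(toHtml).join("") + `</div>`;
    case "text":
      return `<span style="font-size:${node.size}px">${node.content}</span>`;
    case "icon":
      return `<i class="icon-${node.name}"></i>`;
  }
}

// A second emitter targeting React (JSX source) shows why one DSL tree can
// back multiple front-end frameworks.
function toReact(node: WidgetNode): string {
  switch (node.kind) {
    case "stack":
      return `<div style={{display:"flex",flexDirection:"${node.direction}",gap:${node.gap}}}>` +
        node.children.map(toReact).join("") + `</div>`;
    case "text":
      return `<span style={{fontSize:${node.size}}}>${node.content}</span>`;
    case "icon":
      return `<Icon name="${node.name}" />`;
  }
}

// Example: a weather-widget header with an icon next to a temperature label.
const widget: WidgetNode = {
  kind: "stack",
  direction: "row",
  gap: 8,
  children: [
    { kind: "icon", name: "sun" },
    { kind: "text", content: "21°C", size: 14 },
  ],
};

console.log(toHtml(widget));
console.log(toReact(widget));
```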

📝 Abstract
User interface to code (UI2Code) aims to generate executable code that can faithfully reconstruct a given input UI. Prior work focuses largely on web pages and mobile screens, leaving app widgets underexplored. Unlike web or mobile UIs with rich hierarchical context, widgets are compact, context-free micro-interfaces that summarize key information through dense layouts and iconography under strict spatial constraints. Moreover, while (image, code) pairs are widely available for web or mobile UIs, widget designs are proprietary and lack accessible markup. We formalize this setting as the Widget-to-Code (Widget2Code) task and introduce an image-only widget benchmark with fine-grained, multi-dimensional evaluation metrics. Benchmarking shows that although generalized multimodal large language models (MLLMs) outperform specialized UI2Code methods, they still produce unreliable and visually inconsistent code. To address these limitations, we develop a baseline that jointly advances perceptual understanding and structured code generation. At the perceptual level, we follow widget design principles to assemble atomic components into complete layouts, equipped with icon retrieval and reusable visualization modules. At the system level, we design an end-to-end infrastructure, WidgetFactory, which includes a framework-agnostic widget-tailored domain-specific language (WidgetDSL) and a compiler that translates it into multiple front-end implementations (e.g., React, HTML/CSS). An adaptive rendering module further refines spatial dimensions to satisfy compactness constraints. Together, these contributions substantially enhance visual fidelity, establishing a strong baseline and unified infrastructure for future Widget2Code research.
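The abstract names icon retrieval as part of the perceptual pipeline but does not detail its mechanics. One plausible mechanism is nearest-neighbor lookup over a library of icon embeddings; the sketch below assumes exactly that, and the encoder, the `IconEntry` type, and `retrieveIcon` are illustrative assumptions, not the paper's method.

```typescript
// Minimal sketch of icon retrieval by embedding similarity. All names and the
// cosine-similarity policy here are assumptions for illustration.

type IconEntry = { name: string; embedding: number[] };

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Given the embedding of a cropped icon region from the widget screenshot
// (produced by some vision encoder), return the closest icon in the library.
function retrieveIcon(query: number[], library: IconEntry[]): IconEntry {
  return library.reduce((best, entry) =>
    cosine(query, entry.embedding) > cosine(query, best.embedding) ? entry : best
  );
}

// Example with toy 3-dimensional embeddings.
const library: IconEntry[] = [
  { name: "sun", embedding: [0.9, 0.1, 0.0] },
  { name: "cloud", embedding: [0.1, 0.9, 0.1] },
];
console.log(retrieveIcon([0.8, 0.2, 0.1], library).name); // "sun"
```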
Problem

Research questions and friction points this paper is trying to address.

Generating executable code from compact, context-free widget images that lack accessible markup.
Addressing the unreliable and visually inconsistent code produced by multimodal large language models.
Improving visual fidelity and structured code generation for widget-to-code translation.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal LLMs for widget visual understanding
WidgetDSL, a framework-agnostic domain-specific language
Adaptive rendering for compact spatial constraints (see the sketch after this list)
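The adaptive rendering module is described only at a high level: it "refines spatial dimensions to satisfy compactness constraints." A minimal sketch of that idea, assuming a simple uniform-downscale policy with a legibility floor on font size, might look like the following; `fitToWidget` and its thresholds are hypothetical, not the paper's procedure.

```typescript
// Sketch of adaptive rendering: uniformly rescale a rendered layout so it fits
// the widget's fixed footprint. The clamping policy is an assumption.

interface Box { width: number; height: number; fontSize: number }

function fitToWidget(content: Box, target: { width: number; height: number }): Box {
  // Scale down just enough that both dimensions fit; never scale up.
  const scale = Math.min(1, target.width / content.width, target.height / content.height);
  return {
    width: Math.round(content.width * scale),
    height: Math.round(content.height * scale),
    // Keep text legible by flooring the font size at 10px (assumed threshold).
    fontSize: Math.max(10, Math.round(content.fontSize * scale)),
  };
}

// Example: shrink a 400x180 rendering into a 320x120 small-widget slot.
console.log(fitToWidget({ width: 400, height: 180, fontSize: 16 }, { width: 320, height: 120 }));
```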
Authors

Houston H. Zhang
McMaster University
Tao Zhang
University of Toronto
Baoze Lin
McMaster University
Yuanqi Xue
McMaster University
Yincheng Zhu
University of Waterloo
Huan Liu
McMaster University
Li Gu
Concordia University
Linfeng Ye
University of Toronto
Information Theory, Computer Vision, Computational Pathology
Ziqiang Wang
Concordia University
Computer Vision
Xinxin Zuo
Concordia University
Deep Learning, Computer Vision, Multimedia, Computer Graphics
Yang Wang
Concordia University
Yuanhao Yu
McMaster University
Zhixiang Chi
University of Toronto
Computer Vision, Machine Learning