CANVAS: A Benchmark for Vision-Language Models on Tool-Based User Interface Design

📅 2025-11-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision-language models (VLMs) lack systematic evaluation benchmarks for tool-driven UI design tasks—such as those performed in Figma or Sketch—hindering progress in professional design automation. Method: We introduce CANVAS, the first benchmark tailored for tool-augmented VLMs in UI design, comprising 598 context-aware design tasks across 30 mobile UI scenarios. It explicitly distinguishes "design replication" and "design modification" tasks, built upon real-world UI data with human-annotated ground truth. Our methodology employs a context-aware VLM–tool API co-execution framework for precise tool invocation. Contribution/Results: Experiments demonstrate that state-of-the-art VLMs exhibit nascent capability in collaborative tool-based design; however, strategic tool selection and design quality remain critical bottlenecks. CANVAS provides a reproducible, empirically grounded evaluation framework to advance VLM interaction, iterative refinement, and integration within professional design software.

📝 Abstract
User interface (UI) design is an iterative process in which designers progressively refine their work with design software such as Figma or Sketch. Recent advances in vision-language models (VLMs) with tool invocation suggest these models can operate design software to edit a UI design through iteration. Understanding and enhancing this capacity is important, as it highlights VLMs' potential to collaborate with designers within conventional software. However, because no existing benchmark evaluates tool-based design performance, this capacity remains unknown. To address this, we introduce CANVAS, a benchmark for VLMs on tool-based user interface design. Our benchmark contains 598 tool-based design tasks paired with ground-truth references sampled from 3.3K mobile UI designs across 30 function-based categories (e.g., onboarding, messaging). In each task, a VLM updates the design step by step through context-based tool invocations (e.g., create a rectangle as a button background) linked to design software. Specifically, CANVAS incorporates two task types: (i) design replication evaluates the ability to reproduce a whole UI screen; (ii) design modification evaluates the ability to modify a specific part of an existing screen. Results suggest that leading models exhibit more strategic tool invocations, improving design quality. Furthermore, we identify common error patterns models exhibit, guiding future work on enhancing tool-based design capabilities.
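The abstract describes a VLM updating a design step by step through context-based tool invocations linked to design software. A minimal sketch of what one such validated invocation step could look like is below; the tool names (`create_rectangle`, `create_text`), parameter schema, and canvas representation are illustrative assumptions, not CANVAS's actual API.

```python
# Hypothetical tool registry: each tool the model may invoke, with its
# required parameters. Purely illustrative of a tool-invocation loop.
TOOLS = {
    "create_rectangle": {"params": ["x", "y", "width", "height", "fill"]},
    "create_text": {"params": ["x", "y", "content", "font_size"]},
}

def apply_tool_call(canvas_state, call):
    """Validate a model-issued tool call and append it to the canvas state.

    canvas_state: list of prior tool calls (a stand-in for the design file).
    call: dict with "tool" (name) and "args" (parameter dict).
    """
    name, args = call["tool"], call["args"]
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    missing = [p for p in TOOLS[name]["params"] if p not in args]
    if missing:
        raise ValueError(f"{name} missing params: {missing}")
    canvas_state.append({"tool": name, "args": args})
    return canvas_state

# The abstract's example step: "create a rectangle as a button background".
state = apply_tool_call([], {
    "tool": "create_rectangle",
    "args": {"x": 16, "y": 640, "width": 328, "height": 48, "fill": "#3478F6"},
})
```

In a full pipeline, the updated canvas state (or a render of it) would be fed back to the VLM as context for the next invocation, which is what makes the process iterative.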
Problem

Research questions and friction points this paper is trying to address.

No existing benchmark evaluates vision-language models' tool-based UI design capabilities
Unknown how well VLMs can replicate and modify mobile interface designs via tools
Error patterns in tool invocation for design tasks remain unidentified
Innovation

Methods, ideas, or system contributions that make the work stand out.

CANVAS benchmark evaluates VLM tool-based UI design
Tests design replication and modification through tool invocations
Identifies strategic tool usage patterns and error types
Daeheon Jeong
KAIST
Seoyeon Byun
Korea University
Kihoon Son
KAIST
Human-Computer Interaction · Creativity Support Tool · Generative Agent
Dae Hyun Kim
Yonsei University
Juho Kim
KAIST