VisRefiner: Learning from Visual Differences for Screenshot-to-Code Generation

📅 2026-02-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing approaches to screenshot-to-code generation often lack perceptual awareness of the visual fidelity of the generated code, making it difficult to accurately reconstruct interface layouts and styling. This work proposes VisRefiner, a training framework that, for the first time, leverages the visual discrepancy between rendered outputs and target designs as a supervisory signal, guiding multimodal large language models to learn how visual differences map to code modifications. A subsequent reinforcement learning stage equips the model for self-iterative refinement, which both raises single-step generation quality and layout fidelity and endows the model with human-like self-debugging behavior, mirroring how developers iteratively adjust code to match visual specifications.

📝 Abstract
Screenshot-to-code generation aims to translate user interface screenshots into executable frontend code that faithfully reproduces the target layout and style. Existing multimodal large language models perform this mapping directly from screenshots but are trained without observing the visual outcomes of their generated code. In contrast, human developers iteratively render their implementation, compare it with the design, and learn how visual differences relate to code changes. Inspired by this process, we propose VisRefiner, a training framework that enables models to learn from visual differences between rendered predictions and reference designs. We construct difference-aligned supervision that associates visual discrepancies with corresponding code edits, allowing the model to understand how appearance variations arise from implementation changes. Building on this, we introduce a reinforcement learning stage for self-refinement, where the model improves its generated code by observing both the rendered output and the target design, identifying their visual differences, and updating the code accordingly. Experiments show that VisRefiner substantially improves single-step generation quality and layout fidelity, while also endowing models with strong self-refinement ability. These results demonstrate the effectiveness of learning from visual differences for advancing screenshot-to-code generation.
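The render-compare-edit cycle the abstract describes can be sketched as a simple loop. Everything below is a hypothetical stand-in, not the paper's actual components: `render`, `visual_diff`, and `refine` are toy placeholders (the real system renders HTML, compares screenshots perceptually, and edits code with a multimodal model), shown only to make the control flow concrete.

```python
# Hypothetical sketch of a VisRefiner-style self-refinement loop.
# All functions are simplified stand-ins, not the paper's method.

def render(code: str) -> str:
    """Stand-in renderer: treat the code string itself as the rendered output."""
    return code

def visual_diff(rendered: str, target: str) -> float:
    """Stand-in visual-difference metric: fraction of mismatched characters."""
    n = max(len(rendered), len(target))
    if n == 0:
        return 0.0
    matches = sum(a == b for a, b in zip(rendered, target))
    return 1.0 - matches / n

def refine(code: str, target: str) -> str:
    """Stand-in model edit: fix the first position that disagrees with the target."""
    fixed = list(code.ljust(len(target)))[: len(target)]
    for i, ch in enumerate(target):
        if fixed[i] != ch:
            fixed[i] = ch
            break
    return "".join(fixed)

def self_refine(initial_code: str, target_design: str,
                max_steps: int = 10, tol: float = 0.0) -> str:
    """Iterate: render the code, measure the visual difference against the
    target design, and apply an edit until the difference is within tolerance."""
    code = initial_code
    for _ in range(max_steps):
        if visual_diff(render(code), target_design) <= tol:
            break
        code = refine(code, target_design)
    return code
```

In the paper's actual setup the "edit" step is produced by the trained model conditioned on both images, and the reward for reinforcement learning is derived from the visual difference; the loop structure is the same.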
Problem

Research questions and friction points this paper is trying to address.

screenshot-to-code generation
visual differences
layout fidelity
code generation
frontend code
Innovation

Methods, ideas, or system contributions that make the work stand out.

visual difference learning
screenshot-to-code generation
difference-aligned supervision
self-refinement
reinforcement learning