CreatiParser: Generative Image Parsing of Raster Graphic Designs into Editable Layers

📅 2026-04-21

📈 Citations: 0

✨ Influential: 0

career value

157K/year

🤖 AI Summary

This work addresses the limitation of existing generative models in producing raster images without editable layer structures, which hinders downstream graphic editing. The authors propose a hybrid generation framework that, for the first time, parses text regions into re-renderable protocols and integrates a vision-language model with an RGBA multi-branch diffusion architecture to separately reconstruct text, background, and sticker layers. To better align with human design preferences, they introduce ParserReward and Group Relative Policy Optimization—two reinforcement learning mechanisms—that substantially enhance controllability and editing flexibility. Evaluated on the Parser-40K and Crello datasets, the method achieves an average performance gain of 23.7% over current state-of-the-art approaches.

Technology Category

Application Category

📝 Abstract

Graphic design images consist of multiple editable layers, such as text, background, and decorative elements, while most generative models produce rasterized outputs without explicit layer structures, limiting downstream editing. Existing graphic design parsing methods typically rely on multi-stage pipelines combining layout prediction, matting, and inpainting, which suffer from error accumulation and limited controllability. We propose a hybrid generative framework for raster-to-layer graphic design parsing that decomposes a design image into editable text, background, and sticker layers. Text regions are parsed using a vision-language model into a text rendering protocol, enabling faithful reconstruction and flexible re-editing, while background and sticker layers are generated using a multi-branch diffusion architecture with RGBA support. We further introduce ParserReward and integrate it with Group Relative Policy Optimization to align generation quality with human design preferences. Extensive experiments on two challenging datasets, \emph{i.e.,} the Parser-40K and Crello datasets, demonstrate superior performance over existing methods, \emph{eg.,} achieving an overall average improvement of 23.7\% across all metrics.

Problem

Research questions and friction points this paper is trying to address.

graphic design parsing

raster-to-layer decomposition

editable layers

vision-language model

diffusion architecture

Innovation

Methods, ideas, or system contributions that make the work stand out.

generative image parsing

vision-language model

multi-branch diffusion