CreatiParser: Generative Image Parsing of Raster Graphic Designs into Editable Layers

📅 2026-04-21

🤖 AI Summary
Most generative models output flat raster images with no explicit layer structure, which limits downstream graphic editing. The authors propose a hybrid generative framework that decomposes a design image into editable text, background, and sticker layers: a vision-language model parses text regions into a re-renderable text rendering protocol, and a multi-branch diffusion architecture with RGBA support generates the background and sticker layers. To align generation quality with human design preferences, they introduce ParserReward and integrate it with Group Relative Policy Optimization. On the Parser-40K and Crello datasets, the method reports an overall average improvement of 23.7% across all metrics over existing methods.
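
As a rough illustration of the "group relative" idea behind Group Relative Policy Optimization, the sketch below normalizes reward scores within one group of candidates sampled for the same input. The function name, the use of the population standard deviation, and the reward values are assumptions made for illustration; the summary does not describe how ParserReward scores are computed or how the authors apply them.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples drawn for the same input.

    Each advantage is the reward minus the group mean, divided by the group
    standard deviation (population std here, which is an assumption).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical reward scores for four candidate decompositions of one image.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
best = advantages.index(max(advantages))  # index of the highest-scoring candidate
```

Candidates scoring above the group mean get positive advantages and those below get negative ones, so no separately learned value baseline is needed for this step.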

📝 Abstract
Graphic design images consist of multiple editable layers, such as text, background, and decorative elements, while most generative models produce rasterized outputs without explicit layer structures, limiting downstream editing. Existing graphic design parsing methods typically rely on multi-stage pipelines combining layout prediction, matting, and inpainting, which suffer from error accumulation and limited controllability. We propose a hybrid generative framework for raster-to-layer graphic design parsing that decomposes a design image into editable text, background, and sticker layers. Text regions are parsed using a vision-language model into a text rendering protocol, enabling faithful reconstruction and flexible re-editing, while background and sticker layers are generated using a multi-branch diffusion architecture with RGBA support. We further introduce ParserReward and integrate it with Group Relative Policy Optimization to align generation quality with human design preferences. Extensive experiments on two challenging datasets, i.e., the Parser-40K and Crello datasets, demonstrate superior performance over existing methods, e.g., achieving an overall average improvement of 23.7% across all metrics.
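
To make the role of RGBA layers concrete, here is a minimal sketch of how separate layers can be flattened back into a single raster pixel with the standard "over" alpha-compositing operator. The bottom-to-top order (background, sticker, text), the straight (non-premultiplied) alpha convention, and all pixel values are assumptions for illustration; the abstract does not specify how the parsed layers are recomposed.

```python
def over(top, bottom):
    """Composite one straight-alpha RGBA pixel over another (channels in [0, 1])."""
    tr, tg, tb, ta = top
    br, bg, bb, ba = bottom
    out_a = ta + ba * (1.0 - ta)
    if out_a == 0.0:
        # Both pixels are fully transparent.
        return (0.0, 0.0, 0.0, 0.0)

    def blend(t, b):
        return (t * ta + b * ba * (1.0 - ta)) / out_a

    return (blend(tr, br), blend(tg, bg), blend(tb, bb), out_a)


def flatten(layers):
    """Composite a bottom-to-top list of RGBA pixels into one pixel."""
    result = (0.0, 0.0, 0.0, 0.0)
    for pixel in layers:
        result = over(pixel, result)
    return result


background = (1.0, 1.0, 1.0, 1.0)  # opaque white
sticker = (1.0, 0.0, 0.0, 0.5)     # half-transparent red
text = (0.0, 0.0, 0.0, 0.0)        # no glyph covers this pixel
pixel = flatten([background, sticker, text])
```

Because each layer keeps its own alpha channel, editing or removing one layer only requires re-running the flatten step, which is the practical benefit of a layered representation over a single raster image.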
Problem

Research questions and friction points this paper is trying to address.

graphic design parsing
raster-to-layer decomposition
editable layers
vision-language model
diffusion architecture
Innovation

Methods, ideas, or system contributions that make the work stand out.

generative image parsing
vision-language model
multi-branch diffusion
editable layers
policy optimization
Weidong Chen
School of Information Science and Technology, University of Science and Technology of China, Hefei 230027, China
Dexiang Hong
ByteDance Inc.
Computer Vision, Deep Learning, Diffusion Model
Zhendong Mao
University of Science and Technology of China
CV, NLP
Yutao Cheng
ByteDance Intelligent Creation, Shanghai, China
Xinyan Liu
School of Computer Science and Technology, Harbin Institute of Technology (Weihai), Weihai, China
Lei Zhang
School of Information Science and Technology, University of Science and Technology of China, Hefei 230027, China
Yongdong Zhang
School of Information Science and Technology, University of Science and Technology of China, Hefei 230027, China; Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei 230027, China