🤖 AI Summary
This work addresses a limitation of existing generative models: they produce raster images without an editable layer structure, which hinders downstream graphic editing. The authors propose a hybrid generative framework for raster-to-layer parsing that decomposes a design image into text, background, and sticker layers. A vision-language model parses text regions into a re-renderable text rendering protocol, while a multi-branch diffusion architecture with RGBA support generates the background and sticker layers. To align generation quality with human design preferences, they introduce ParserReward and integrate it with Group Relative Policy Optimization (GRPO), a reinforcement learning method. On the Parser-40K and Crello datasets, the method reports an overall average improvement of 23.7% across all metrics over existing methods.
📝 Abstract
Graphic design images consist of multiple editable layers, such as text, background, and decorative elements, yet most generative models produce rasterized outputs without explicit layer structures, limiting downstream editing. Existing graphic design parsing methods typically rely on multi-stage pipelines combining layout prediction, matting, and inpainting, which suffer from error accumulation and limited controllability. We propose a hybrid generative framework for raster-to-layer graphic design parsing that decomposes a design image into editable text, background, and sticker layers. Text regions are parsed using a vision-language model into a text rendering protocol, enabling faithful reconstruction and flexible re-editing, while background and sticker layers are generated using a multi-branch diffusion architecture with RGBA support. We further introduce ParserReward and integrate it with Group Relative Policy Optimization to align generation quality with human design preferences. Extensive experiments on two challenging datasets, i.e., the Parser-40K and Crello datasets, demonstrate superior performance over existing methods, e.g., achieving an overall average improvement of 23.7% across all metrics.
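As a rough illustration of the group-relative step in Group Relative Policy Optimization, the sketch below normalizes a set of reward scores against the mean and standard deviation of their own sampling group. It is not the authors' implementation: the function name `group_relative_advantages`, the `eps` stabilizer, the use of the population standard deviation, and the example scores are all assumptions made for illustration, and the scores merely stand in for whatever ParserReward would return.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other samples in its group.

    rewards: scores for several candidate outputs sampled for the SAME input.
    eps: small constant (an assumption here) that avoids division by zero
         when every candidate receives the same reward.
    """
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; a sample std is another common choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical reward scores for four candidate parses of one design image.
scores = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(scores)

# Candidates above the group mean get positive advantages, the rest negative.
for score, adv in zip(scores, advantages):
    print(f"reward={score:.2f} advantage={adv:+.3f}")
```

Because each advantage is measured relative to the group rather than to a learned value function, candidates are ranked only against alternatives generated for the same input image.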