Rethinking Layered Graphic Design Generation with a Top-Down Approach

📅 2025-07-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses two key challenges in AI-generated graphic designs delivered as flat pixel images: limited editability and nonsensical generated text. The authors propose Accordion, a framework that makes a first attempt at converting an AI-generated design image into an editable layered design while refining meaningless AI-generated text into meaningful alternatives guided by user prompts. The method adopts a top-down, three-stage paradigm orchestrated by a vision-language model (VLM), which coordinates vision experts (SAM, an element removal model, and a customized inpainting model) to perform layer decomposition, content reconstruction, and text regeneration, all guided globally by the visually harmonious reference image. In contrast to bottom-up methods such as COLE and Open-COLE that assemble elements incrementally, Accordion decomposes layers from the reference image. It is trained on the in-house Design39K dataset augmented with AI-generated design images paired with refined ground truth. On the DesignIntention benchmark, Accordion outperforms prior methods on text-to-template generation, adding text to a background, and text de-rendering, and a user study with designers supports its practical utility in improving both editability and creative quality.

📝 Abstract
Graphic design is crucial for conveying ideas and messages. Designers usually organize their work into objects, backgrounds, and vectorized text layers to simplify editing. However, this workflow demands considerable expertise. With the rise of GenAI methods, an endless supply of high-quality graphic designs in pixel format has become more accessible, though these designs often lack editability. Despite this, non-layered designs still inspire human designers, influencing their choices in layouts and text styles, ultimately guiding the creation of layered designs. Motivated by this observation, we propose Accordion, a graphic design generation framework taking the first attempt to convert AI-generated designs into editable layered designs, meanwhile refining nonsensical AI-generated text with meaningful alternatives guided by user prompts. It is built around a vision language model (VLM) playing distinct roles in three curated stages. For each stage, we design prompts to guide the VLM in executing different tasks. Distinct from existing bottom-up methods (e.g., COLE and Open-COLE) that gradually generate elements to create layered designs, our approach works in a top-down manner by using the visually harmonious reference image as global guidance to decompose each layer. Additionally, it leverages multiple vision experts such as SAM and element removal models to facilitate the creation of graphic layers. We train our method using the in-house graphic design dataset Design39K, augmented with AI-generated design images coupled with refined ground truth created by a customized inpainting model. Experimental results and user studies by designers show that Accordion generates favorable results on the DesignIntention benchmark, including tasks such as text-to-template, adding text to background, and text de-rendering, and also excels in creating design variations.
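The top-down, three-stage pipeline described in the abstract can be sketched as an orchestration loop that peels layers off the flat reference image from front to back, inpainting what each removed element reveals. This is a hypothetical illustration only: every function name, the stubbed VLM plan, and the string placeholders are assumptions for clarity, not the paper's actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Layer:
    kind: str      # "text", "object", or "background"
    content: str   # placeholder for pixel/vector content

@dataclass
class LayeredDesign:
    layers: list = field(default_factory=list)

def vlm_plan(reference_image: str) -> list:
    """Stage 1 (stub): a VLM inspects the flat reference design and plans
    which elements to peel off, front to back."""
    # A real system would query a vision-language model; here we return
    # a plausible fixed plan: text first, then objects, background last.
    return [("text", "HEADLINE"), ("object", "logo"), ("background", "gradient")]

def segment_element(image: str, element: str) -> str:
    """Stage 2a (stub): a segmentation expert (e.g. SAM) masks one element."""
    return f"mask({element})"

def remove_and_inpaint(image: str, mask: str) -> str:
    """Stage 2b (stub): an element-removal/inpainting expert reconstructs
    the content hidden behind the removed element."""
    return f"inpainted({image} - {mask})"

def refine_text(prompt: str, raw_text: str) -> str:
    """Stage 3 (stub): the VLM swaps nonsensical AI-generated text for a
    meaningful alternative grounded in the user prompt."""
    return f"{prompt}: {raw_text}"

def decompose(reference_image: str, user_prompt: str) -> LayeredDesign:
    """Top-down decomposition: peel layers off the reference image, using
    the full image as global guidance at every step."""
    design = LayeredDesign()
    current = reference_image
    for kind, name in vlm_plan(reference_image):
        if kind == "background":
            # Whatever remains after all removals is the background layer.
            design.layers.append(Layer(kind, current))
            continue
        mask = segment_element(current, name)
        content = refine_text(user_prompt, name) if kind == "text" else name
        design.layers.append(Layer(kind, content))
        current = remove_and_inpaint(current, mask)  # reveal what's beneath
    return design
```

The key design choice the sketch captures is the top-down order: rather than generating elements bottom-up and compositing them, each expert operates on the intact, visually harmonious reference, so global layout and style guide every decomposition step.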
Problem

Research questions and friction points this paper is trying to address.

Convert AI-generated pixel designs into editable layered formats
Refine nonsensical AI text with meaningful user-guided alternatives
Generate layered designs using top-down visual reference guidance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Top-down layered design generation framework
Vision-language model orchestrating three curated stages
Multiple vision experts (e.g., SAM) for layer creation