🤖 AI Summary
Existing design generation methods typically assume that input elements are stylistically consistent, so they struggle with heterogeneous, visually incoherent components drawn from different sources. To address this limitation, this work proposes GIST, a training-free, identity-preserving image compositor that sits between layout prediction and typography generation and harmonizes the heterogeneous elements, improving overall visual coherence. GIST introduces, for the first time, identity-preserving stylization and compositing into the components-to-design pipeline, and it can be plugged into existing design systems without modification. Evaluations with multimodal large language models (LLaVA-OV and GPT-4V), using aspect-wise ratings and pairwise preference, show that GIST outperforms naive pasting when integrated with LaDeCo and Design-o-meter, with marked improvements in both aesthetic quality and visual harmony.
📝 Abstract
Graphic design creation involves harmoniously assembling multimodal components, such as images, text, logos, and other visual assets collected from diverse sources, into a visually appealing and cohesive design. Recent methods have largely focused on layout prediction or complementary element generation while retaining input elements exactly, implicitly assuming that the provided components are already stylistically harmonious. In practice, inputs often come from disparate sources and exhibit visual mismatch, which makes this assumption limiting. We argue that identity-preserving stylization and compositing of input elements is a critical missing ingredient for truly harmonized components-to-design pipelines. To this end, we propose GIST, a training-free, identity-preserving image compositor that sits between layout prediction and typography generation and can be plugged into any existing components-to-design or design-refining pipeline without modification. We demonstrate this by integrating GIST with two substantially different existing methods, LaDeCo and Design-o-meter. GIST shows significant improvements in visual harmony and aesthetic quality across both pipelines, as validated by LLaVA-OV and GPT-4V on aspect-wise ratings and pairwise preference over naive pasting. Project Page: abhinav-mahajan10.github.io/GIST/.
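To make the "sits between layout prediction and typography generation" placement concrete, the following is a minimal toy sketch of a three-stage pipeline with a swappable compositing stage. It is not the GIST implementation: every name here (`Element`, `predict_layout`, `naive_paste`, `harmonize`, `add_typography`, `run_pipeline`) is hypothetical, a string `style` field stands in for an element's visual appearance, and `harmonize` only mimics the interface of an identity-preserving compositor (identity and placement kept, style unified).

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Element:
    name: str                                          # stand-in for the element's identity
    style: str                                         # stand-in for its visual appearance
    box: Optional[Tuple[int, int, int, int]] = None    # (x, y, w, h), set by the layout stage


Design = List[Element]
Stage = Callable[[Design], Design]


def predict_layout(elements: Design) -> Design:
    # Hypothetical layout stage: stack the elements vertically.
    return [replace(e, box=(0, 40 * i, 100, 40)) for i, e in enumerate(elements)]


def naive_paste(elements: Design) -> Design:
    # Baseline compositor: elements are placed exactly as provided.
    return list(elements)


def harmonize(elements: Design) -> Design:
    # Placeholder for an identity-preserving compositor (assumption, not GIST itself):
    # name and box are untouched, only the style is unified.
    return [replace(e, style="shared") for e in elements]


def add_typography(elements: Design) -> Design:
    # Hypothetical typography stage: append one text element on top of the composite.
    return list(elements) + [Element("headline", "text", (0, 0, 100, 30))]


def run_pipeline(elements: Design, compositor: Stage = naive_paste) -> Design:
    # The compositor slots in between layout and typography; the other stages are unchanged.
    return add_typography(compositor(predict_layout(elements)))
```

Under these assumptions, switching from the baseline to the harmonizing stage is a one-argument change, for example `run_pipeline(inputs, compositor=harmonize)`, which reflects the plug-in claim: the layout and typography stages do not need to know which compositor is in use.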