GLDesigner: Leveraging Multi-Modal LLMs as Designer for Enhanced Aesthetic Text Glyph Layouts

📅 2024-11-18
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Text-based logo layout design has long lacked dedicated investigation, with existing methods largely subsumed under generic layout tasks. This paper introduces the first multimodal large vision-language model (VLM) framework tailored for glyph-level layout generation. Our approach comprises three core contributions: (1) a novel glyph-level instruction-tuning paradigm; (2) a large-scale dual-text logo dataset—five times larger than prior benchmarks—featuring fine-grained geometric annotations and natural-language layout descriptions; and (3) a lightweight multi-image encoder coupled with a geometry-semantic joint supervision mechanism enabling parallel processing of multiple glyphs. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art approaches on both geometric aesthetics metrics and human preference evaluations. Moreover, it achieves stable, controllable, and high-aesthetic-quality text logo generation under complex user-specified constraints.

📝 Abstract
Text logo design heavily relies on the creativity and expertise of professional designers, for whom arranging element layouts is one of the most important procedures. However, little attention has been paid to this specific task, which requires taking precise textural details and user constraints into consideration; prior work has instead focused on broader tasks such as document/poster layout generation. In this paper, we propose a VLM-based framework that generates content-aware text logo layouts by integrating multi-modal inputs with user constraints, supporting more flexible and stable layout design in real-world applications. We introduce two model techniques that reduce the computation for processing multiple glyph images simultaneously without performance degradation. To support instruction tuning of our model, we construct two extensive text logo datasets that are 5x larger than the existing public dataset. In addition to geometric annotations (e.g., text masks and character recognition), we also provide comprehensive layout descriptions in natural-language format, enabling more effective training of reasoning ability for complex layouts and custom user constraints. Experimental studies demonstrate the effectiveness of our proposed model and datasets compared with previous methods on various benchmarks evaluating geometric aesthetics and human preferences. The code and datasets will be publicly available.
Problem

Research questions and friction points this paper is trying to address.

Automating text logo layout design with multi-modal inputs
Reducing computational cost for glyph image processing
Enhancing layout generation with large-scale annotated datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

VLM-based framework for text logo layouts
Reduced computational cost techniques
Extensive datasets with layout descriptions
Junwen He
Dalian University of Technology
Yifan Wang
Dalian University of Technology
Lijun Wang
Zhejiang University
Huchuan Lu
Dalian University of Technology
Jun-Yan He
Tongyi Lab, Alibaba Group
Chenyang Li
Institute for Intelligent Computing, Alibaba Group
Hanyuan Chen
Institute for Intelligent Computing, Alibaba Group
Jinpeng Lan
Institute for Intelligent Computing, Alibaba Group
Bin Luo
Institute for Intelligent Computing, Alibaba Group
Yifeng Geng
Institute for Intelligent Computing, Alibaba Group