A Comprehensive Study of Decoder-Only LLMs for Text-to-Image Generation

📅 2025-06-09

📈 Citations: 0

✨ Influential: 0

career value

168K/year

🤖 AI Summary

This work addresses the text encoder bottleneck in text-to-image diffusion models by systematically investigating the feasibility of replacing conventional T5/CLIP encoders with modern decoder-only large language models (LLMs). We propose a cross-layer normalized averaging embedding strategy to overcome the limited representational capacity of single-layer (e.g., final-layer) LLM embeddings. A standardized training and evaluation pipeline is established, integrating 12 mainstream decoder-only LLMs (e.g., Llama, Phi, Gemma), employing multi-layer embedding fusion and LayerNorm-weighted averaging. On MS-COCO and DrawBench, the majority of 27 evaluated LLM-based models surpass the T5 baseline, achieving an average +2.3% improvement in CLIP-Score and demonstrating superior comprehension of complex prompts and enhanced vision-language alignment. This study provides the first empirical evidence that LLMs’ hierarchical semantic representations critically influence text-to-image generation quality—establishing a novel paradigm for LLM-driven multimodal encoder design.

Technology Category

Application Category

📝 Abstract

Both text-to-image generation and large language models (LLMs) have made significant advancements. However, many text-to-image models still employ the somewhat outdated T5 and CLIP as their text encoders. In this work, we investigate the effectiveness of using modern decoder-only LLMs as text encoders for text-to-image diffusion models. We build a standardized training and evaluation pipeline that allows us to isolate and evaluate the effect of different text embeddings. We train a total of 27 text-to-image models with 12 different text encoders to analyze the critical aspects of LLMs that could impact text-to-image generation, including the approaches to extract embeddings, different LLMs variants, and model sizes. Our experiments reveal that the de facto way of using last-layer embeddings as conditioning leads to inferior performance. Instead, we explore embeddings from various layers and find that using layer-normalized averaging across all layers significantly improves alignment with complex prompts. Most LLMs with this conditioning outperform the baseline T5 model, showing enhanced performance in advanced visio-linguistic reasoning skills.

Problem

Research questions and friction points this paper is trying to address.

Evaluating decoder-only LLMs as text encoders for image generation

Comparing embedding extraction methods for better prompt alignment

Assessing LLM variants and sizes for improved text-to-image performance

Innovation

Methods, ideas, or system contributions that make the work stand out.

Decoder-only LLMs as text encoders

Layer-normalized averaging for embeddings

Standardized training and evaluation pipeline

🔎 Similar Papers

Unified Text-to-Image Generation and Retrieval