🤖 AI Summary
To address the challenge of ensuring global consistency in text-to-image generation—without requiring architectural redesign—this paper proposes the Autoregressive Representation Alignment (ARRA) framework, enabling generic large language models (LLMs) to perform cross-modal globally consistent generation without modifying their native architecture. The method introduces: (1) a global visual alignment loss that implicitly enforces spatial and semantic coherence; (2) a hybrid token mechanism jointly optimizing local pixel prediction and global semantic constraints; and (3) vision foundation model distillation for efficient cross-modal representation alignment. Evaluated on MIMIC-CXR, DeepEyeNet, and ImageNet, ARRA achieves FID improvements of 25.5%, 8.8%, and 7.5%, respectively; in medical imaging, it outperforms direct fine-tuning by 18.6%. The framework supports plug-and-play training, offering a lightweight, architecture-agnostic solution for consistent multimodal generation.
📝 Abstract
We present Autoregressive Representation Alignment (ARRA), a new training framework that unlocks globally coherent text-to-image generation in autoregressive LLMs without architectural changes. Unlike prior work that requires complex architectural redesigns, ARRA aligns LLM hidden states with visual representations from external visual foundation models via a global visual alignment loss and a hybrid token. This token enforces dual constraints: local next-token prediction and global semantic distillation, enabling LLMs to implicitly learn spatial and contextual coherence while retaining their original autoregressive paradigm. Extensive experiments validate ARRA's plug-and-play versatility. When training from text-generation-only LLMs or from random initialization, ARRA reduces FID by 25.5% (MIMIC-CXR), 8.8% (DeepEyeNet), and 7.5% (ImageNet) for advanced autoregressive LLMs like Chameleon and LlamaGen, all without framework modifications. For domain adaptation, ARRA aligns general-purpose LLMs with specialized models (e.g., BioMedCLIP), achieving an 18.6% FID reduction over direct fine-tuning on medical imaging (MIMIC-CXR). By demonstrating that training objective redesign -- not just architectural innovation -- can resolve cross-modal global coherence challenges, ARRA offers a complementary paradigm for advancing autoregressive models. Code and models will be released to advance autoregressive image generation.
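The dual-constraint objective described above can be pictured as a weighted sum of a standard next-token cross-entropy term and a global alignment term that pulls the LLM's hidden state at the hybrid token toward a frozen vision-foundation-model embedding of the target image. The following is a minimal NumPy sketch under that reading; the function names, cosine-distance formulation, and the `lam` weight are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax, stabilized by subtracting the per-row max."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def next_token_loss(logits, targets):
    """Local constraint: standard autoregressive cross-entropy.

    logits:  (seq_len, vocab_size) predictions for each position
    targets: (seq_len,) ground-truth next-token ids
    """
    probs = softmax(logits)
    return -float(np.mean(np.log(probs[np.arange(len(targets)), targets] + 1e-12)))

def global_alignment_loss(hidden, visual_embed):
    """Global constraint: cosine distance between the LLM hidden state at
    the hybrid token and a frozen vision-foundation-model image embedding
    (e.g., from BioMedCLIP in the medical setting)."""
    h = hidden / np.linalg.norm(hidden)
    v = visual_embed / np.linalg.norm(visual_embed)
    return 1.0 - float(h @ v)

def arra_style_loss(logits, targets, hidden, visual_embed, lam=0.5):
    """Hybrid objective: local pixel/token prediction plus global
    semantic distillation, weighted by an assumed scalar `lam`."""
    return next_token_loss(logits, targets) + lam * global_alignment_loss(hidden, visual_embed)
```

In this sketch the vision encoder is frozen and only supplies targets for the alignment term, which matches the paper's claim that no architectural changes to the LLM itself are needed: the extra term lives purely in the training objective.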