Exploring the Deep Fusion of Large Language Models and Diffusion Transformers for Text-to-Image Synthesis

📅 2025-05-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of principled design guidelines and poor reproducibility in the deep integration of large language models (LLMs) and diffusion Transformers (DiTs) for text-to-image synthesis. Rather than proposing a new method, the authors conduct a systematic empirical study: controlled comparisons against established baselines, analysis of key design choices in architecture selection and training strategy, and a clear, reproducible recipe for training at scale, with full training details disclosed. The goal is to provide concrete data points and practical guidelines for future research on LLM-DiT co-design in multi-modal generative modeling.

📝 Abstract
This paper does not describe a new method; instead, it provides a thorough exploration of an important yet understudied design space related to recent advances in text-to-image synthesis -- specifically, the deep fusion of large language models (LLMs) and diffusion transformers (DiTs) for multi-modal generation. Previous studies mainly focused on overall system performance rather than detailed comparisons with alternative methods, and key design details and training recipes were often left undisclosed. These gaps create uncertainty about the real potential of this approach. To fill these gaps, we conduct an empirical study on text-to-image generation, performing controlled comparisons with established baselines, analyzing important design choices, and providing a clear, reproducible recipe for training at scale. We hope this work offers meaningful data points and practical guidelines for future research in multi-modal generation.
Problem

Research questions and friction points this paper is trying to address.

The deep-fusion design space of LLMs and DiTs for text-to-image synthesis remains underexplored
Prior studies report overall system performance without detailed comparisons against alternative designs
Key design details and training recipes are often left undisclosed, creating uncertainty about the approach's real potential
Innovation

Methods, ideas, or system contributions that make the work stand out.

Empirical study of the LLM-DiT deep-fusion design space
Controlled comparisons with established baselines
Clear, reproducible recipe for training at scale
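To make the central concept concrete: "deep fusion" typically means conditioning the diffusion Transformer on LLM hidden states inside every block (e.g. via cross-attention), rather than only feeding a pooled text embedding at the input. The sketch below is a minimal, hypothetical illustration of one such fused DiT block in PyTorch; the module name, dimensions, and layer layout are assumptions for illustration, not the architecture studied in the paper.

```python
import torch
import torch.nn as nn

class LLMConditionedDiTBlock(nn.Module):
    """Hypothetical sketch of one 'deep fusion' DiT block: noisy image
    tokens attend to LLM hidden states via cross-attention. Illustrative
    only -- not the paper's actual architecture."""

    def __init__(self, dim: int, llm_dim: int, heads: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Project LLM hidden states into the DiT width before cross-attention.
        self.llm_proj = nn.Linear(llm_dim, dim)
        self.norm2 = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor, llm_states: torch.Tensor) -> torch.Tensor:
        # x: (B, N_img, dim) noisy image tokens
        # llm_states: (B, N_txt, llm_dim) per-token LLM hidden states
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h)[0]          # image tokens mix with each other
        ctx = self.llm_proj(llm_states)
        x = x + self.cross_attn(self.norm2(x), ctx, ctx)[0]  # deep text conditioning
        x = x + self.mlp(self.norm3(x))
        return x

block = LLMConditionedDiTBlock(dim=64, llm_dim=128)
out = block(torch.randn(2, 16, 64), torch.randn(2, 8, 128))
print(out.shape)  # torch.Size([2, 16, 64])
```

The contrast with "shallow" conditioning is that here every block sees the full sequence of LLM token states, so fine-grained semantics in the prompt can steer individual image tokens at every depth.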