🤖 AI Summary
This work addresses the lack of principled design guidelines and poor reproducibility in the deep fusion of large language models (LLMs) and diffusion Transformers (DiTs) for text-to-image synthesis. Rather than proposing a new method, it conducts a systematic empirical study: controlled comparisons against established baselines, ablations over key design choices such as fusion architecture and training strategy, and a clear, reproducible recipe for training at scale. By disclosing design details and full training recipes that prior work often left unreported, the study aims to provide reliable data points and practical guidelines for future research on LLM-DiT co-design in multi-modal generation.
📝 Abstract
This paper does not describe a new method; instead, it thoroughly explores an important yet understudied design space in recent text-to-image synthesis -- specifically, the deep fusion of large language models (LLMs) and diffusion Transformers (DiTs) for multi-modal generation. Previous studies mainly reported overall system performance rather than controlled comparisons with alternative designs, and key design details and training recipes were often left undisclosed. These gaps create uncertainty about the real potential of this approach. To fill them, we conduct an empirical study on text-to-image generation, performing controlled comparisons with established baselines, analyzing important design choices, and providing a clear, reproducible recipe for training at scale. We hope this work offers meaningful data points and practical guidelines for future research in multi-modal generation.