🤖 AI Summary
Current unified multimodal understanding and generation models face two key bottlenecks: (1) conditioning diffusion-based generation solely on the final-layer hidden states of multimodal large language models (MLLMs), thereby neglecting the rich hierarchical representations in intermediate layers; and (2) the prohibitive computational cost of end-to-end pretraining. To address these, we propose the *ladder-side diffusion tuning* paradigm, which adopts a pretrained diffusion model as the generative backbone, systematically injects MLLM features from multiple layers, and couples this with parameter-efficient fine-tuning to establish a bidirectional understanding-generation co-architecture. This approach overcomes the limitations of single-layer conditioning, enabling fine-grained cross-modal alignment. It simultaneously enhances image generation fidelity and multimodal comprehension while significantly reducing the training overhead of unified modeling. Our method achieves state-of-the-art performance across multiple benchmark tasks.
📝 Abstract
This paper introduces TBAC-UniImage, a novel unified model for multimodal understanding and generation. We achieve this by deeply integrating a pre-trained Diffusion Model, acting as a generative ladder, with a Multimodal Large Language Model (MLLM). Previous diffusion-based unified models follow one of two approaches, each with a primary limitation. The first uses only the MLLM's final hidden state as the generative condition. This creates a shallow connection, as the generator is isolated from the rich, hierarchical representations in the MLLM's intermediate layers. The second, pretraining a unified generative architecture from scratch, is computationally expensive and prohibitive for many researchers. To overcome these issues, our work explores a new paradigm: instead of relying on a single output, we use representations from multiple, diverse layers of the MLLM as generative conditions for the diffusion model. This method treats the pre-trained generator as a ladder, receiving guidance from various depths of the MLLM's understanding process. Consequently, TBAC-UniImage achieves a much deeper and more fine-grained unification of understanding and generation.
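The multi-layer "ladder" conditioning described above can be sketched as follows. This is a minimal, hypothetical illustration, assuming a frozen MLLM that exposes per-layer hidden states and a diffusion backbone whose blocks accept a cross-attention context; the module names, tapped-layer choices, and dimensions are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch: project hidden states tapped from several MLLM
# layers into the conditioning space of matching diffusion blocks.
# Only these small projections are trained; the MLLM and the diffusion
# backbone themselves would stay frozen (parameter-efficient tuning).
import torch
import torch.nn as nn

class LadderCondition(nn.Module):
    """Maps a list of tapped MLLM hidden states (one per chosen layer)
    to per-block conditioning tensors for the diffusion generator."""

    def __init__(self, mllm_dim: int, cond_dim: int, num_taps: int):
        super().__init__()
        # One lightweight trainable projection per tapped MLLM layer.
        self.proj = nn.ModuleList(
            nn.Linear(mllm_dim, cond_dim) for _ in range(num_taps)
        )

    def forward(self, tapped_states):
        # tapped_states: list of [batch, seq, mllm_dim] tensors, e.g.
        # hidden states from layers 8, 16, and 24 of the MLLM.
        return [p(h) for p, h in zip(self.proj, tapped_states)]

# Toy usage: 3 tapped layers feeding 3 diffusion blocks.
batch, seq, mllm_dim, cond_dim = 2, 16, 1024, 768
taps = [torch.randn(batch, seq, mllm_dim) for _ in range(3)]
ladder = LadderCondition(mllm_dim, cond_dim, num_taps=3)
conds = ladder(taps)  # each entry feeds one block's cross-attention
```

In this reading, "guidance from various depths" means each diffusion block cross-attends to a different tapped layer, so shallow MLLM features can steer low-level image structure while deep features steer semantics; how layers are actually paired with blocks is a design choice of the full method.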