🤖 AI Summary
Current unified multimodal understanding and generation models face two key bottlenecks: (1) conditioning diffusion-based generation solely on the final-layer hidden states of multimodal large language models (MLLMs), thereby neglecting the rich hierarchical representations in intermediate layers; and (2) the prohibitive computational cost of end-to-end pretraining. To address these, we propose the *ladder-side diffusion tuning* paradigm, which adopts a pretrained diffusion model as the generative backbone, systematically injects MLLM features from multiple layers, and couples this with parameter-efficient fine-tuning to establish a bidirectional understanding-generation co-architecture. This approach overcomes the limitations of single-layer conditioning, enabling fine-grained cross-modal alignment. It simultaneously enhances image generation fidelity and multimodal comprehension while significantly reducing the training overhead of unified modeling. Our method achieves state-of-the-art performance across multiple benchmark tasks.
📝 Abstract
This paper introduces TBAC-UniImage, a novel unified model for multimodal understanding and generation. We achieve this by deeply integrating a pre-trained Diffusion Model, acting as a generative ladder, with a Multimodal Large Language Model (MLLM). Previous diffusion-based unified models follow one of two approaches, each with a primary limitation. The first uses only the MLLM's final hidden state as the generative condition. This creates a shallow connection, as the generator is isolated from the rich, hierarchical representations in the MLLM's intermediate layers. The second, pretraining a unified generative architecture from scratch, is computationally expensive and prohibitive for many researchers. To overcome these issues, our work explores a new paradigm: instead of relying on a single output, we use representations from multiple, diverse layers of the MLLM as generative conditions for the diffusion model. This method treats the pre-trained generator as a ladder, receiving guidance from various depths of the MLLM's understanding process. Consequently, TBAC-UniImage achieves a much deeper and more fine-grained unification of understanding and generation.
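The multi-layer "ladder" conditioning described above can be sketched as follows. This is a minimal, hypothetical illustration, assuming a frozen MLLM that exposes per-layer hidden states and a diffusion backbone whose blocks accept a cross-attention context; the module names, tapped-layer choices, and dimensions are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch: project hidden states tapped from several MLLM
# layers into the conditioning space of matching diffusion blocks.
# Only these small projections are trained; the MLLM and the diffusion
# backbone themselves would stay frozen (parameter-efficient tuning).
import torch
import torch.nn as nn

class LadderCondition(nn.Module):
    """Maps a list of tapped MLLM hidden states (one per chosen layer)
    to per-block conditioning tensors for the diffusion generator."""

    def __init__(self, mllm_dim: int, cond_dim: int, num_taps: int):
        super().__init__()
        # One lightweight trainable projection per tapped MLLM layer.
        self.proj = nn.ModuleList(
            nn.Linear(mllm_dim, cond_dim) for _ in range(num_taps)
        )

    def forward(self, tapped_states):
        # tapped_states: list of [batch, seq, mllm_dim] tensors, e.g.
        # hidden states from layers 8, 16, and 24 of the MLLM.
        return [p(h) for p, h in zip(self.proj, tapped_states)]

# Toy usage: 3 tapped layers feeding 3 diffusion blocks.
batch, seq, mllm_dim, cond_dim = 2, 16, 1024, 768
taps = [torch.randn(batch, seq, mllm_dim) for _ in range(3)]
ladder = LadderCondition(mllm_dim, cond_dim, num_taps=3)
conds = ladder(taps)  # each entry feeds one block's cross-attention
```

In this reading, "guidance from various depths" means each diffusion block cross-attends to a different tapped layer, so shallow MLLM features can steer low-level image structure while deep features steer semantics; how layers are actually paired with blocks is a design choice of the full method.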