UniFusion: Vision-Language Model as Unified Encoder in Image Generation

📅 2025-10-14
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current vision-generation models typically employ separate image and text encoders, limiting cross-modal reasoning and knowledge transfer. To address this, we propose UniFusion, a diffusion-based framework that leverages a frozen large vision-language model (VLM) as a unified multimodal encoder. It introduces Layerwise Attention Pooling (LAP) to extract both high-level semantics and fine-grained visual details from the VLM's text and visual tokens, and VERIFI (VLM-Enabled Rewriting Injection with Flexible Inference), which conditions the diffusion transformer only on text tokens generated by the VLM during in-model prompt rewriting. While keeping the VLM frozen, UniFusion improves text-image alignment and the fidelity of transferred visual detail. Moreover, after fine-tuning on single-reference image editing, it generalizes zero-shot to multi-reference editing tasks. Experimental results validate the effectiveness and strong generalization of the unified-encoder design in diffusion-based generation.

📝 Abstract
Although recent advances in visual generation have been remarkable, most existing architectures still depend on distinct encoders for images and text. This separation constrains diffusion models' ability to perform cross-modal reasoning and knowledge transfer. Prior attempts to bridge this gap often use only the last-layer information from a VLM, employ multiple visual encoders, or train large unified models jointly for text and image generation, which demands substantial computational resources and large-scale data, limiting their accessibility. We present UniFusion, a diffusion-based generative model conditioned on a frozen large vision-language model (VLM) that serves as a unified multimodal encoder. At the core of UniFusion is the Layerwise Attention Pooling (LAP) mechanism, which extracts both high-level semantics and low-level details from the text and visual tokens of a frozen VLM to condition a diffusion generative model. We demonstrate that LAP outperforms other shallow fusion architectures on text-image alignment for generation and on faithful transfer of visual information from the VLM to the diffusion model, which is key for editing. We propose VLM-Enabled Rewriting Injection with Flexible Inference (VERIFI), which conditions a diffusion transformer (DiT) only on the text tokens generated by the VLM during in-model prompt rewriting. VERIFI combines alignment of the conditioning distribution with the VLM's reasoning, increasing capability and flexibility at inference. In addition, fine-tuning on the editing task not only improves text-image alignment for generation, indicative of cross-modality knowledge transfer, but also exhibits strong generalization: our model, when trained on single-image editing, generalizes zero-shot to multiple image references, further motivating the unified encoder design of UniFusion.
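The abstract's core mechanism, Layerwise Attention Pooling, can be sketched as attention across a frozen VLM's layer stack: a learnable query scores each layer's features per token, and the softmax-weighted sum fuses high-level and low-level representations into one conditioning token per input token. This is a minimal numpy sketch under that reading; `query` and `w_k` are hypothetical learnable parameters, and the paper's exact formulation may differ.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layerwise_attention_pool(hidden_states, query, w_k):
    """Fuse per-layer VLM features into one conditioning token per input token.

    hidden_states: [L, T, d] token features from L layers of a frozen VLM
    query:         [d]     learnable pooling query (hypothetical parameter)
    w_k:           [d, d]  learnable key projection (hypothetical parameter)
    returns:       [T, d]  layer-fused conditioning tokens
    """
    L, T, d = hidden_states.shape
    keys = hidden_states @ w_k                                  # [L, T, d]
    scores = np.einsum('ltd,d->lt', keys, query) / np.sqrt(d)   # [L, T]
    weights = softmax(scores, axis=0)                           # attend across layers
    return np.einsum('lt,ltd->td', weights, hidden_states)      # weighted layer sum

# Toy usage: 4 layers, 6 tokens, feature dim 8
rng = np.random.default_rng(0)
h = rng.normal(size=(4, 6, 8))
fused = layerwise_attention_pool(h, rng.normal(size=8), rng.normal(size=(8, 8)))
```

Pooling across layers, rather than taking only the last layer, is what lets the diffusion model see both semantic (deep) and detail-preserving (shallow) features from the same frozen encoder.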
Problem

Research questions and friction points this paper is trying to address.

Unifying image and text encoders for cross-modal reasoning
Reducing computational demands in multimodal generation models
Enhancing text-image alignment and editing capabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

UniFusion uses frozen vision-language model as unified encoder
Layerwise Attention Pooling extracts multi-level VLM features
VERIFI enables flexible inference with VLM-generated text tokens
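The VERIFI flow described above (the VLM rewrites the prompt in-model, and the DiT is conditioned only on the VLM-generated tokens) can be illustrated with toy stand-ins. Everything here is hypothetical scaffolding, not the paper's API: the stubs only show the two-stage data flow.

```python
def vlm_rewrite(prompt):
    """Stub VLM: expands a terse prompt and returns toy per-token features.
    Stands in for in-model prompt rewriting by a frozen VLM."""
    rewritten = prompt + ", highly detailed, natural lighting"
    features = [[float(len(tok))] for tok in rewritten.split()]
    return rewritten, features

def dit_denoise(latents, cond):
    """Stub DiT step: the point is that `cond` comes only from the
    VLM-generated (rewritten) tokens, not from the raw user prompt."""
    scale = sum(f[0] for f in cond) / len(cond)
    return [x * 0.9 + 0.1 * scale for x in latents]

prompt = "a red bicycle"
rewritten, cond = vlm_rewrite(prompt)    # stage 1: in-model rewriting
latents = dit_denoise([0.0, 1.0], cond)  # stage 2: condition DiT on rewritten tokens
```

Conditioning on the rewritten tokens aligns the conditioning distribution with what the VLM itself produces, which is how VERIFI folds the VLM's reasoning into generation without tuning the encoder.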