UniLIP: Adapting CLIP for Unified Multimodal Understanding, Generation and Editing

📅 2025-07-31
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing CLIP-based models lack native support for image reconstruction, generation, and editing without sacrificing multimodal understanding or introducing distortion-prone decoders or quantization modules. Method: We propose a two-stage training framework with self-distillation to inject reconstruction capability while freezing the CLIP visual encoder; a dual-conditional architecture jointly modeling text–image semantics by coupling a multimodal large language model (MLLM) with a diffusion Transformer; and learnable query mechanisms to enhance cross-modal feature alignment. Contribution/Results: Our method achieves state-of-the-art performance on GenEval (0.87), WISE (0.53), and ImgEdit (3.62), surpassing unified baselines including BAGEL and UniWorld-V1. It demonstrates both effectiveness and strong generalization across diverse vision-language generation and editing tasks, all while preserving CLIP's original multimodal representation fidelity.

๐Ÿ“ Abstract
In this paper, we propose UniLIP, which extends CLIP to reconstruction, generation, and editing, thereby building a unified tokenizer upon its exceptional comprehension capabilities. Previous CLIP-based unified methods often require additional diffusion decoders or quantization to support reconstruction and generation tasks, leading to inconsistent reconstruction or degradation of original comprehension performance. In contrast, we introduce a two-stage training scheme and a self-distillation strategy that progressively integrate reconstruction capabilities into CLIP, allowing it to maintain its original comprehension performance while achieving effective image reconstruction. Furthermore, we propose a dual-condition architecture to connect the MLLM and diffusion transformer, using both learnable queries and the last-layer multimodal hidden states as joint conditions. This design not only enables the utilization of the MLLM's strong reasoning capabilities in generation tasks, but also maximizes the exploitation of the rich information in UniLIP features during editing tasks. In text-to-image generation, UniLIP obtains scores of 0.87 and 0.53 on the GenEval and WISE benchmarks, respectively, surpassing all previous unified models of similar scale. In image editing, UniLIP achieves a score of 3.62 on the ImgEdit benchmark, surpassing recent state-of-the-art models such as BAGEL and UniWorld-V1. UniLIP effectively expands the application scope of CLIP, enabling continuous CLIP features not only to serve as the optimal choice for understanding tasks but also to achieve highly competitive performance in generation and editing tasks.
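As a rough illustration of the self-distillation idea described in the abstract (a frozen CLIP encoder supervising a trainable copy while a decoder learns to reconstruct pixels), here is a toy PyTorch sketch. Every module, dimension, and loss term below is a hypothetical stand-in for illustration, not the paper's actual architecture or training recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy sketch of self-distillation: a frozen copy of the encoder acts as
# teacher so the trainable student keeps its original features, while a
# small decoder learns to reconstruct the image from student features.
class ToyEncoder(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        # 4x4 patches -> tokens (a crude stand-in for a ViT encoder).
        self.net = nn.Sequential(nn.Conv2d(3, dim, 4, 4), nn.Flatten(2))

    def forward(self, x):
        return self.net(x).transpose(1, 2)  # (B, num_tokens, dim)

teacher = ToyEncoder()
student = ToyEncoder()
student.load_state_dict(teacher.state_dict())  # start from the same weights
for p in teacher.parameters():
    p.requires_grad_(False)  # teacher stays frozen

decoder = nn.Linear(64, 3 * 4 * 4)  # token -> 4x4 RGB patch

img = torch.randn(2, 3, 16, 16)
feat_s = student(img)
feat_t = teacher(img)

# Distillation term keeps the student close to its frozen teacher ...
loss_distill = F.mse_loss(feat_s, feat_t)
# ... while the reconstruction term teaches the decoder to invert features.
patches = decoder(feat_s)                                  # (2, 16, 48)
target = F.unfold(img, kernel_size=4, stride=4).transpose(1, 2)
loss_recon = F.mse_loss(patches, target)

loss = loss_recon + loss_distill
loss.backward()
```

The key property the sketch demonstrates is that gradients flow into the student and decoder but never into the frozen teacher, so the teacher's (CLIP-like) feature space anchors the student throughout training.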
Problem

Research questions and friction points this paper is trying to address.

Extends CLIP for unified multimodal understanding, generation, and editing
Maintains CLIP's comprehension while adding reconstruction capabilities
Connects MLLM and diffusion transformer for enhanced generation and editing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage training enhances CLIP reconstruction
Dual-condition links MLLM and diffusion transformer
Self-distillation maintains original CLIP comprehension
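The dual-condition link between the MLLM and the diffusion transformer described above could look roughly like the following PyTorch sketch, in which learnable queries attend over the MLLM's last-layer hidden states and both streams are projected and concatenated into one conditioning sequence. All module names, head counts, and dimensions are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DualConditionConnector(nn.Module):
    """Hypothetical connector: learnable queries + last-layer MLLM hidden
    states are jointly projected into the diffusion transformer's width."""

    def __init__(self, mllm_dim=1024, dit_dim=768, num_queries=64):
        super().__init__()
        # Learnable queries that summarize the MLLM hidden states.
        self.queries = nn.Parameter(torch.randn(num_queries, mllm_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(
            mllm_dim, num_heads=8, batch_first=True)
        # Separate projections map each condition stream to the DiT width.
        self.proj_queries = nn.Linear(mllm_dim, dit_dim)
        self.proj_hidden = nn.Linear(mllm_dim, dit_dim)

    def forward(self, hidden_states):
        # hidden_states: (batch, seq_len, mllm_dim) from the MLLM last layer.
        b = hidden_states.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        q_out, _ = self.cross_attn(q, hidden_states, hidden_states)
        # Joint condition: query summaries plus the full hidden sequence.
        return torch.cat(
            [self.proj_queries(q_out), self.proj_hidden(hidden_states)],
            dim=1)  # (batch, num_queries + seq_len, dit_dim)

connector = DualConditionConnector()
h = torch.randn(2, 77, 1024)        # e.g. 77 multimodal tokens
cond = connector(h)
print(cond.shape)                   # torch.Size([2, 141, 768])
```

Concatenating both streams lets the diffusion transformer see compact, query-distilled semantics (useful for generation from reasoning-heavy prompts) alongside the raw token-level features (useful for editing, where fine-grained image information matters).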
Hao Tang
Center for Data Science, Peking University
Chenwei Xie
Alibaba Group
Xiaoyi Bao
Alibaba Group, CASIA
Tingyu Weng
Alibaba Group
Pandeng Li
University of Science and Technology of China & Alibaba Tongyi
Yun Zheng
Alibaba
Liwei Wang
Center for Data Science, Peking University