UniWorld: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

📅 2025-06-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing unified vision models excel at understanding and generation tasks but underperform on operational tasks such as image perception and editing. To address this gap, we propose the first high-resolution unified visual framework grounded in semantic features—departing from conventional VAE-based paradigms—and empirically demonstrate, for the first time, that a semantic encoder can replace VAEs to enable high-fidelity image manipulation. Our method integrates multimodal large language model (MLLM)-driven feature extraction, contrastive learning-based semantic encoding, and a lightweight diffusion decoder, optimized end-to-end in a shared semantic space. Trained on only 2.7M samples, our model achieves state-of-the-art performance across four core capabilities: visual understanding, image generation, pixel-level editing, and perceptual reasoning—surpassing leading unified models in all categories. We fully open-source the model weights, training code, and dataset.

Technology Category

Application Category

📝 Abstract
Although existing unified models achieve strong performance in vision-language understanding and text-to-image generation, they remain limited in addressing image perception and manipulation -- capabilities increasingly demanded in practical applications. Recently, OpenAI introduced the powerful GPT-4o-Image model, which showcases advanced capabilities in comprehensive image perception and manipulation, sparking widespread interest. Through carefully designed experiments, we observe that GPT-4o-Image likely relies on semantic encoders rather than VAEs for feature extraction, despite VAEs being commonly regarded as crucial for image manipulation tasks. Inspired by this insight, we propose UniWorld, a unified generative framework built upon semantic features extracted from powerful multimodal large language models and contrastive semantic encoders. Using only 2.7M training data, UniWorld achieves impressive performance across diverse tasks, including image understanding, generation, manipulation, and perception. We fully open-source the UniWorld framework, including model weights, training and evaluation scripts, and datasets to promote reproducibility and further research.
Problem

Research questions and friction points this paper is trying to address.

Unified model lacks image perception and manipulation tasks
GPT-4o-Image uses semantic encoders instead of VAEs
UniWorld improves image editing with minimal data usage
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses semantic encoders for unified tasks
Leverages visual-language model features
Achieves strong performance with minimal data
🔎 Similar Papers
2024-02-20International Conference on Machine LearningCitations: 30
B
Bin Lin
Peking University, Shenzhen Graduate School; Rabbitpre AI
Z
Zongjian Li
Peking University, Shenzhen Graduate School; Rabbitpre AI
Xinhua Cheng
Xinhua Cheng
Peking University
computer vision
Yuwei Niu
Yuwei Niu
Chongqing university
Visual RepresentationsLanguage Priors
Y
Yang Ye
Peking University, Shenzhen Graduate School; Rabbitpre AI
X
Xianyi He
Peking University, Shenzhen Graduate School; Rabbitpre AI
S
Shenghai Yuan
Peking University, Shenzhen Graduate School; Rabbitpre AI
Wangbo Yu
Wangbo Yu
Peking University
3D VisionAIGC
S
Shaodong Wang
Peking University, Shenzhen Graduate School; Rabbitpre AI
Yunyang Ge
Yunyang Ge
北京大学
Yatian Pang
Yatian Pang
National University of Singapore
Multi-modal understandingMulti-modal generationUnified models
Li Yuan
Li Yuan
Research Associate, University of Science & Technology of China (USTC)
Antibiotic resistanceWastewater treatmentEnvironmental bioremediationAnaerobic digestionFate of organic pollutants