Ovis-U1 Technical Report

📅 2025-06-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of unifying multimodal understanding, text-to-image generation, and image editing within a single model while reaching state-of-the-art (SOTA) performance. To this end, we propose Ovis-U1, a 3B-parameter unified multimodal model trained end to end under a novel paradigm: a language model backbone feeds a bidirectional token refiner, which in turn conditions a diffusion-based visual decoder. This design removes the rigid boundary between understanding and generation tasks and enables cross-modal co-optimization. Built upon the Ovis framework, Ovis-U1 jointly learns all three capabilities in a shared parameter space. It scores 69.6 on the OpenCompass multimodal leaderboard, 83.72 on DPG-Bench, and 0.89 on GenEval, and it significantly outperforms mainstream methods on image-editing benchmarks.
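To make the three-stage pipeline concrete, here is a minimal PyTorch sketch of the composition described above (language model backbone → bidirectional token refiner → diffusion-based visual decoder). Every module size, name, and design detail below is an illustrative assumption, not the authors' implementation.

```python
# Illustrative sketch only: the stage composition mirrors the summary above,
# but all sizes and the refiner/decoder designs are assumptions.
import torch
import torch.nn as nn

class BidirectionalTokenRefiner(nn.Module):
    """Refines LLM hidden states with full (non-causal) self-attention
    before they condition the visual decoder (hypothetical design)."""
    def __init__(self, dim=1024, depth=2, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, h):
        # No causal mask: tokens attend bidirectionally.
        return self.encoder(h)

class DiffusionVisualDecoder(nn.Module):
    """Stand-in for the diffusion decoder: predicts the noise added to image
    latents, conditioned on refined tokens via cross-attention (simplified)."""
    def __init__(self, latent_dim=64, cond_dim=1024, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(latent_dim, heads, kdim=cond_dim,
                                          vdim=cond_dim, batch_first=True)
        self.proj = nn.Linear(latent_dim, latent_dim)

    def forward(self, noisy_latents, cond):
        x, _ = self.attn(noisy_latents, cond, cond)
        return self.proj(x)  # predicted noise (epsilon)

# Toy forward pass; nn.Embedding is a placeholder for the 3B LLM backbone.
backbone = nn.Embedding(32000, 1024)
refiner = BidirectionalTokenRefiner()
decoder = DiffusionVisualDecoder()

prompt_ids = torch.randint(0, 32000, (1, 16))   # fake prompt tokens
cond = refiner(backbone(prompt_ids))            # (1, 16, 1024)
noisy = torch.randn(1, 256, 64)                 # fake noised image latents
eps_hat = decoder(noisy, cond)                  # (1, 256, 64)
```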

📝 Abstract
In this report, we introduce Ovis-U1, a 3-billion-parameter unified model that integrates multimodal understanding, text-to-image generation, and image editing capabilities. Building on the foundation of the Ovis series, Ovis-U1 incorporates a diffusion-based visual decoder paired with a bidirectional token refiner, enabling image generation tasks comparable to leading models like GPT-4o. Unlike some previous models that use a frozen MLLM for generation tasks, Ovis-U1 utilizes a new unified training approach starting from a language model. Compared to training solely on understanding or generation tasks, unified training yields better performance, demonstrating the enhancement achieved by integrating these two tasks. Ovis-U1 achieves a score of 69.6 on the OpenCompass Multi-modal Academic Benchmark, surpassing recent state-of-the-art models such as Ristretto-3B and SAIL-VL-1.5-2B. In text-to-image generation, it excels with scores of 83.72 and 0.89 on the DPG-Bench and GenEval benchmarks, respectively. For image editing, it achieves 4.00 and 6.42 on the ImgEdit-Bench and GEdit-Bench-EN, respectively. As the initial version of the Ovis unified model series, Ovis-U1 pushes the boundaries of multimodal understanding, generation, and editing.
Problem

Research questions and friction points this paper is trying to address.

How to unify multimodal understanding, text-to-image generation, and image editing in a single model
How to obtain high-quality generation and editing from a diffusion-based visual decoder driven by an MLLM
Whether unified training can match or surpass specialized models on academic and generation benchmarks
Innovation

Methods, ideas, or system contributions that make the work stand out.

3B-parameter unified multimodal model
Diffusion-based visual decoder paired with a bidirectional token refiner
Unified training starting from a language model (see the sketch after this list)
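The unified-training bullet amounts to optimizing one shared backbone on both an understanding loss and a generation loss. A minimal sketch follows, assuming next-token cross-entropy for understanding and an epsilon-prediction diffusion loss for generation; the model methods (backbone_logits, refine, add_noise, decoder) and the loss weight lambda_gen are hypothetical, not the paper's recipe.

```python
# Hedged sketch of joint understanding + generation training.
# The model API used here (backbone_logits, refine, add_noise, decoder)
# is hypothetical, and lambda_gen is an assumed loss weight.
import torch
import torch.nn.functional as F

def unified_step(model, und_batch, gen_batch, lambda_gen=1.0):
    # Understanding: standard autoregressive next-token cross-entropy.
    ids = und_batch["input_ids"]                   # (B, T) text tokens
    logits = model.backbone_logits(ids)            # (B, T, V), hypothetical
    l_und = F.cross_entropy(logits[:, :-1].reshape(-1, logits.size(-1)),
                            ids[:, 1:].reshape(-1))

    # Generation: epsilon-prediction MSE on noised image latents,
    # conditioned on prompt tokens refined by the shared backbone.
    latents = gen_batch["latents"]                 # (B, N, D) clean latents
    cond = model.refine(gen_batch["prompt_ids"])   # hypothetical
    noise = torch.randn_like(latents)
    t = torch.randint(0, 1000, (latents.size(0),), device=latents.device)
    noisy = model.add_noise(latents, noise, t)     # hypothetical noise scheduler
    l_gen = F.mse_loss(model.decoder(noisy, cond), noise)

    # One step on the sum lets gradients from both tasks update
    # the shared parameter space.
    return l_und + lambda_gen * l_gen
```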
👥 Authors
Guo-Hua Wang
Alibaba
Machine Learning, Deep Learning
Shanshan Zhao
Ovis Team, Alibaba Group
Xinjie Zhang
Researcher, Microsoft Research Asia
Multimodal Understanding and Generation, Neural Compression, Gaussian Splatting
Liangfu Cao
Ovis Team, Alibaba Group
Pengxin Zhan
Ovis Team, Alibaba Group
Lunhao Duan
Ovis Team, Alibaba Group
Shiyin Lu
Alibaba Group
Multimodal Large Language Models, Online Learning, Bandits
Minghao Fu
Ovis Team, Alibaba Group
Xiaohao Chen
Ovis Team, Alibaba Group
Jianshan Zhao
Ovis Team, Alibaba Group
Yang Li
Ovis Team, Alibaba Group
Qing-Guo Chen
Alibaba Group
Machine Learning