Ovis-Image Technical Report

📅 2025-11-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address poor text rendering fidelity and deployment challenges of small-scale text-to-image models under compute-constrained settings, this work proposes a lightweight and efficient architecture. It integrates the Ovis 2.5 multimodal backbone with a diffusion-based visual decoder and introduces a fine-grained two-stage training paradigm—comprising large-scale pretraining followed by domain-specific post-training—tailored for high-fidelity Chinese–English bilingual text rendering. With significantly fewer parameters than mainstream large models, the system enables efficient inference on a single high-end GPU. Experiments demonstrate superior text-to-image fidelity compared to open-source models such as Qwen-Image, approaching the performance of proprietary systems like Seedream and GPT-4o. To our knowledge, this is the first compact architecture achieving high accuracy, strong cross-lingual generalization, and practical deployability for bilingual text rendering.
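The two-stage paradigm described above (large-scale pretraining followed by domain-specific post-training) can be sketched as a minimal staged-training loop. This is purely illustrative: the stage names, data descriptions, and the `run_pipeline` interface are assumptions, not the paper's actual recipe.

```python
# Illustrative sketch of a two-stage, text-centric training paradigm:
# large-scale pretraining, then domain-specific post-training for
# Chinese-English bilingual text rendering. Names are hypothetical.

STAGES = [
    {
        "name": "pretraining",
        "data": "large-scale image-text pairs",
        "goal": "general text-to-image alignment",
    },
    {
        "name": "post-training",
        "data": "Chinese-English text-rendering pairs",  # domain-specific
        "goal": "high-fidelity bilingual text rendering",
    },
]

def run_pipeline(train_stage):
    """Run each stage in order with a caller-supplied trainer callback."""
    completed = []
    for stage in STAGES:
        train_stage(stage)  # hypothetical per-stage training routine
        completed.append(stage["name"])
    return completed

order = run_pipeline(lambda stage: None)
# order == ["pretraining", "post-training"]
```

The point of the sketch is the ordering constraint: post-training refines a model that pretraining has already aligned, so the stages run strictly in sequence.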

📝 Abstract
We introduce Ovis-Image, a 7B text-to-image model specifically optimized for high-quality text rendering, designed to operate efficiently under stringent computational constraints. Built upon our previous Ovis-U1 framework, Ovis-Image integrates a diffusion-based visual decoder with the stronger Ovis 2.5 multimodal backbone, leveraging a text-centric training pipeline that combines large-scale pre-training with carefully tailored post-training refinements. Despite its compact architecture, Ovis-Image achieves text rendering performance on par with significantly larger open models such as Qwen-Image and approaches closed-source systems like Seedream and GPT-4o. Crucially, the model remains deployable on a single high-end GPU with moderate memory, narrowing the gap between frontier-level text rendering and practical deployment. Our results indicate that combining a strong multimodal backbone with a carefully designed, text-focused training recipe is sufficient to achieve reliable bilingual text rendering without resorting to oversized or proprietary models.
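The architecture the abstract describes, a multimodal backbone encoding the prompt and a diffusion-based visual decoder denoising an image latent under that conditioning, can be sketched as follows. Every class, function, and parameter name here is a stand-in for illustration; this is not the Ovis-Image API, and the toy "denoising" is a deliberately simplified update rule.

```python
from dataclasses import dataclass
import hashlib
import random

# Toy sketch of the pipeline: a backbone encodes the prompt into
# per-token vectors, and a diffusion-style decoder iteratively refines
# an image latent conditioned on those vectors. All names hypothetical.

@dataclass
class PromptEncoding:
    tokens: list  # one embedding vector per prompt token

def encode_prompt(prompt: str, dim: int = 8) -> PromptEncoding:
    """Stand-in for the multimodal backbone: deterministic toy embeddings."""
    seed = int(hashlib.md5(prompt.encode()).hexdigest(), 16) % (2**32)
    rng = random.Random(seed)
    return PromptEncoding(
        [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in prompt.split()]
    )

def denoise_step(latent, cond: PromptEncoding, t: int):
    """One toy refinement step: pull the latent toward the mean condition."""
    dim = len(latent)
    mean_cond = [sum(v[i] for v in cond.tokens) / len(cond.tokens)
                 for i in range(dim)]
    alpha = 1.0 / (t + 1)  # smaller corrections at later steps
    return [(1 - alpha) * x + alpha * m for x, m in zip(latent, mean_cond)]

def generate(prompt: str, steps: int = 10, dim: int = 8):
    """Full loop: encode once, then refine the latent over several steps."""
    cond = encode_prompt(prompt, dim)
    latent = [0.0] * dim  # blank latent in place of real Gaussian noise
    for t in range(steps):
        latent = denoise_step(latent, cond, t)
    return latent

latent = generate("a shop sign reading 开业大吉 in red")
```

The design point the sketch mirrors is the separation of concerns: the backbone is queried once per prompt, while the decoder runs many cheap conditioned steps, which is what makes single-GPU inference plausible for a compact model.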
Problem

Research questions and friction points this paper is trying to address.

Develops a compact text-to-image model for high-quality text rendering
Optimizes for efficient deployment under strict computational constraints
Achieves competitive performance without oversized or proprietary architectures
Innovation

Methods, ideas, or system contributions that make the work stand out.

7B text-to-image model optimized for high-quality text rendering
Integrates diffusion-based decoder with multimodal backbone for efficient training
Deployable on single GPU, balancing performance and practical deployment
Guo-Hua Wang
Alibaba
Machine Learning, Deep Learning
Liangfu Cao
Ovis Team, Alibaba Group
Tianyu Cui
Ovis Team, Alibaba Group
Minghao Fu
Ovis Team, Alibaba Group
Xiaohao Chen
Ovis Team, Alibaba Group
Pengxin Zhan
Ovis Team, Alibaba Group
Jianshan Zhao
Ovis Team, Alibaba Group
Lan Li
University of North Carolina at Chapel Hill
future of work, digital labor, AI and work
Bowen Fu
Ovis Team, Alibaba Group
Jiaqi Liu
Ovis Team, Alibaba Group
Qing-Guo Chen
alibaba-inc
machine learning