Ovis-Image Technical Report

📅 2025-11-28
📈 Citations: 0
Influential: 0

🤖 AI Summary
To address the poor text rendering fidelity and deployment difficulty of small text-to-image models in compute-constrained settings, this work proposes a lightweight 7B architecture. It integrates the Ovis 2.5 multimodal backbone with a diffusion-based visual decoder and uses a two-stage, text-centric training pipeline (large-scale pre-training followed by carefully tailored post-training) aimed at high-fidelity Chinese–English bilingual text rendering. With far fewer parameters than mainstream large models, the system runs inference on a single high-end GPU with moderate memory. The reported text rendering performance is on par with significantly larger open models such as Qwen-Image and approaches that of closed-source systems such as Seedream and GPT-4o. The authors conclude that a strong multimodal backbone combined with a text-focused training recipe is sufficient for reliable bilingual text rendering without oversized or proprietary models.

📝 Abstract
We introduce $\textbf{Ovis-Image}$, a 7B text-to-image model specifically optimized for high-quality text rendering, designed to operate efficiently under stringent computational constraints. Built upon our previous Ovis-U1 framework, Ovis-Image integrates a diffusion-based visual decoder with the stronger Ovis 2.5 multimodal backbone, leveraging a text-centric training pipeline that combines large-scale pre-training with carefully tailored post-training refinements. Despite its compact architecture, Ovis-Image achieves text rendering performance on par with significantly larger open models such as Qwen-Image and approaches closed-source systems like Seedream and GPT-4o. Crucially, the model remains deployable on a single high-end GPU with moderate memory, narrowing the gap between frontier-level text rendering and practical deployment. Our results indicate that combining a strong multimodal backbone with a carefully designed, text-focused training recipe is sufficient to achieve reliable bilingual text rendering without resorting to oversized or proprietary models.
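
To make the single-GPU claim concrete, the following is a rough back-of-the-envelope sketch of weight memory at different numeric precisions. It is not taken from the report: the parameter count is simply the "7B" figure read as 7e9, the helper name `weight_memory_gib` is hypothetical, and the estimate covers weights only. Activations, any separate text encoder or VAE, and framework overhead are excluded.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Approximate memory needed to hold the model weights only, in GiB."""
    return num_params * bytes_per_param / 2**30


# Assumption: "7B" is taken as exactly 7e9 parameters; the true count may differ.
NUM_PARAMS = 7e9

# Bytes per parameter for common storage precisions.
PRECISIONS = [("fp32", 4), ("bf16/fp16", 2), ("int8", 1)]

for name, nbytes in PRECISIONS:
    print(f"{name:>9}: {weight_memory_gib(NUM_PARAMS, nbytes):.1f} GiB")
```

Under these assumptions, half-precision weights take roughly 13 GiB, which is consistent with the abstract's description of deployment on a single high-end GPU with moderate memory. Actual peak usage during inference will be higher.
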
Problem

Research questions and friction points this paper is trying to address.

Develops a compact text-to-image model for high-quality text rendering
Optimizes for efficient deployment under strict computational constraints
Achieves competitive performance without oversized or proprietary architectures
Innovation

Methods, ideas, or system contributions that make the work stand out.

7B text-to-image model optimized for high-quality text rendering
Integrates a diffusion-based visual decoder with the Ovis 2.5 multimodal backbone, trained with a text-centric pipeline
Deployable on single GPU, balancing performance and practical deployment