Ovis-Image Technical Report

📅 2025-11-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Small text-to-image models tend to render text poorly, and larger ones are hard to deploy under tight compute budgets. To address both problems, this work proposes a compact 7B model that pairs the Ovis 2.5 multimodal backbone with a diffusion-based visual decoder. Training follows a text-centric two-stage recipe, large-scale pre-training followed by tailored post-training, aimed at high-fidelity bilingual (Chinese-English) text rendering. The model has far fewer parameters than mainstream large models and runs inference on a single high-end GPU with moderate memory. In the reported experiments, its text rendering is on par with significantly larger open models such as Qwen-Image and approaches that of proprietary systems such as Seedream and GPT-4o. The results suggest that a strong multimodal backbone combined with a text-focused training recipe is enough for reliable bilingual text rendering, without resorting to oversized or proprietary models.

📝 Abstract

We introduce $\textbf{Ovis-Image}$, a 7B text-to-image model specifically optimized for high-quality text rendering, designed to operate efficiently under stringent computational constraints. Built upon our previous Ovis-U1 framework, Ovis-Image integrates a diffusion-based visual decoder with the stronger Ovis 2.5 multimodal backbone, leveraging a text-centric training pipeline that combines large-scale pre-training with carefully tailored post-training refinements. Despite its compact architecture, Ovis-Image achieves text rendering performance on par with significantly larger open models such as Qwen-Image and approaches closed-source systems like Seedream and GPT-4o. Crucially, the model remains deployable on a single high-end GPU with moderate memory, narrowing the gap between frontier-level text rendering and practical deployment. Our results indicate that combining a strong multimodal backbone with a carefully designed, text-focused training recipe is sufficient to achieve reliable bilingual text rendering without resorting to oversized or proprietary models.

Problem

Research questions and friction points this paper is trying to address.

Develops a compact text-to-image model for high-quality text rendering

Optimizes for efficient deployment under strict computational constraints

Achieves competitive performance without oversized or proprietary architectures

Innovation

Methods, ideas, or system contributions that make the work stand out.

7B text-to-image model optimized for high-quality text rendering

Integrates a diffusion-based visual decoder with the Ovis 2.5 multimodal backbone, trained with a text-centric pipeline

Deployable on a single high-end GPU with moderate memory, balancing performance and practical deployment
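
To put the single-GPU claim in rough numbers, here is a back-of-the-envelope sketch of the memory needed to hold the weights alone. It assumes the 7B figure is the total parameter count and that weights are stored in one of a few common precisions. The source states neither assumption, and the function name `weight_memory_gib` is purely illustrative. Activations, the text encoder and decoder buffers, and framework overhead are ignored, so real usage would be higher.

```python
# Rough estimate of weight-only memory for a model of a given size.
# Assumptions (not from the source): the parameter count is the total,
# and weights are stored uniformly in a single precision.

BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "int8": 1}


def weight_memory_gib(num_params: float, dtype: str = "bf16") -> float:
    """Return the memory needed to store the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


if __name__ == "__main__":
    # 7e9 is the "7B" figure quoted in the abstract.
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype}: {weight_memory_gib(7e9, dtype):.1f} GiB")
```

Under these assumptions, 7B parameters in bf16 come to about 13 GiB of weights, which is at least consistent with the "single high-end GPU with moderate memory" description. The actual footprint depends on details the summary does not give.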
