🤖 AI Summary
To address the poor text-rendering fidelity and deployment challenges of small-scale text-to-image models in compute-constrained settings, this work proposes a lightweight, efficient architecture. It couples the Ovis 2.5 multimodal backbone with a diffusion-based visual decoder and introduces a fine-grained two-stage training paradigm, large-scale pretraining followed by domain-specific post-training, tailored for high-fidelity Chinese–English bilingual text rendering. With significantly fewer parameters than mainstream large models, the system supports efficient inference on a single high-end GPU. Experiments show text-rendering fidelity on par with much larger open-source models such as Qwen-Image, approaching proprietary systems like Seedream and GPT-4o. To the authors' knowledge, this is the first compact architecture to combine high accuracy, strong cross-lingual generalization, and practical deployability for bilingual text rendering.
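The two-stage recipe described above can be pictured as one training loop run twice, swapping the data mixture and learning rate between stages. The sketch below is a minimal, hypothetical PyTorch illustration: the toy model, random data, and hyperparameters (`ToyTextToImageModel`, `train_stage`, the learning rates and stage sizes) are assumptions for exposition, not the paper's actual implementation or objective.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

class ToyTextToImageModel(nn.Module):
    """Toy stand-in: a linear 'decoder' trained with an MSE loss in place of
    the real diffusion decoder and its denoising objective."""
    def __init__(self, cond_dim=64, img_dim=256):
        super().__init__()
        self.decoder = nn.Linear(cond_dim, img_dim)

    def forward(self, cond, target):
        return nn.functional.mse_loss(self.decoder(cond), target)

def train_stage(model, loader, lr, epochs):
    """One training stage: a fixed data mixture trained at a fixed learning rate."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(epochs):
        for cond, target in loader:
            loss = model(cond, target)
            opt.zero_grad()
            loss.backward()
            opt.step()

model = ToyTextToImageModel()

# Stage 1: large-scale pre-training on a broad text-image mixture (random toy data).
pretrain = DataLoader(TensorDataset(torch.randn(128, 64), torch.randn(128, 256)), batch_size=16)
train_stage(model, pretrain, lr=1e-4, epochs=1)

# Stage 2: domain-specific post-training on bilingual text-rendering data (random toy data).
posttrain = DataLoader(TensorDataset(torch.randn(32, 64), torch.randn(32, 256)), batch_size=16)
train_stage(model, posttrain, lr=1e-5, epochs=1)
```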
📝 Abstract
We introduce $\textbf{Ovis-Image}$, a 7B text-to-image model specifically optimized for high-quality text rendering, designed to operate efficiently under stringent computational constraints. Built upon our previous Ovis-U1 framework, Ovis-Image integrates a diffusion-based visual decoder with the stronger Ovis 2.5 multimodal backbone, leveraging a text-centric training pipeline that combines large-scale pre-training with carefully tailored post-training refinements. Despite its compact architecture, Ovis-Image achieves text rendering performance on par with significantly larger open models such as Qwen-Image and approaches closed-source systems like Seedream and GPT-4o. Crucially, the model remains deployable on a single high-end GPU with moderate memory, narrowing the gap between frontier-level text rendering and practical deployment. Our results indicate that combining a strong multimodal backbone with a carefully designed, text-focused training recipe is sufficient to achieve reliable bilingual text rendering without resorting to oversized or proprietary models.
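To make the backbone-plus-decoder design concrete, here is a minimal sketch of how a multimodal backbone can condition a diffusion-based visual decoder via cross-attention. All classes, dimensions, and the single denoising step shown (`BackboneStub`, `DiffusionDecoderStub`, the latent shapes) are illustrative assumptions; the actual Ovis-Image components are far larger and more elaborate.

```python
import torch
from torch import nn

class BackboneStub(nn.Module):
    """Stand-in for the Ovis 2.5 multimodal backbone: maps prompt tokens to
    conditioning embeddings consumed by the visual decoder."""
    def __init__(self, vocab_size=1000, cond_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, cond_dim)

    def forward(self, token_ids):
        return self.embed(token_ids)  # (batch, seq_len, cond_dim)

class DiffusionDecoderStub(nn.Module):
    """Stand-in for the diffusion-based visual decoder: predicts the noise on
    image latents, conditioned on prompt embeddings via cross-attention."""
    def __init__(self, cond_dim=64, latent_dim=16):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(
            latent_dim, num_heads=4, kdim=cond_dim, vdim=cond_dim, batch_first=True)
        self.out = nn.Linear(latent_dim, latent_dim)

    def forward(self, noisy_latents, cond):
        attended, _ = self.cross_attn(noisy_latents, cond, cond)
        return self.out(attended)  # predicted noise, same shape as the latents

backbone, decoder = BackboneStub(), DiffusionDecoderStub()
tokens = torch.randint(0, 1000, (1, 8))   # a tokenized (bilingual) prompt
cond = backbone(tokens)                   # prompt conditioning from the backbone
latents = torch.randn(1, 32, 16)          # noisy image latents at one diffusion step
noise_pred = decoder(latents, cond)       # one conditioned denoising step
print(noise_pred.shape)                   # torch.Size([1, 32, 16])
```

In this reading, the backbone supplies the language (and text-layout) understanding while the decoder only has to learn conditional denoising, which is consistent with the abstract's claim that a strong multimodal backbone carries much of the burden of reliable bilingual text rendering.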