Qwen-Image Technical Report

📅 2025-08-04

📈 Citations: 0

✨ Influential: 0

career value

202K/year

🤖 AI Summary

To address low rendering fidelity for complex texts—particularly logographic Chinese—and semantic inconsistency and insufficient visual fidelity in image editing, this paper proposes a progressive multi-task training framework. Methodologically, it establishes a large-scale, high-quality text-image data pipeline; designs a curriculum learning strategy that incrementally integrates text rendering, image-to-image reconstruction, and semantic editing tasks; and introduces a dual-encoding mechanism to jointly optimize the semantic understanding capability of Qwen2.5-VL and the generative capacity of MMDiT, while jointly modeling the VAE latent space. The key contribution is the first unified modeling framework enabling high-fidelity rendering of both Chinese and English text alongside fine-grained image editing. It achieves state-of-the-art performance across multiple benchmarks—including Text2Image and text-driven editing—yielding significant improvements in semantic consistency (+12.3% CLIP-Score) and visual quality (+8.7% FID).

Technology Category

Application Category

📝 Abstract

We present Qwen-Image, an image generation foundation model in the Qwen series that achieves significant advances in complex text rendering and precise image editing. To address the challenges of complex text rendering, we design a comprehensive data pipeline that includes large-scale data collection, filtering, annotation, synthesis, and balancing. Moreover, we adopt a progressive training strategy that starts with non-text-to-text rendering, evolves from simple to complex textual inputs, and gradually scales up to paragraph-level descriptions. This curriculum learning approach substantially enhances the model's native text rendering capabilities. As a result, Qwen-Image not only performs exceptionally well in alphabetic languages such as English, but also achieves remarkable progress on more challenging logographic languages like Chinese. To enhance image editing consistency, we introduce an improved multi-task training paradigm that incorporates not only traditional text-to-image (T2I) and text-image-to-image (TI2I) tasks but also image-to-image (I2I) reconstruction, effectively aligning the latent representations between Qwen2.5-VL and MMDiT. Furthermore, we separately feed the original image into Qwen2.5-VL and the VAE encoder to obtain semantic and reconstructive representations, respectively. This dual-encoding mechanism enables the editing module to strike a balance between preserving semantic consistency and maintaining visual fidelity. Qwen-Image achieves state-of-the-art performance, demonstrating its strong capabilities in both image generation and editing across multiple benchmarks.

Problem

Research questions and friction points this paper is trying to address.

Enhancing complex text rendering in image generation models

Improving image editing consistency through multi-task training

Achieving state-of-the-art performance in multilingual text rendering

Innovation

Methods, ideas, or system contributions that make the work stand out.

Comprehensive data pipeline for text rendering

Progressive training strategy for text rendering

Dual-encoding mechanism for image editing

🔎 Similar Papers

No similar papers found.