Qwen-Image-2.0 Technical Report

📅 2026-05-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current image generation models still struggle with ultra-long text rendering, multilingual typography, high-resolution photorealism, complex instruction following, and efficient deployment, particularly in text-dense and compositionally intricate scenarios. This work proposes Qwen-Image-2.0, a unified foundation model that combines high-fidelity image generation with precise image editing. It couples a Qwen3-VL condition encoder with a Multimodal Diffusion Transformer for joint condition-target modeling, and accepts instructions of up to 1K tokens for text-rich content such as slides, posters, infographics, and comics. Supported by large-scale data curation and a customized multi-stage training pipeline, the model improves multilingual text fidelity, typography, photorealism, and adherence to complex prompts. Extensive human evaluations show that it substantially outperforms previous Qwen-Image models in both generation and editing.

📝 Abstract

We present Qwen-Image-2.0, an omni-capable image generation foundation model that unifies high-fidelity generation and precise image editing within a single framework. Despite recent progress, existing models still struggle with ultra-long text rendering, multilingual typography, high-resolution photorealism, robust instruction following, and efficient deployment, especially in text-rich and compositionally complex scenarios. Qwen-Image-2.0 addresses these challenges by coupling Qwen3-VL as the condition encoder with a Multimodal Diffusion Transformer for joint condition-target modeling, supported by large-scale data curation and a customized multi-stage training pipeline. This enables strong multimodal understanding while preserving flexible generation and editing capabilities. The model supports instructions of up to 1K tokens for generating text-rich content such as slides, posters, infographics, and comics, while significantly improving multilingual text fidelity and typography. It also enhances photorealistic generation with richer details, more realistic textures, and coherent lighting, and follows complex prompts more reliably across diverse styles. Extensive human evaluations show that Qwen-Image-2.0 substantially outperforms previous Qwen-Image models in both generation and editing, marking a step toward more general, reliable, and practical image generation foundation models.
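To make "joint condition-target modeling" more concrete, the following is a minimal sketch of the joint attention pattern commonly associated with Multimodal Diffusion Transformers: condition tokens and target (image latent) tokens keep separate projection weights but attend over one concatenated sequence. It is an illustration under stated assumptions, not the report's implementation. The function names, the single attention head, the token counts, and the random weights are all hypothetical, and nothing here reflects the actual Qwen-Image-2.0 layer design.

```python
import math
import random


def project(tokens, weight):
    """Multiply each token (a row vector) by a d_in x d_out weight matrix."""
    columns = list(zip(*weight))
    return [[sum(x * w for x, w in zip(tok, col)) for col in columns] for tok in tokens]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def joint_attention(cond_tokens, target_tokens, w_cond, w_target):
    """Single-head joint attention over condition and target tokens.

    w_cond and w_target are each a list of three matrices (Q, K, V).
    Assumption: each stream has its own projections, as in MMDiT-style blocks.
    """
    q_c, k_c, v_c = (project(cond_tokens, w) for w in w_cond)
    q_t, k_t, v_t = (project(target_tokens, w) for w in w_target)

    # Concatenate both streams so every token can attend to every other token.
    queries, keys, values = q_c + q_t, k_c + k_t, v_c + v_t
    scale = 1.0 / math.sqrt(len(keys[0]))

    outputs = []
    for q in queries:
        weights = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
        outputs.append(
            [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]
        )

    # Split the joint sequence back into its two streams.
    n_cond = len(cond_tokens)
    return outputs[:n_cond], outputs[n_cond:]


def random_matrix(rows, cols, rng):
    """Hypothetical helper: a rows x cols matrix of Gaussian samples."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
dim = 4
cond = random_matrix(3, dim, rng)    # stand-in for encoder outputs (e.g. prompt tokens)
target = random_matrix(5, dim, rng)  # stand-in for noisy image latent tokens
w_cond = [random_matrix(dim, dim, rng) for _ in range(3)]
w_target = [random_matrix(dim, dim, rng) for _ in range(3)]

cond_out, target_out = joint_attention(cond, target, w_cond, w_target)
print(len(cond_out), len(target_out))
```

Because the target tokens attend to the condition tokens inside the same attention call, changing the condition changes the target outputs, which is the property that lets one network serve both generation (text condition) and editing (text plus image condition) in this kind of design.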

Problem

Research questions and friction points this paper is trying to address.

ultra-long text rendering

multilingual typography

high-resolution photorealism

robust instruction following

efficient deployment

Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal diffusion transformer

long-text image generation

multilingual typography