HiDream-O1-Image: A Natively Unified Image Generative Foundation Model with Pixel-level Unified Transformer

📅 2026-05-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work proposes a natively unified foundation model for image generation built on a pixel-space Diffusion Transformer, the Unified Transformer (UiT). It targets a limitation of existing visual generative models: their reliance on separate text encoders and external variational autoencoders (VAEs), which hinders end-to-end unified modeling. The proposed architecture maps image pixels, text, and task-specific conditions into a shared token space, eliminating the need for an independent VAE or a pretrained text encoder. By formulating diverse generation and editing tasks as in-context inference, the model scales well: its 8B-parameter variant matches or surpasses the 27B-parameter Qwen-Image, while the 200B+ parameter Pro version sets new state-of-the-art results across multiple generative benchmarks, with reported gains in output quality and personalization.

📝 Abstract

The evolution of visual generative models has long been constrained by fragmented architectures relying on disjoint text encoders and external VAEs. In this report, we present HiDream-O1-Image, a natively unified generative foundation model built on a pixel-space Diffusion Transformer, that pioneers a paradigm shift from modular architectures to an end-to-end in-context visual generation engine. By mapping raw image pixels, text tokens, and task-specific conditions into a single shared token space, HiDream-O1-Image achieves a structural unification of multimodal inputs within a Unified Transformer (UiT) architecture. This native encoding paradigm eliminates the need for separate VAEs or disjoint pre-trained text encoders, allowing the model to treat diverse generation and editing tasks as a consistent in-context reasoning process. Extensive experiments show that HiDream-O1-Image excels across various generation tasks, including text-to-image generation, instruction-based editing, and subject-driven personalization. Notably, with only 8B parameters, HiDream-O1-Image (8B) matches or even surpasses established state-of-the-art models with significantly more parameters (e.g., the 27B Qwen-Image). Crucially, to validate the scalability of this paradigm, we scale the architecture up to over 200B parameters. Experimental results demonstrate that this massive-scale version, HiDream-O1-Image-Pro (200B+), unlocks stronger generative capabilities and establishes new state-of-the-art results. Ultimately, HiDream-O1-Image highlights the potential of natively unified architectures and charts a highly scalable path toward next-generation multimodal AI.
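
The shared token space described in the abstract can be illustrated with a toy sketch. The abstract does not specify patch size, embedding scheme, token ordering, or how conditions are encoded, so everything below is an assumption for illustration only: the names `patchify`, `linear`, `build_sequence`, `text_table`, and `pixel_proj` are hypothetical, the weights are random, and plain Python lists stand in for tensors. The point is only that text tokens and raw pixel patches can be projected to the same width and concatenated into one sequence, with no VAE or separate text encoder in between.

```python
import random


def patchify(image, patch):
    """Split an H x W x C image (nested lists) into flattened patch vectors."""
    height, width = len(image), len(image[0])
    assert height % patch == 0 and width % patch == 0
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            vec = []
            for y in range(top, top + patch):
                for x in range(left, left + patch):
                    vec.extend(image[y][x])  # append the C channel values
            patches.append(vec)
    return patches


def linear(vec, weight):
    """Project a vector with a (len(vec) x d_model) weight matrix."""
    d_model = len(weight[0])
    return [sum(v * row[j] for v, row in zip(vec, weight)) for j in range(d_model)]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def build_sequence(text_ids, image, patch, text_table, pixel_proj):
    """Concatenate text-token embeddings and raw-pixel patch embeddings.

    Hypothetical layout: text first, then image patches in raster order.
    """
    text_tokens = [text_table[i] for i in text_ids]
    pixel_tokens = [linear(p, pixel_proj) for p in patchify(image, patch)]
    return text_tokens + pixel_tokens


rng = random.Random(0)
D_MODEL, VOCAB, PATCH, CHANNELS = 8, 16, 2, 3  # toy sizes, not from the paper

text_table = random_matrix(VOCAB, D_MODEL, rng)                     # token id -> vector
pixel_proj = random_matrix(PATCH * PATCH * CHANNELS, D_MODEL, rng)  # patch -> vector
image = [[[rng.random() for _ in range(CHANNELS)] for _ in range(4)] for _ in range(4)]

sequence = build_sequence([3, 7, 1], image, PATCH, text_table, pixel_proj)
print(len(sequence), len(sequence[0]))  # 3 text tokens + 4 patches, each of width 8
```

In this toy layout, an editing or personalization task would simply prepend more tokens (for example, patches from a reference image) to the same sequence, which is one way to read the abstract's description of tasks as in-context processes.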

Problem

Research questions and friction points this paper is trying to address.

fragmented architectures

disjoint text encoders

external VAEs

multimodal generation

unified generative model

Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified Transformer

pixel-space diffusion

natively unified architecture