AI Summary
Existing text-to-image generation and image editing methods suffer from suboptimal instruction following, poor edit consistency, and limited native generation fidelity. To address these limitations, we propose the first fully open-source foundation model unifying both tasks, featuring a novel autoregressive-diffusion hybrid architecture: an autoregressive module generates discrete image tokens and extracts semantic latent states, which serve as conditioning inputs to a diffusion module, thereby synergistically leveraging autoregressive reasoning and diffusion-based fine-grained detail modeling. We further incorporate reinforcement learning to optimize instruction alignment and reference-image consistency, augmented by multimodal input conditioning and a high-quality data curation engine. Our method achieves state-of-the-art performance across multiple benchmarks, delivering substantial improvements in generation fidelity, semantic consistency, and editing accuracy.
Abstract
We present BLIP3o-NEXT, a fully open-source foundation model in the BLIP3 series that advances the next frontier of native image generation. BLIP3o-NEXT unifies text-to-image generation and image editing within a single architecture, demonstrating strong capabilities in both tasks. In developing this state-of-the-art native image generation model, we identify four key insights: (1) Most architectural choices yield comparable performance; an architecture can be deemed effective provided it scales efficiently and supports fast inference; (2) The successful application of reinforcement learning can further push the frontier of native image generation; (3) Image editing remains a challenging task, yet instruction following and the consistency between generated and reference images can be significantly enhanced through post-training and a data engine; (4) Data quality and scale continue to be decisive factors that determine the upper bound of model performance. Building upon these insights, BLIP3o-NEXT leverages an Autoregressive + Diffusion architecture in which an autoregressive model first generates discrete image tokens conditioned on multimodal inputs, and its hidden states are then used as conditioning signals for a diffusion model that renders high-fidelity images. This architecture integrates the reasoning strength and instruction following of autoregressive models with the fine-detail rendering ability of diffusion models, achieving a new level of coherence and realism. Extensive evaluations across a range of text-to-image and image-editing benchmarks show that BLIP3o-NEXT achieves superior performance over existing models.
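To make the Autoregressive + Diffusion data flow concrete, here is a minimal PyTorch-style sketch: an autoregressive transformer predicts discrete image tokens and exposes its hidden states, and those hidden states condition a diffusion-style denoiser. All module names, dimensions, and the simplified denoiser are illustrative assumptions for exposition and do not reflect the released BLIP3o-NEXT implementation.

```python
# Illustrative sketch only; not the BLIP3o-NEXT code. Causal masking, timestep
# embeddings, and the full diffusion schedule are omitted for brevity.
import torch
import torch.nn as nn

class ARImageTokenLM(nn.Module):
    """Autoregressive module: predicts discrete image tokens and exposes hidden states."""
    def __init__(self, vocab_size=8192, d_model=512, n_layers=4, n_heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, token_ids):
        hidden = self.backbone(self.embed(token_ids))   # semantic latent states
        logits = self.lm_head(hidden)                   # next-token distribution over image tokens
        return logits, hidden

class ConditionalDiffusionDecoder(nn.Module):
    """Diffusion module: denoises image latents conditioned on AR hidden states."""
    def __init__(self, d_model=512, latent_dim=512):
        super().__init__()
        self.cond_proj = nn.Linear(d_model, latent_dim)
        self.denoiser = nn.Sequential(
            nn.Linear(latent_dim * 2, latent_dim), nn.GELU(),
            nn.Linear(latent_dim, latent_dim),
        )

    def forward(self, noisy_latents, ar_hidden):
        # Pool the AR hidden states into a conditioning vector and broadcast it.
        cond = self.cond_proj(ar_hidden.mean(dim=1, keepdim=True)).expand_as(noisy_latents)
        return self.denoiser(torch.cat([noisy_latents, cond], dim=-1))  # predicted noise

# Toy forward pass: prompt tokens -> AR hidden states -> conditioned denoising step.
prompt_tokens = torch.randint(0, 8192, (1, 16))
ar, diff = ARImageTokenLM(), ConditionalDiffusionDecoder()
logits, hidden = ar(prompt_tokens)
noisy_latents = torch.randn(1, 64, 512)        # noised image latents
pred_noise = diff(noisy_latents, hidden)
print(pred_noise.shape)                        # torch.Size([1, 64, 512])
```

In this reading of the architecture, the autoregressive pass supplies instruction-aware semantic conditioning, while the diffusion pass is responsible for rendering fine visual detail from that conditioning signal.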