Emu3.5: Native Multimodal Models are World Learners

📅 2025-10-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of unifying cross-modal next-state prediction across vision and language. Methodologically, it introduces a native multimodal world model trained end-to-end via next-token prediction on a trillion-scale interleaved corpus of images, video frames, and transcribed text. A key innovation is Discrete Diffusion Adaptation (DiDA), which enables bidirectional parallel decoding and accelerates per-image inference by approximately 20x. The model supports long-horizon generation, any-to-image synthesis, and complex joint image-text composition, with multimodal reasoning further refined via large-scale reinforcement learning. Empirically, it matches Gemini 2.5 Flash Image (Nano Banana) on image generation and editing benchmarks while significantly outperforming it on multimodal interleaved generation tasks. The code and model are publicly released.

📝 Abstract
We introduce Emu3.5, a large-scale multimodal world model that natively predicts the next state across vision and language. Emu3.5 is pre-trained end-to-end with a unified next-token prediction objective on a corpus of vision-language interleaved data containing over 10 trillion tokens, primarily derived from sequential frames and transcripts of internet videos. The model naturally accepts interleaved vision-language inputs and generates interleaved vision-language outputs. Emu3.5 is further post-trained with large-scale reinforcement learning to enhance multimodal reasoning and generation. To improve inference efficiency, we propose Discrete Diffusion Adaptation (DiDA), which converts token-by-token decoding into bidirectional parallel prediction, accelerating per-image inference by about 20x without sacrificing performance. Emu3.5 exhibits strong native multimodal capabilities, including long-horizon vision-language generation, any-to-image (X2I) generation, and complex text-rich image generation. It also exhibits generalizable world-modeling abilities, enabling spatiotemporally consistent world exploration and open-world embodied manipulation across diverse scenarios and tasks. For comparison, Emu3.5 achieves performance comparable to Gemini 2.5 Flash Image (Nano Banana) on image generation and editing tasks and demonstrates superior results on a suite of interleaved generation tasks. We open-source Emu3.5 at https://github.com/baaivision/Emu3.5 to support community research.
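The abstract's core idea is that text tokens and discrete image tokens live in one shared vocabulary, so a single next-token prediction objective covers both modalities. The following is a minimal illustrative sketch of how an interleaved sequence yields standard (input, target) training pairs; all token IDs, the vocabulary offset, and the boundary markers are made-up assumptions, not Emu3.5's actual tokenizer.

```python
# Hypothetical token IDs: text and image tokens share one discrete space,
# so one next-token objective covers both modalities.
BOI, EOI = 100000, 100001                # assumed image-boundary markers
text_tokens = [17, 42, 7]                # e.g. a tokenized caption
image_tokens = [100123, 104201, 101776]  # e.g. VQ codes offset past text vocab

# Interleave modalities into a single flat sequence.
sequence = text_tokens + [BOI] + image_tokens + [EOI]

# Standard next-token prediction: inputs are sequence[:-1], targets sequence[1:].
inputs, targets = sequence[:-1], sequence[1:]
pairs = list(zip(inputs, targets))
```

Because every pair is just (current token, next token), the same autoregressive loss applies whether the model is mid-sentence or mid-image, which is what makes the pre-training objective "unified."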
Problem

Research questions and friction points this paper is trying to address.

How to build a single world model that natively predicts the next vision-language state
How to strengthen multimodal reasoning beyond what pre-training alone provides
How to overcome slow token-by-token image decoding at inference time
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified next-token prediction objective over interleaved vision-language tokens
Discrete Diffusion Adaptation (DiDA): bidirectional parallel prediction for ~20x faster per-image inference
Large-scale reinforcement learning for multimodal reasoning and generation
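The paper describes DiDA only at a high level: token-by-token decoding is converted into bidirectional parallel prediction. A common way such schemes work is iterative mask-predict decoding, where all positions start masked and each step commits the most confident predictions in parallel. The sketch below illustrates that general pattern with a deterministic toy predictor; it is not the paper's actual algorithm, and `toy_predict`, the confidence rule, and the schedule are all invented for illustration.

```python
import math

MASK = -1

def toy_predict(tokens):
    """Stand-in for a real model: returns (prediction, confidence) for each
    masked slot. Confidence here is a deterministic function of position,
    purely for illustration."""
    return {i: (i + 10, 1.0 / (i + 1)) for i, t in enumerate(tokens) if t == MASK}

def parallel_decode(length, steps):
    """Mask-predict decoding in the spirit of discrete diffusion: all positions
    start masked, and each round commits the most confident predictions in
    parallel, so `length` tokens cost `steps` model calls instead of `length`."""
    tokens = [MASK] * length
    per_step = math.ceil(length / steps)
    calls = 0
    while MASK in tokens:
        preds = toy_predict(tokens)
        calls += 1
        # Commit the highest-confidence predictions this round.
        ranked = sorted(preds.items(), key=lambda kv: -kv[1][1])
        for i, (tok, _conf) in ranked[:per_step]:
            tokens[i] = tok
    return tokens, calls
```

With `length=16` and `steps=4`, the loop finishes in 4 model calls rather than 16 sequential ones; a real adaptation would trade the number of refinement steps against image quality, which is presumably how the reported ~20x speedup without performance loss is achieved.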