Emerging Properties in Unified Multimodal Pretraining

📅 2025-05-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limitations of existing open-source multimodal foundation models in unifying understanding and generation. It introduces BAGEL, an open-source, natively multimodal, decoder-only model that supports both understanding and generation within a single architecture. BAGEL is pretrained on trillions of tokens of interleaved text, image, video, and web data, assembled with a dedicated data creation protocol. When scaled on this interleaved data, the model exhibits emerging reasoning capabilities, including free-form image manipulation, future-frame prediction, 3D manipulation, and world navigation. BAGEL significantly outperforms prior open-source unified models on standard multimodal understanding and generation benchmarks. Code, checkpoints, key findings, and pretraining details are publicly released.

📝 Abstract

Unifying multimodal understanding and generation has shown impressive capabilities in cutting-edge proprietary systems. In this work, we introduce BAGEL, an open-source foundational model that natively supports multimodal understanding and generation. BAGEL is a unified, decoder-only model pretrained on trillions of tokens curated from large-scale interleaved text, image, video, and web data. When scaled with such diverse multimodal interleaved data, BAGEL exhibits emerging capabilities in complex multimodal reasoning. As a result, it significantly outperforms open-source unified models in both multimodal generation and understanding across standard benchmarks, while exhibiting advanced multimodal reasoning abilities such as free-form image manipulation, future frame prediction, 3D manipulation, and world navigation. In the hope of facilitating further opportunities for multimodal research, we share the key findings, pretraining details, and data creation protocol, and release our code and checkpoints to the community. The project page is at https://bagel-ai.org/

Problem

Research questions and friction points this paper is trying to address.

Develops BAGEL for unified multimodal understanding and generation

Enhances complex reasoning with diverse interleaved data training

Outperforms open-source unified models on standard multimodal understanding and generation benchmarks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified decoder-only model for multimodal tasks

Pretrained on diverse interleaved text and media

Exhibits emerging complex multimodal reasoning abilities
