🤖 AI Summary
Existing large models struggle to jointly model and freely generate interleaved multimodal sequences—spanning speech, text, images, and video—within a unified framework. This paper introduces MIO, an end-to-end autoregressive multimodal foundation model trained with a four-stage causal multimodal paradigm: alignment pre-training → interleaved pre-training → speech-enhanced pre-training → comprehensive supervised fine-tuning. MIO employs discrete multimodal tokenization and unified causal modeling to enable true any-to-any cross-modal understanding and generation: it supports arbitrary input/output modality combinations and long-range interleaved sequence generation, unlocking novel capabilities such as chain-of-visual-thought reasoning and instructional image editing. On comprehensive cross-modal benchmarks, MIO matches or surpasses state-of-the-art dual-modal, any-to-any, and modality-specific models, and shows particular strength on complex interleaved tasks such as interleaved video-text generation and visual guideline generation.
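The core mechanism—discrete tokens from every modality flattened into one causal stream and trained with next-token prediction—can be illustrated with a minimal sketch. All token IDs, vocabulary sizes, and delimiter tokens below are invented for illustration; they are not MIO's actual values.

```python
# Hypothetical illustration of unified causal modeling over discrete
# multimodal tokens, in the spirit of MIO. Vocab sizes, offsets, and
# the <image>/</image> delimiters are assumptions, not the paper's.

TEXT_VOCAB = 32000            # assumed text vocabulary size
IMAGE_CODEBOOK = 8192         # assumed image tokenizer codebook size
BOI, EOI = "<image>", "</image>"  # hypothetical modality delimiters

def interleave(text_ids, image_codes):
    """Flatten a text span and an image's discrete codes into one
    autoregressive stream: text ... <image> codes ... </image>.
    Image codes are offset past the text vocab so every modality
    lives in a single shared token space."""
    seq = list(text_ids)
    seq.append(BOI)
    seq.extend(TEXT_VOCAB + c for c in image_codes)
    seq.append(EOI)
    return seq

def next_token_pairs(seq):
    """Causal LM objective: predict token t+1 from tokens <= t,
    regardless of which modality each token belongs to."""
    return [(seq[:i], seq[i]) for i in range(1, len(seq))]

stream = interleave([5, 17, 42], [0, 8191])
pairs = next_token_pairs(stream)
# One loss term per position; the same objective covers text-to-image,
# image-to-text, and arbitrarily interleaved generation.
```

Because generation is plain next-token sampling over this shared stream, emitting an `<image>` delimiter mid-sequence is what lets the model produce interleaved multimodal output rather than a single fixed output modality.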
📝 Abstract
In this paper, we introduce MIO, a novel foundation model built on multimodal tokens, capable of understanding and generating speech, text, images, and videos in an end-to-end, autoregressive manner. While large language models (LLMs) and multimodal large language models (MM-LLMs) have propelled advancements in artificial general intelligence through their versatile capabilities, they still lack true any-to-any understanding and generation. Recently, the release of GPT-4o has showcased the remarkable potential of any-to-any LLMs for complex real-world tasks, enabling omnidirectional input and output across images, speech, and text. However, it is closed-source and does not support the generation of multimodal interleaved sequences. To address this gap, we present MIO, which is trained on a mixture of discrete tokens across four modalities using causal multimodal modeling. MIO undergoes a four-stage training process: (1) alignment pre-training, (2) interleaved pre-training, (3) speech-enhanced pre-training, and (4) comprehensive supervised fine-tuning on diverse textual, visual, and speech tasks. Our experimental results indicate that MIO exhibits competitive, and in some cases superior, performance compared to previous dual-modal baselines, any-to-any model baselines, and even modality-specific baselines. Moreover, MIO demonstrates advanced capabilities inherent to its any-to-any nature, such as interleaved video-text generation, chain-of-visual-thought reasoning, visual guideline generation, and instructional image editing.