🤖 AI Summary
Existing multimodal large language models (MLLMs) are typically restricted to fixed modality combinations and cannot flexibly support arbitrary non-linguistic modalities (such as images, audio, video, and point clouds) for interactive understanding and generation.
Method: This paper is a systematic survey of Omni-MLLMs, models that unify understanding and generation across arbitrary combinations of modalities by mapping non-linguistic inputs into the embedding space of an LLM. It introduces a taxonomy of four core components of Omni-MLLMs, explains the two-stage training paradigm (pre-alignment followed by cross-modal instruction tuning), and reviews the supporting technical designs: unified embedding mapping, multimodal instruction data construction, and benchmark evaluation protocols.
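The embedding mapping the survey describes can be illustrated with a minimal sketch: each non-linguistic modality has its own encoder output dimension, and a per-modality projector maps those features into the LLM's token-embedding space so that mixed-modality inputs become one sequence of "soft tokens". All names, dimensions, and the use of plain linear projectors here are illustrative assumptions, not the design of any specific Omni-MLLM.

```python
import random

random.seed(0)

# Hypothetical dimensions: each modality encoder emits features of its own
# size, while the LLM expects token embeddings of size D_LLM.
D_LLM = 8
ENCODER_DIMS = {"image": 6, "audio": 4, "point_cloud": 5}

# One linear projector per modality. The weights here are random
# placeholders; in practice they are learned during pre-alignment.
projectors = {
    m: [[random.uniform(-0.1, 0.1) for _ in range(d)] for _ in range(D_LLM)]
    for m, d in ENCODER_DIMS.items()
}

def project(modality, feature):
    """Map one encoder feature vector into the LLM embedding space."""
    weights = projectors[modality]
    return [sum(w * x for w, x in zip(row, feature)) for row in weights]

# Encoder outputs for a mixed-modality input become a single sequence of
# D_LLM-dimensional soft tokens that the LLM consumes alongside text tokens.
sequence = [
    project("image", [1.0] * ENCODER_DIMS["image"]),
    project("audio", [0.5] * ENCODER_DIMS["audio"]),
]
```

Because every projector targets the same `D_LLM`, the LLM backbone never needs modality-specific changes; adding a modality only adds an encoder and a projector.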
Contributions/Results: (1) The first systematic taxonomy and comprehensive survey of Omni-MLLMs; (2) A consolidated overview of training datasets and evaluation benchmarks, with resources to be made public; (3) A distilled analysis of the key challenges and future research directions in omni-modal alignment, serving both as an introduction for beginners and a reference for building general-purpose multimodal foundation models.
📝 Abstract
To tackle complex tasks in real-world scenarios, a growing number of researchers are focusing on Omni-MLLMs, which aim to achieve omni-modal understanding and generation. Going beyond the constraints of any specific non-linguistic modality, Omni-MLLMs map various non-linguistic modalities into the embedding space of LLMs and enable the interaction and understanding of arbitrary combinations of modalities within a single model. In this paper, we systematically investigate relevant research and provide a comprehensive survey of Omni-MLLMs. Specifically, we first explain the four core components of Omni-MLLMs for unified multi-modal modeling, with a meticulous taxonomy that offers novel perspectives. We then introduce the effective integration achieved through two-stage training and discuss the corresponding training datasets as well as evaluation benchmarks. Furthermore, we summarize the main challenges of current Omni-MLLMs and outline future directions. We hope this paper serves as an introduction for beginners and promotes the advancement of related research. Resources will be made public.
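The two-stage training the abstract mentions can be sketched as a freeze/unfreeze schedule: pre-alignment trains only the modality projectors on paired data while the encoders and LLM stay frozen, and cross-modal instruction tuning then unfreezes the LLM for mixed-modality instruction data. The stage names and module split below are a hedged simplification; individual Omni-MLLMs vary in exactly which parameters each stage updates.

```python
# Hypothetical module inventory of an Omni-MLLM.
MODULES = {"encoders", "projectors", "llm"}

def trainable_modules(stage):
    """Return which modules receive gradients in each training stage.

    A common (but not universal) schedule: stage 1 aligns projector
    outputs with the frozen LLM's embedding space; stage 2 adapts the
    LLM itself to cross-modal instructions.
    """
    if stage == "pre_alignment":
        return {"projectors"}
    if stage == "instruction_tuning":
        return {"projectors", "llm"}
    raise ValueError(f"unknown stage: {stage}")

def frozen_modules(stage):
    """Everything not trainable in a stage stays frozen."""
    return MODULES - trainable_modules(stage)
```

Keeping the pretrained encoders frozen throughout is a common choice in this schedule: it preserves their representations and keeps the number of trainable parameters in stage 1 small.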