🤖 AI Summary
Existing multimodal large language models (MLLMs) are typically restricted to fixed modality combinations and cannot flexibly support arbitrary non-linguistic modalities (such as images, audio, video, and point clouds) for interactive understanding and generation.
Method: This paper is a systematic survey of Omni-MLLMs, models that unify understanding and generation across arbitrary combinations of modalities by mapping non-linguistic inputs into the embedding space of an LLM. It introduces a taxonomy of four core components of Omni-MLLMs, explains the two-stage training paradigm (pre-alignment followed by cross-modal instruction tuning), and reviews the supporting technical designs: unified embedding mapping, multimodal instruction data construction, and benchmark evaluation protocols.
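The embedding mapping the survey describes can be illustrated with a minimal sketch: each non-linguistic modality has its own encoder output dimension, and a per-modality projector maps those features into the LLM's token-embedding space so that mixed-modality inputs become one sequence of "soft tokens". All names, dimensions, and the use of plain linear projectors here are illustrative assumptions, not the design of any specific Omni-MLLM.

```python
import random

random.seed(0)

# Hypothetical dimensions: each modality encoder emits features of its own
# size, while the LLM expects token embeddings of size D_LLM.
D_LLM = 8
ENCODER_DIMS = {"image": 6, "audio": 4, "point_cloud": 5}

# One linear projector per modality. The weights here are random
# placeholders; in practice they are learned during pre-alignment.
projectors = {
    m: [[random.uniform(-0.1, 0.1) for _ in range(d)] for _ in range(D_LLM)]
    for m, d in ENCODER_DIMS.items()
}

def project(modality, feature):
    """Map one encoder feature vector into the LLM embedding space."""
    weights = projectors[modality]
    return [sum(w * x for w, x in zip(row, feature)) for row in weights]

# Encoder outputs for a mixed-modality input become a single sequence of
# D_LLM-dimensional soft tokens that the LLM consumes alongside text tokens.
sequence = [
    project("image", [1.0] * ENCODER_DIMS["image"]),
    project("audio", [0.5] * ENCODER_DIMS["audio"]),
]
```

Because every projector targets the same `D_LLM`, the LLM backbone never needs modality-specific changes; adding a modality only adds an encoder and a projector.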
Contributions/Results: (1) The first systematic taxonomy and comprehensive survey of Omni-MLLMs; (2) A consolidated overview of training datasets and evaluation benchmarks, with resources to be made public; (3) A distilled analysis of the key challenges and future research directions in omni-modal alignment, serving both as an introduction for beginners and a reference for building general-purpose multimodal foundation models.
📝 Abstract
To tackle complex tasks in real-world scenarios, a growing number of researchers are focusing on Omni-MLLMs, which aim to achieve omni-modal understanding and generation. Going beyond the constraints of any specific non-linguistic modality, Omni-MLLMs map various non-linguistic modalities into the embedding space of LLMs and enable the interaction and understanding of arbitrary combinations of modalities within a single model. In this paper, we systematically investigate relevant research and provide a comprehensive survey of Omni-MLLMs. Specifically, we first explain the four core components of Omni-MLLMs for unified multi-modal modeling, with a meticulous taxonomy that offers novel perspectives. We then introduce the effective integration achieved through two-stage training and discuss the corresponding training datasets as well as evaluation benchmarks. Furthermore, we summarize the main challenges of current Omni-MLLMs and outline future directions. We hope this paper serves as an introduction for beginners and promotes the advancement of related research. Resources will be made public.
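The two-stage training the abstract mentions can be sketched as a freeze/unfreeze schedule: pre-alignment trains only the modality projectors on paired data while the encoders and LLM stay frozen, and cross-modal instruction tuning then unfreezes the LLM for mixed-modality instruction data. The stage names and module split below are a hedged simplification; individual Omni-MLLMs vary in exactly which parameters each stage updates.

```python
# Hypothetical module inventory of an Omni-MLLM.
MODULES = {"encoders", "projectors", "llm"}

def trainable_modules(stage):
    """Return which modules receive gradients in each training stage.

    A common (but not universal) schedule: stage 1 aligns projector
    outputs with the frozen LLM's embedding space; stage 2 adapts the
    LLM itself to cross-modal instructions.
    """
    if stage == "pre_alignment":
        return {"projectors"}
    if stage == "instruction_tuning":
        return {"projectors", "llm"}
    raise ValueError(f"unknown stage: {stage}")

def frozen_modules(stage):
    """Everything not trainable in a stage stays frozen."""
    return MODULES - trainable_modules(stage)
```

Keeping the pretrained encoders frozen throughout is a common choice in this schedule: it preserves their representations and keeps the number of trainable parameters in stage 1 small.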