OneLLM: One Framework to Align All Modalities with Language

📅 2023-12-06
🏛️ Computer Vision and Pattern Recognition
📈 Citations: 63
Influential: 7
📄 PDF
🤖 AI Summary
Current multimodal large language models (MLLMs) are tightly coupled to modality-specific encoders and generalize poorly across modalities, hindering unified processing of heterogeneous non-linguistic data. To address this, we propose OneLLM, a unified framework enabling end-to-end alignment from eight distinct modalities—images, audio, video, point clouds, depth maps, normal maps, IMU signals, and fMRI—to language. Our method introduces: (1) modality-agnostic representations via a unified multimodal encoder and a universal projection module (UPM); (2) a progressive cross-modal alignment paradigm spanning all eight modalities; and (3) a large-scale multimodal instruction dataset covering seven non-text modalities and containing 2 million samples. Evaluated on 25 cross-modal benchmarks—including captioning, question answering, and reasoning tasks—OneLLM achieves state-of-the-art or near-state-of-the-art performance. The code, models, dataset, and interactive demo are fully open-sourced.
📝 Abstract
Multimodal large language models (MLLMs) have gained significant attention due to their strong multimodal understanding capability. However, existing works rely heavily on modality-specific encoders, which usually differ in architecture and are limited to common modalities. In this paper, we present OneLLM, an MLLM that aligns eight modalities to language using a unified framework. We achieve this through a unified multimodal encoder and a progressive multimodal alignment pipeline. In detail, we first train an image projection module to connect a vision encoder with the LLM. Then, we build a universal projection module (UPM) by mixing multiple image projection modules with dynamic routing. Finally, we progressively align more modalities to the LLM with the UPM. To fully leverage the potential of OneLLM in following instructions, we also curate a comprehensive multimodal instruction dataset, including 2M items from image, audio, video, point cloud, depth/normal map, IMU and fMRI brain activity. OneLLM is evaluated on 25 diverse benchmarks, encompassing tasks such as multimodal captioning, question answering and reasoning, where it delivers excellent performance. Code, data, model and online demo are available at https://github.com/csuhan/OneLLM.
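The UPM described above mixes multiple projection modules via dynamic routing. The abstract gives no implementation details, so the following is a minimal NumPy sketch of the general idea under assumed simplifications: each "expert" is a single linear projection, and a learned router produces per-token soft mixture weights (all names, dimensions, and the linear-expert choice are hypothetical, not the paper's actual architecture).

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class UniversalProjection:
    """Hypothetical sketch of a UPM: K parallel projection 'experts'
    combined by input-dependent soft routing weights."""

    def __init__(self, d_in, d_out, num_experts=3):
        # One linear projection per expert (stand-in for full projection modules).
        self.experts = [rng.normal(0, 0.02, (d_in, d_out)) for _ in range(num_experts)]
        # Router maps each token to a score per expert.
        self.router = rng.normal(0, 0.02, (d_in, num_experts))

    def __call__(self, tokens):
        # tokens: (seq_len, d_in) features from the shared multimodal encoder.
        weights = softmax(tokens @ self.router)                    # (seq_len, K)
        projected = np.stack([tokens @ W for W in self.experts], axis=1)  # (seq_len, K, d_out)
        # Weighted sum over experts gives modality-agnostic LLM inputs.
        return (weights[..., None] * projected).sum(axis=1)        # (seq_len, d_out)

upm = UniversalProjection(d_in=16, d_out=8)
out = upm(rng.normal(size=(4, 16)))
print(out.shape)  # (4, 8)
```

Soft routing keeps the module differentiable end-to-end, which fits the progressive alignment pipeline: new modalities can reuse the same experts while the router learns modality-appropriate mixtures.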
Problem

Research questions and friction points this paper is trying to address.

- Multimodal
- Large Language Models
- Intermodality Generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

- OneLLM
- Multi-modal large language model
- Progressive alignment procedure