Baichuan-Omni Technical Report

📅 2024-10-11
🏛️ arXiv.org
📈 Citations: 3
Influential: 1
🤖 AI Summary
High-performance open-source multimodal large language models (MLLMs) remain scarce, hindering unified processing of images, videos, audio, and text with real-time interactive capabilities. To address this, we propose Baichuan-Omni—the first open-source 7B-parameter MLLM—introducing a novel two-stage training paradigm: “modality alignment” followed by “multi-task fine-tuning.” It integrates a CLIP-based visual encoder, a Whisper-based audio encoder, and learnable modality adapters to achieve unified cross-modal representation learning and joint reasoning across vision, audio, and language. Evaluated on major benchmarks—including OmniBench, MMBench, and VideoMME—Baichuan-Omni achieves state-of-the-art performance among open-source models. Moreover, it supports low-latency, real-time speech–vision–text interaction. By bridging the gap between capability and openness, Baichuan-Omni establishes a new foundation for accessible, high-fidelity multimodal intelligence research and applications.
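To make the wiring described above concrete, here is a minimal PyTorch sketch of an adapter-based omni-modal model: frozen CLIP-style vision and Whisper-style audio encoders are assumed to run outside the module and supply feature sequences, which learnable adapters project into the language model's embedding space. All class names, layer choices, and dimensions below (e.g. ModalityAdapter, the toy 512-dim decoder) are illustrative assumptions, not the released Baichuan-Omni implementation.

```python
# Minimal sketch of the adapter-based wiring described above; encoder outputs
# (CLIP-style vision, Whisper-style audio) are assumed to be precomputed.
# Sizes are toy values, not the real 7B configuration.
import torch
import torch.nn as nn


class ModalityAdapter(nn.Module):
    """Learnable projection from an encoder's feature space to the LLM's."""

    def __init__(self, in_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(in_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.proj(feats)


class OmniModalLM(nn.Module):
    """Concatenate adapted vision/audio tokens with text embeddings and decode."""

    def __init__(self, vision_dim=1024, audio_dim=1280, llm_dim=512, vocab=1000):
        super().__init__()
        self.vision_adapter = ModalityAdapter(vision_dim, llm_dim)
        self.audio_adapter = ModalityAdapter(audio_dim, llm_dim)
        self.text_embed = nn.Embedding(vocab, llm_dim)
        # Stand-in for the 7B language model: a tiny Transformer stack.
        layer = nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(llm_dim, vocab)

    def forward(self, vision_feats, audio_feats, text_ids):
        v = self.vision_adapter(vision_feats)   # (B, Nv, llm_dim)
        a = self.audio_adapter(audio_feats)     # (B, Na, llm_dim)
        t = self.text_embed(text_ids)           # (B, Nt, llm_dim)
        x = torch.cat([v, a, t], dim=1)         # one shared token sequence
        return self.lm_head(self.decoder(x))    # (B, Nv+Na+Nt, vocab)


# Smoke test with random features standing in for encoder outputs.
model = OmniModalLM()
logits = model(
    torch.randn(2, 16, 1024),                  # 16 visual tokens per sample
    torch.randn(2, 50, 1280),                  # 50 audio frames per sample
    torch.randint(0, 1000, (2, 8)),            # 8 text tokens per sample
)
print(logits.shape)                             # torch.Size([2, 74, 1000])
```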

📝 Abstract
The salient multimodal capabilities and interactive experience of GPT-4o highlight its critical role in practical applications, yet it lacks a high-performing open-source counterpart. In this paper, we introduce Baichuan-Omni, the first open-source 7B Multimodal Large Language Model (MLLM) adept at concurrently processing and analyzing image, video, audio, and text modalities, while delivering an advanced multimodal interactive experience and strong performance. We propose an effective multimodal training schema that starts from a 7B model and proceeds through two stages of multimodal alignment and multitask fine-tuning across the audio, image, video, and text modalities. This approach equips the language model with the ability to handle visual and audio data effectively. Baichuan-Omni demonstrates strong performance across various omni-modal and multimodal benchmarks, and we aim for this contribution to serve as a competitive baseline for the open-source community in advancing multimodal understanding and real-time interaction.
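As a rough illustration of the two-stage schema in the abstract (multimodal alignment, then multitask fine-tuning), the sketch below reuses the OmniModalLM toy model from the earlier snippet: stage 1 trains only the modality adapters with the language-model weights frozen, and stage 2 unfreezes the language model as well. The stage split, learning rates, and the choice to keep the encoders frozen are assumptions made for illustration, not details taken from the paper.

```python
# Hedged sketch of a two-stage schedule: stage 1 ("multimodal alignment")
# updates only the adapters; stage 2 ("multitask fine-tuning") also updates
# the LLM. Reuses the OmniModalLM toy model defined above; the learning rates
# and helper names are made up for this example.
import torch


def set_trainable(module: torch.nn.Module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad = flag


def configure_stage(model, stage: int) -> torch.optim.Optimizer:
    # Adapters are trainable in both stages.
    set_trainable(model.vision_adapter, True)
    set_trainable(model.audio_adapter, True)
    # The language-model parts are unfrozen only for multitask fine-tuning.
    for part in (model.text_embed, model.decoder, model.lm_head):
        set_trainable(part, stage == 2)
    lr = 1e-3 if stage == 1 else 2e-5           # smaller LR once the LLM trains
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=lr)


# Usage:
# opt = configure_stage(model, stage=1)   # alignment: adapters only
# ...train on paired image/audio/video-text data...
# opt = configure_stage(model, stage=2)   # multitask fine-tuning: adapters + LLM
```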
Problem

Research questions and friction points this paper is trying to address.

Multimodal Processing
Open-source Model
Interactive Experience
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-modal processing
Large-scale model
Open-source community
👥 Authors
Yadong Li (Baichuan Inc.)
Haoze Sun (Tsinghua University): Low-level image processing, Image super-resolution, Diffusion generation model
Mingan Lin (Baichuan Inc.): LLM, MLLM, AI
Tianpeng Li (Baichuan Inc.)
Guosheng Dong (Baichuan Inc.)
Tao Zhang (Baichuan Inc.)
Bowen Ding (Westlake University, Zhejiang University)
Wei Song (Westlake University, Zhejiang University)
Zhenglin Cheng (Zhejiang University & Westlake University, SII): Multimodal Learning, Diffusion Models
Yuqi Huo (Bytedance Inc.): Multi-modal pretraining
Song Chen (Baichuan Inc.)
Xu Li (Baichuan Inc.)
Da Pan (Baichuan Inc.)
Shusen Zhang (Baichuan Inc.)
Xin Wu (Baichuan Inc.)
Zheng Liang (Baichuan Inc.)
Jun Liu (Baichuan Inc.)
Keer Lu (Baichuan Inc.)
Yaqi Zhao (Baichuan Inc.)
Yanjun Shen (Center for Agricultural Resources Research, Chinese Academy of Sciences): Evapotranspiration, hydrology, ecohydrology, agricultural water management
Fan Yang (Baichuan Inc.)
Kaicheng Yu (Assistant Professor, Westlake University, PI of Autonomous Intelligence Lab): Computer vision, 3D understanding, autonomous perception, automatic machine learning
Tao Lin (Westlake University)
Jianhua Xu (University of Electronic Science and Technology of China): Multi-Agent, Evolutionary Games, LLM-Agents
Zenan Zhou (Baichuan Inc.)
Weipeng Chen (Baichuan Inc.)