Baichuan-Omni-1.5 Technical Report

📅 2025-01-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the limited cross-modal understanding and generation capabilities of existing multimodal models in specialized domains such as healthcare, this paper proposes an end-to-end Multimodal Large Language Model (MLLM) that jointly perceives text, images, and audio and supports high-fidelity speech synthesis. The method introduces three key innovations: (1) Baichuan-Audio-Tokenizer, an audio tokenizer that encodes both semantic and acoustic information; (2) a multi-stage progressive alignment training framework enabling unified cross-modal representation learning and multi-task co-optimization; and (3) an efficient multimodal data cleaning and synthetic data generation pipeline that ensures high-quality training corpora. Experiments demonstrate that the model surpasses GPT-4o-mini and MiniCPM-o 2.6 on general multimodal benchmarks, performs on par with Qwen2-VL-72B on multimodal medical evaluation tasks, and integrates all modalities without sacrificing the performance of any individual one.

📝 Abstract
We introduce Baichuan-Omni-1.5, an omni-modal model that not only has omni-modal understanding capabilities but also provides end-to-end audio generation capabilities. To achieve fluent and high-quality interaction across modalities without compromising the capabilities of any modality, we prioritized optimizing three key aspects. First, we establish a comprehensive data cleaning and synthesis pipeline for multimodal data, obtaining approximately 500B tokens of high-quality data (text, audio, and vision). Second, we design an audio tokenizer (Baichuan-Audio-Tokenizer) that captures both semantic and acoustic information from audio, enabling seamless integration and enhanced compatibility with MLLMs. Lastly, we design a multi-stage training strategy that progressively integrates multimodal alignment and multitask fine-tuning, ensuring effective synergy across all modalities. Baichuan-Omni-1.5 leads contemporary models (including GPT-4o-mini and MiniCPM-o 2.6) in comprehensive omni-modal capabilities. Notably, it achieves results comparable to leading models such as Qwen2-VL-72B across various multimodal medical benchmarks.
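The abstract's dual-encoding idea — tokenizing audio so that each frame carries both a semantic code (linguistic content) and an acoustic code (speaker/prosody detail) — can be illustrated with a toy sketch. This is not the actual Baichuan-Audio-Tokenizer: the real tokenizer uses learned neural encoders, and all names, sizes, and the random codebooks below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; the real tokenizer's dimensions differ.
FRAME_DIM = 16   # per-frame feature dimension (assumption)
SEM_CODES = 32   # semantic codebook size (assumption)
ACO_CODES = 64   # acoustic codebook size (assumption)

# Two independent codebooks: in the paper's framing, one preserves
# linguistic content, the other acoustic detail. Random stand-ins here.
semantic_codebook = rng.normal(size=(SEM_CODES, FRAME_DIM))
acoustic_codebook = rng.normal(size=(ACO_CODES, FRAME_DIM))

def quantize(frames: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map each frame to the index of its nearest codebook vector."""
    # (T, 1, D) - (1, K, D) -> (T, K) squared distances
    d2 = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)

def dual_encode(frames: np.ndarray) -> list:
    """Return one (semantic_id, acoustic_id) token pair per audio frame."""
    sem = quantize(frames, semantic_codebook)
    aco = quantize(frames, acoustic_codebook)
    return list(zip(sem.tolist(), aco.tolist()))

frames = rng.normal(size=(10, FRAME_DIM))  # 10 fake audio frames
tokens = dual_encode(frames)
print(len(tokens))  # one token pair per frame
```

Pairing the two code streams is what lets a downstream LLM consume audio as discrete tokens while a decoder can still reconstruct speech with the original voice characteristics.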
Problem

Research questions and friction points this paper is trying to address.

Multimodal Information Processing
Artificial Intelligence
Healthcare Application
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-modal Integration
Audio Comprehension Enhancement
Robust Training Methodology
👥 Authors
Yadong Li, Jun Liu, Tao Zhang, Song Chen, Tianpeng Li, Zehuan Li, Lijun Liu, Lingfeng Ming, Guosheng Dong, Dawei Pan, Chong Li, Yuanbo Fang, Dongdong Kuang, Mingrui Wang, Chenglin Zhu, Youwei Zhang, Hongyu Guo, Fengyu Zhang, Yuran Wang, Bowen Ding, Wei Song, Xu Li, Yuqi Huo, Zheng Liang, Shusen Zhang, Xin Wu, Shuai Zhao, Linchu Xiong, Yozhen Wu, Jiahui Ye, Wenhao Lu, Bowen Li, Yan Zhang, Yaqi Zhou, Xin Chen, Lei Su, Hongda Zhang, Fuzhong Chen, Xuezhen Dong, Na Nie, Zhiying Wu, Bin Xiao, Ting Li, Shunya Dang, Ping Zhang, Yijia Sun, Jincheng Wu, Jinjie Yang, Xionghai Lin, Zhi Ma, Kegeng Wu, Jia Li, Aiyuan Yang, Hui Liu, Jianqiang Zhang, Xiaoxi Chen, Guangwei Ai, Wentao Zhang, Yicong Chen, Xiaoqin Huang, Kun Li, Wenjing Luo, Yifei Duan, Lingling Zhu, Ran Xiao, Zhe Su, Jiani Pu, Dian Wang, Xu Jia, Tianyu Zhang, Mengyu Ai, Mang Wang, Yujing Qiao, Lei Zhang, Yanjun Shen, Fan Yang, Miao Zhen, Yijie Zhou, Mingyang Chen, Fei Li, Chenzheng Zhu, Keer Lu, Yaqi Zhao, Hao Liang, Youquan Li, Yanzhao Qin, Linzhuang Sun, Jianhua Xu, Haoze Sun, Mingan Lin, Zenan Zhou, Weipeng Chen