🤖 AI Summary
To address the limited cross-modal understanding and generation capabilities of existing multimodal models in specialized domains such as healthcare, this paper proposes an end-to-end Multimodal Large Language Model (MLLM) that supports joint perception of text, images, and audio, along with high-fidelity speech synthesis. The method introduces three key innovations: (1) Baichuan-Audio-Tokenizer, a semantic-acoustic dual-encoding audio tokenizer; (2) a multi-stage progressive alignment training framework that enables unified cross-modal representation learning and multi-task co-optimization; and (3) an efficient multimodal data cleaning and synthetic-data generation pipeline that ensures high-quality training corpora. Experiments show that the model surpasses GPT-4o-mini and MiniCPM-o 2.6 on general multimodal benchmarks, achieves performance on par with Qwen2-VL-72B on multimodal medical evaluation tasks, and, critically, fuses all modalities without degrading the capability of any single one.
📝 Abstract
We introduce Baichuan-Omni-1.5, an omni-modal model that offers not only omni-modal understanding capabilities but also end-to-end audio generation capabilities. To achieve fluent, high-quality interaction across modalities without compromising the capabilities of any modality, we prioritize optimizing three key aspects. First, we establish a comprehensive data cleaning and synthesis pipeline for multimodal data, obtaining about 500B high-quality data (text, audio, and vision). Second, we design an audio tokenizer (Baichuan-Audio-Tokenizer) that captures both semantic and acoustic information from audio, enabling seamless integration and enhanced compatibility with the MLLM. Third, we design a multi-stage training strategy that progressively integrates multimodal alignment and multitask fine-tuning, ensuring effective synergy across all modalities. Baichuan-Omni-1.5 leads contemporary models, including GPT-4o-mini and MiniCPM-o 2.6, in comprehensive omni-modal capabilities. Notably, it achieves results comparable to leading models such as Qwen2-VL-72B across various multimodal medical benchmarks.
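The dual-encoding idea behind the audio tokenizer can be sketched as two parallel branches that each project frame-level acoustic features into their own space and vector-quantize them against a codebook, yielding aligned semantic and acoustic token streams for the language model. The following is a toy illustration of that general pattern, not the actual Baichuan-Audio-Tokenizer: all class, weight, and dimension names here are assumptions, and real systems learn the encoders and codebooks rather than sampling them randomly.

```python
import numpy as np

rng = np.random.default_rng(0)

class DualEncoderTokenizer:
    """Toy semantic-acoustic dual-encoding tokenizer (illustrative only).

    Each branch linearly projects per-frame features and then performs
    nearest-neighbour vector quantization against its own codebook,
    producing one discrete token per frame per branch.
    """

    def __init__(self, feat_dim=80, code_dim=16, codebook_size=256):
        # Stand-ins for trained encoder weights and VQ codebooks.
        self.w_sem = rng.standard_normal((feat_dim, code_dim))
        self.w_ac = rng.standard_normal((feat_dim, code_dim))
        self.cb_sem = rng.standard_normal((codebook_size, code_dim))
        self.cb_ac = rng.standard_normal((codebook_size, code_dim))

    @staticmethod
    def _quantize(z, codebook):
        # Squared-distance lookup: index of the nearest codebook entry.
        d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        return d.argmin(axis=1)

    def encode(self, frames):
        # frames: (T, feat_dim) acoustic features, e.g. mel filterbanks.
        sem = self._quantize(frames @ self.w_sem, self.cb_sem)
        ac = self._quantize(frames @ self.w_ac, self.cb_ac)
        return sem, ac  # two aligned token streams, length T each

tok = DualEncoderTokenizer()
sem_ids, ac_ids = tok.encode(rng.standard_normal((50, 80)))
```

In a trained system the semantic branch would be supervised toward transcription-like content while the acoustic branch preserves speaker and prosody detail, which is what lets the same token stream serve both understanding and high-fidelity speech generation.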