🤖 AI Summary
To address the limited cross-modal understanding and generation capabilities of existing multimodal models in specialized domains such as healthcare, this paper proposes an end-to-end Multimodal Large Language Model (MLLM) that supports joint perception of text, images, and audio, along with high-fidelity speech synthesis. The method introduces three key innovations: (1) Baichuan-Audio-Tokenizer, a semantic-acoustic dual-encoding audio tokenizer; (2) a multi-stage progressive alignment training framework that enables unified cross-modal representation learning and multi-task co-optimization; and (3) an efficient multimodal data cleaning and synthetic-data generation pipeline that ensures high-quality training corpora. Experiments show that the model surpasses GPT-4o-mini and MiniCPM-o 2.6 on general multimodal benchmarks, achieves performance on par with Qwen2-VL-72B on multimodal medical evaluation tasks, and, critically, fuses all modalities without degrading the capability of any single one.
📝 Abstract
We introduce Baichuan-Omni-1.5, an omni-modal model that offers not only omni-modal understanding capabilities but also end-to-end audio generation capabilities. To achieve fluent, high-quality interaction across modalities without compromising the capabilities of any modality, we prioritize optimizing three key aspects. First, we establish a comprehensive data cleaning and synthesis pipeline for multimodal data, obtaining about 500B high-quality data (text, audio, and vision). Second, we design an audio tokenizer (Baichuan-Audio-Tokenizer) that captures both semantic and acoustic information from audio, enabling seamless integration and enhanced compatibility with the MLLM. Third, we design a multi-stage training strategy that progressively integrates multimodal alignment and multitask fine-tuning, ensuring effective synergy across all modalities. Baichuan-Omni-1.5 leads contemporary models, including GPT-4o-mini and MiniCPM-o 2.6, in comprehensive omni-modal capabilities. Notably, it achieves results comparable to leading models such as Qwen2-VL-72B across various multimodal medical benchmarks.
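The dual-encoding idea behind the audio tokenizer can be sketched as two parallel branches that each project frame-level acoustic features into their own space and vector-quantize them against a codebook, yielding aligned semantic and acoustic token streams for the language model. The following is a toy illustration of that general pattern, not the actual Baichuan-Audio-Tokenizer: all class, weight, and dimension names here are assumptions, and real systems learn the encoders and codebooks rather than sampling them randomly.

```python
import numpy as np

rng = np.random.default_rng(0)

class DualEncoderTokenizer:
    """Toy semantic-acoustic dual-encoding tokenizer (illustrative only).

    Each branch linearly projects per-frame features and then performs
    nearest-neighbour vector quantization against its own codebook,
    producing one discrete token per frame per branch.
    """

    def __init__(self, feat_dim=80, code_dim=16, codebook_size=256):
        # Stand-ins for trained encoder weights and VQ codebooks.
        self.w_sem = rng.standard_normal((feat_dim, code_dim))
        self.w_ac = rng.standard_normal((feat_dim, code_dim))
        self.cb_sem = rng.standard_normal((codebook_size, code_dim))
        self.cb_ac = rng.standard_normal((codebook_size, code_dim))

    @staticmethod
    def _quantize(z, codebook):
        # Squared-distance lookup: index of the nearest codebook entry.
        d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        return d.argmin(axis=1)

    def encode(self, frames):
        # frames: (T, feat_dim) acoustic features, e.g. mel filterbanks.
        sem = self._quantize(frames @ self.w_sem, self.cb_sem)
        ac = self._quantize(frames @ self.w_ac, self.cb_ac)
        return sem, ac  # two aligned token streams, length T each

tok = DualEncoderTokenizer()
sem_ids, ac_ids = tok.encode(rng.standard_normal((50, 80)))
```

In a trained system the semantic branch would be supervised toward transcription-like content while the acoustic branch preserves speaker and prosody detail, which is what lets the same token stream serve both understanding and high-fidelity speech generation.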