Baichuan 2: Open Large-scale Language Models

📅 2023-09-19
🏛️ arXiv.org
📈 Citations: 658
Influential: 77
🤖 AI Summary
Most powerful large language models (LLMs) are closed-source or limited in their capability for languages other than English. To address this, the work introduces Baichuan 2, a series of open-source multilingual LLMs with 7 billion and 13 billion parameters, trained from scratch on 2.6 trillion tokens. Baichuan 2 matches or outperforms other open-source models of similar size on public benchmarks including MMLU, CMMLU, GSM8K, and HumanEval, and it performs strongly in vertical domains such as medicine and law. The authors state that all pre-training model checkpoints will be released so that the research community can study the model's training dynamics.
📝 Abstract
Large language models (LLMs) have demonstrated remarkable performance on a variety of natural language tasks based on just a few examples of natural language instructions, reducing the need for extensive feature engineering. However, most powerful LLMs are closed-source or limited in their capability for languages other than English. In this technical report, we present Baichuan 2, a series of large-scale multilingual language models containing 7 billion and 13 billion parameters, trained from scratch, on 2.6 trillion tokens. Baichuan 2 matches or outperforms other open-source models of similar size on public benchmarks like MMLU, CMMLU, GSM8K, and HumanEval. Furthermore, Baichuan 2 excels in vertical domains such as medicine and law. We will release all pre-training model checkpoints to benefit the research community in better understanding the training dynamics of Baichuan 2.
Problem

Research questions and friction points this paper is trying to address.

Develop open-source multilingual large language models
Enhance performance in specialized domains such as medicine and law
Provide pre-training checkpoints for research community
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multilingual large-scale models with 7B/13B parameters
Trained from scratch on 2.6 trillion tokens
Open-source pre-training checkpoints for research
Ai Ming Yang, Baichuan Inc.
Bin Xiao, Meta GenAI (Computer Vision, Vision and Language, Machine Learning, Human Pose Estimation)
Bingning Wang, Baichuan Inc. (NLP, Question Answering, Large language model)
Borong Zhang, University of Macau (Reinforcement learning, Robotics)
Chao Yin, Baichuan Inc.
Chenxu Lv, Baichuan Inc.
Da Pan, Baichuan Inc.
Dian Wang, Stanford University (Robot Learning, Robotics, Machine Learning, Geometric Deep Learning, Reinforcement Learning)
Dong Yan, AI Chief Expert, Bosch (Reinforcement Learning, Foundation Model)
Fan Yang, Baichuan Inc.
Fei Deng, Research Scientist, Google (Diffusion Models, RLHF, Reinforcement Learning, Generative Models, Object-Centric Learning)
Feng Wang, Baichuan Inc.
Feng Liu, Baichuan Inc.
Guangwei Ai, Baichuan Inc.
Guosheng Dong, Baichuan Inc.
Hai Zhao, Baichuan Inc.
Hang Xu, Baichuan Inc.
Hao-Lun Sun, Baichuan Inc.
Hongda Zhang, Baichuan Inc.
Hui Liu, Baichuan Inc.
Jiaming Ji, Baichuan Inc.
Jian Xie, Baichuan Inc.
Juntao Dai, Baichuan Inc.
Kuncheng Fang, Baichuan Inc.
Lei Su, Baichuan Inc.
Liang Song, Baichuan Inc.
Lifeng Liu, Baichuan Inc.
Liyun Ru, Baichuan Inc.
Luyao Ma, Baichuan Inc.
Mang Wang, Baichuan Inc.
Mickel Liu, University of Washington (Reinforcement Learning, Multi-Agent Learning, Natural Language Processing)
MingAn Lin, Baichuan Inc.
Nuolan Nie, Baichuan Inc.
Pei Guo, Soochow University (LLMs, Natural Language Generation)
Ruiyang Sun, Baichuan Inc.
Zhang Tao, Baichuan Inc.
Tianpeng Li, Baichuan Inc.
Tianyu Li, Baichuan Inc.
Wei Cheng, Baichuan Inc.
Weipeng Chen, Baichuan Inc.
Xiangrong Zeng, Baichuan Inc.
Xiaochuan Wang, Baichuan Inc.
Xiaoxi Chen, University of Illinois Urbana-Champaign (Diagnostic Radiology, Translational Medicine, Quantitative Medical Imaging, AI in Medical Imaging)
Xin Men, Baichuan Inc.
Xin Yu, Baichuan Inc.
Xuehai Pan, Peking University (Multi-Agent Learning, Reinforcement Learning, AI Alignment, AI Agents)
Yan-Bin Shen, Baichuan Inc.
Yiding Wang, Baichuan Inc.
Yiyu Li, Baichuan Inc.
Youxin Jiang, Baichuan Inc.
Yuchen Gao, Baichuan Inc.
Yupeng Zhang, Baichuan Inc.
Zenan Zhou, Baichuan Inc.
Zhiying Wu, Baichuan Inc.