Sailor2: Sailing in South-East Asia with Inclusive Multilingual LLMs

📅 2025-02-18
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study addresses the critical shortage of AI support for low-resource Southeast Asian (SEA) languages. We introduce Sailor2, an open-source multilingual large language model (LLM) series, comprising 1B, 8B, and 20B parameter variants, designed specifically for 13 SEA languages while retaining strong Chinese and English capabilities. Methodologically, we propose an SEA-centric continual pretraining paradigm over 500B tokens (400B SEA-specific data plus 100B cross-lingual replay data), built on Qwen2.5. We further release a comprehensive five-stage multilingual LLM development cookbook covering data curation, pre-training, post-training, model customization, and evaluation. Evaluation shows Sailor2-20B achieves parity with GPT-4o on SEA-language benchmarks (a 50-50 win rate). All models and fully reproducible training pipelines are released under the Apache 2.0 license, substantially advancing equitable AI ecosystem development in the region.
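The 400B-to-100B data mixture described above amounts to an 80/20 split between SEA-specific and cross-lingual replay data. As a rough illustration only (this is not the Sailor2 pipeline; the function name and document-level sampling are hypothetical), batch-level mixing with a replay ratio can be sketched as:

```python
import random

def sample_mixed_batch(sea_docs, replay_docs, batch_size, replay_ratio=0.2, seed=0):
    """Sample a pretraining batch mixing SEA-specific and replay documents.

    replay_ratio=0.2 mirrors the paper's 100B replay tokens out of 500B total.
    Illustrative sketch: real pipelines mix at the token level with weighted
    sampling over many sources, not document-level choice like this.
    """
    rng = random.Random(seed)
    n_replay = round(batch_size * replay_ratio)   # replay slots in this batch
    n_sea = batch_size - n_replay                 # SEA-specific slots
    batch = [rng.choice(sea_docs) for _ in range(n_sea)]
    batch += [rng.choice(replay_docs) for _ in range(n_replay)]
    rng.shuffle(batch)                            # interleave the two sources
    return batch
```

The replay fraction is the key knob: it trades SEA-language gains against forgetting of the base model's English and Chinese abilities.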

📝 Abstract
Sailor2 is a family of cutting-edge multilingual language models for South-East Asian (SEA) languages, available in 1B, 8B, and 20B sizes to suit diverse applications. Building on Qwen2.5, Sailor2 undergoes continuous pre-training on 500B tokens (400B SEA-specific and 100B replay tokens) to support 13 SEA languages while retaining proficiency in Chinese and English. The Sailor2-20B model achieves a 50-50 win rate against GPT-4o across SEA languages. We also deliver a comprehensive cookbook on how to develop multilingual models in an efficient manner, covering five key aspects: data curation, pre-training, post-training, model customization, and evaluation. We hope that the Sailor2 models (Apache 2.0 license) will drive language development in the SEA region, and that the Sailor2 cookbook will inspire researchers to build more inclusive LLMs for other under-served languages.
Problem

Research questions and friction points this paper is trying to address.

Low-resource SEA languages lack capable open-source LLM support
Continual pre-training must add SEA proficiency without degrading English and Chinese
Inclusive multilingual LLM development lacks an open, reproducible methodology
Innovation

Methods, ideas, or system contributions that make the work stand out.

Open multilingual model family in 1B, 8B, and 20B sizes
SEA-centric continual pre-training with cross-lingual replay
Comprehensive five-stage development cookbook (data curation, pre-training, post-training, model customization, evaluation)
Authors

Longxu Dou · Research Scientist at Sea AI Lab · Natural Language Processing
Qian Liu · Sea AI Lab
Fan Zhou · SJTU
Changyu Chen · Graduate Student, Singapore Management University · Reinforcement Learning, Deep Generative Models
Zili Wang · StepFun LLM Researcher & M-A-P · Large Language Models, Code Intelligence
Ziqi Jin · SUTD
Zichen Liu · Sea AI Lab
Tongyao Zhu · National University of Singapore · Natural Language Processing
Cunxiao Du · Research Scientist at Sea AI Lab · NLP, LLM Inference
Penghui Yang · CCDS, Nanyang Technological University · Machine Learning
Haonan Wang · NUS
Jiaheng Liu
Yongchi Zhao
Xiachong Feng · The University of Hong Kong (HKU) · Natural Language Processing
Xin Mao · NTU
Man Tsung Yeung · NUS
Kunat Pipatanakul · SCB 10X · Large Language Models, Low-resource NLP
Fajri Koto · Assistant Professor (tenure-track), MBZUAI · Computational Linguistics, Natural Language Processing, Multilingual NLP, Human-centered NLP
Min Si Thu · Peafowl.ai
Hynek Kydlíček · Hugging Face
Zeyi Liu · Tsinghua University · Safety-guaranteed Control, Safety Assessment, Fault Diagnosis, Online Learning
Qunshu Lin · Co-Founder of Abaka.AI · Data-Centric AI
Sittipong Sripaisarnmongkol · SCB 10X
Kridtaphad Sae-Khow · WiseSight
Nirattisai Thongchim · WiseSight
Taechawat Konkaew · WiseSight
Narong Borijindargoon · WiseSight
Anh Dao · Undergraduate Student, Michigan State University · Vision-language, Multimodal LLM, Embodied AI, LLM
Matichon Maneegard · Float16.cloud
Phakphum Artkaew · NYU
Zheng-Xin Yong · Brown University · Machine Learning
Quan Nguyen · Umeå University
Wannaphong Phatthiyaphaibun · PhD Student, Vidyasirimedhi Institute of Science and Technology · Natural Language Processing, Artificial Intelligence
Hoang H. Tran · Ho Chi Minh City University of Technology · Natural Language Processing, Interpretability, Reinforcement Learning
Mike Zhang · Aalborg University (Copenhagen) · Artificial Intelligence, Natural Language Processing, Information Extraction, NLP Applications
Shiqi Chen · CityU
Tianyu Pang · Sea AI Lab
Chao Du · Sea AI Lab
Xinyi Wan · Sea AI Lab · ML Systems
Wei Lu · SUTD
Min Lin · Principal Research Scientist, Sea AI Lab · Artificial Intelligence