Aurora-M: Open Source Continual Pre-training for Multilingual Language and Code

📅 2024-03-30
📈 Citations: 5
Influential: 0
🤖 AI Summary
To address catastrophic forgetting, insufficient safety alignment, high computational cost, and weak regulatory compliance in multilingual large language models (LLMs) under continual learning, this work introduces Aurora-M, a 15B-parameter open-source multilingual LLM. Starting from the StarCoderPlus checkpoint, Aurora-M is continually pretrained on 435B additional tokens of English, Finnish, Hindi, Japanese, Vietnamese, and code data, bringing its total training count above 2T tokens. The work proposes the first human-audited safety instruction tuning procedure, explicitly designed to meet the safety, trustworthiness, and governance requirements of the U.S. Biden-Harris Executive Order on AI, complemented by red-teaming and regulatory-compliance validation. Experiments show substantial improvements over baselines in multilingual understanding, code generation, and safety evaluation, alongside robust retention of prior capabilities under continual learning. The model and its variants are publicly released on Hugging Face.
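For a concrete picture of the setup described above, the sketch below shows one way to resume causal-language-model training from the public StarCoderPlus checkpoint on a mixed multilingual-and-code corpus with the Hugging Face Trainer. It is an illustration only, not the authors' training code: the data file, sequence length, and hyperparameters are placeholders.

```python
# Illustrative sketch: continual pretraining from a public base checkpoint on a
# mixed multilingual + code corpus. Data file, sequence length, and
# hyperparameters are placeholders, not the authors' actual configuration.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base_id = "bigcode/starcoderplus"  # base checkpoint Aurora-M starts from
tokenizer = AutoTokenizer.from_pretrained(base_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # needed for batch padding
model = AutoModelForCausalLM.from_pretrained(base_id)

# Placeholder corpus standing in for the English/Finnish/Hindi/Japanese/
# Vietnamese/code mixture (435B tokens in the paper).
raw = load_dataset("text", data_files={"train": "multilingual_and_code.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

train = raw["train"].map(tokenize, batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="aurora-m-continual",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=64,
    learning_rate=1e-4,
    max_steps=10_000,
    bf16=True,
    save_steps=1_000,
)
Trainer(model=model, args=args, train_dataset=train,
        data_collator=collator).train()
```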

📝 Abstract
Pretrained language models are an integral part of AI applications, but their high computational cost for training limits accessibility. Initiatives such as Bloom and StarCoder aim to democratize access to pretrained models for collaborative community development. Despite these efforts, such models encounter challenges such as limited multilingual capabilities, risks of catastrophic forgetting during continual pretraining, and the high costs of training models from scratch, alongside the need to align with AI safety standards and regulatory frameworks. This paper presents Aurora-M, a 15B parameter multilingual open-source model trained on English, Finnish, Hindi, Japanese, Vietnamese, and code. Continually pretrained from StarCoderPlus on 435B additional tokens, Aurora-M surpasses 2T tokens in total training token count. It is the first open-source multilingual model fine-tuned on human-reviewed safety instructions, thus aligning its development not only with conventional red-teaming considerations, but also with the specific concerns articulated in the Biden-Harris Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence. We evaluate Aurora-M across a wide range of tasks and languages, showcasing its robustness against catastrophic forgetting and its superior performance in multilingual settings, particularly in safety evaluations. We open-source Aurora-M and its variants to encourage responsible open-source development of large language models at https://huggingface.co/aurora-m.
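A minimal usage sketch with the Hugging Face transformers library follows; the repository id aurora-m/aurora-m-base is assumed for illustration, and the actual released variants should be looked up at https://huggingface.co/aurora-m.

```python
# Minimal sketch: loading an Aurora-M checkpoint and generating text.
# The repository id below is assumed; see https://huggingface.co/aurora-m
# for the models actually published by the authors.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "aurora-m/aurora-m-base"  # hypothetical repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Multilingual prompt (Finnish) followed by a short greedy generation.
prompt = "Suomen pääkaupunki on"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```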
Problem

Research questions and friction points this paper is trying to address.

Multilingual Support
Memory Retention in Continual Learning
High Computational Cost
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multilingual Pre-training
Continual Learning
AI Safety Compliance
🔎 Similar Papers
No similar papers found.
👥 Authors
Taishi Nakamura (Institute of Science Tokyo): artificial general intelligence, large language models, machine learning
Mayank Mishra (MIT-IBM Watson Lab)
Simone Tedeschi (Applied Scientist @ Amazon): Natural Language Processing, Large Language Models, Responsible AI
Yekun Chai (Baidu): natural language processing, machine learning
Jason T Stillerman
Felix Friedrich (postdoc @ Meta FAIR, Montreal): Multimodal AI, Generative AI, AI Alignment, AI Safety
Prateek Yadav (PhD, University of North Carolina Chapel Hill): Continual Learning, MoE, Model Merging, Modular Network, Efficient AI
Tanmay Laud (Hippocratic AI, University Of California San Diego): AI, NLP, NLU, ML, Deep Learning
Vu Minh Chien (Detomo Inc.)
Terry Yue Zhuo (Researcher): Large Language Models, Code Generation, AI4SE, Cybersecurity
Diganta Misra (Mila - Quebec AI Institute, Carnegie Mellon University)
Ben Bogin (Google): NLP
Xuan-Son Vu (WASP Media & Language, Umeå University, DeepTensor AB)
Marzena Karpinska (Senior Researcher at Microsoft): natural language processing, language models, evaluation
Arnav Dantuluri
Wojciech Kusa (NASK National Research Institute): Natural Language Processing, Information Retrieval, Machine Learning, LLMs
Tommaso Furlanello
Rio Yokota (Professor, Institute of Science Tokyo): high performance computing, large scale deep learning, hierarchical low-rank matrices, GPU computing
Niklas Muennighoff (Stanford University): large language models, artificial intelligence, machine learning
Suhas Pai
Tosin P. Adewumi (Luleå University of Technology, EISLAB, Machine Learning Group): artificial intelligence, machine learning, NLP, algorithms, blockchain
Veronika Laippala
Xiaozhe Yao (ETH Zurich): Machine Learning Systems, Machine Learning, LLMs
Adalberto Junior
Alpay Ariyak (RunPod, OpenChat)
Aleksandr Drozd (RIKEN CCS)
Jordan Clive (Chattermill AI)
Kshitij Gupta (Mila - Quebec AI Institute)
Liangyu Chen
Qi Sun
Ken Tsui
Noah Persaud
Nour Fahmy
Tianlong Chen (Assistant Professor, CS@UNC Chapel Hill; Chief AI Scientist, hireEZ): Machine Learning, AI4Science, Computer Vision, Sparsity
Mohit Bansal (Parker Distinguished Professor, Computer Science, UNC Chapel Hill): Natural Language Processing, Computer Vision, Machine Learning, Multimodal AI
Nicolò Monti (ASC)
Tai Dang (University of Massachusetts Amherst)
Ziyang Luo (Salesforce AI Research): Agents, LLMs, Multimodal
Tien-Tung Bui (DopikAI JSC)
Roberto Navigli (Professor, Sapienza University of Rome): Natural Language Processing, Semantics, Computational Linguistics, Knowledge Acquisition, Artificial Intelligence
Virendra Mehta (University of Trento)
Matthew Blumberg (GridRepublic)
Victor May (Google): Machine Learning
Huu Nguyen (Ontocord.ai): LLMs, Data mining, NLP, AI ethics
Sampo Pyysalo (University of Turku)