Second Language (Arabic) Acquisition of LLMs via Progressive Vocabulary Expansion

📅 2024-12-16
🏛️ arXiv.org
📈 Citations: 5
Influential: 0
🤖 AI Summary
Arabic large language models (LLMs) suffer from severe out-of-vocabulary (OOV) issues and knowledge degradation due to static vocabulary constraints. To address this, we propose a cognitively inspired, progressive vocabulary expansion method—motivated by second-language acquisition—that dynamically incorporates Arabic subwords during pretraining, balancing decoding efficiency and knowledge retention. Our approach integrates an enhanced dynamic Byte-Pair Encoding (BPE) tokenizer, a controllable vocabulary growth strategy, and an Arabic-specific pretraining–evaluation framework. The resulting model, AraLLaMA, achieves state-of-the-art performance across multiple Arabic benchmarks, matching or exceeding the best open-source Arabic LLMs. We fully open-source the model weights, training data, and code. This work is the first to incorporate cognitive linguistic principles into vocabulary learning for LLMs, establishing a scalable, principled paradigm for training LLMs in low-resource languages.
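The core mechanism can be pictured with a toy BPE loop. The sketch below is a minimal illustration, not the authors' implementation: the corpus, stage schedule, and function names are all hypothetical. Each stage learns a few more merges from Arabic text and appends them after the existing ones, so earlier token IDs stay stable while Arabic coverage grows:

```python
from collections import Counter

def apply_merge(words, pair):
    """Rewrite every word, fusing each occurrence of `pair` into one symbol."""
    out = []
    for symbols, freq in words:
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        out.append((merged, freq))
    return out

def expand_vocab(corpus, merges, n_new):
    """Learn `n_new` extra BPE merges on top of an existing merge list."""
    words = [(list(w), f) for w, f in corpus]
    for pair in merges:                    # replay the frozen merges first
        words = apply_merge(words, pair)
    for _ in range(n_new):                 # then grow the vocabulary
        counts = Counter()
        for symbols, freq in words:
            for bigram in zip(symbols, symbols[1:]):
                counts[bigram] += freq
        if not counts:
            break
        best = counts.most_common(1)[0][0]
        merges.append(best)
        words = apply_merge(words, best)
    return merges

# Toy Arabic corpus as (word, frequency) pairs; real pretraining data differs.
corpus = [("السلام", 5), ("الكتاب", 3), ("المدرسة", 2)]

merges = []
for stage, n_new in enumerate([3, 3, 4]):  # hypothetical growth schedule
    merges = expand_vocab(corpus, merges, n_new)
    print(f"stage {stage}: vocabulary extended to {len(merges)} merges")
```

Appending merges rather than retraining the tokenizer from scratch is what keeps previously learned token embeddings valid between stages.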

📝 Abstract
This paper addresses the critical need for democratizing large language models (LLMs) in the Arab world, a region that has seen slower progress in developing models comparable to state-of-the-art offerings like GPT-4 or GPT-3.5, due to a predominant focus on mainstream languages (e.g., English and Chinese). One practical objective for an Arabic LLM is to use an Arabic-specific tokenizer vocabulary, which can speed up decoding. However, adopting a different vocabulary often degrades learned knowledge, since many words are initially out-of-vocabulary (OOV) when training starts. Inspired by vocabulary learning during second language (Arabic) acquisition in humans, the released AraLLaMA employs progressive vocabulary expansion, implemented by a modified BPE algorithm that progressively extends the Arabic subwords in its dynamic vocabulary during training, thereby balancing the OOV ratio at every stage. An ablation study demonstrates the effectiveness of progressive vocabulary expansion. Moreover, AraLLaMA achieves performance comparable to the best Arabic LLMs across a variety of Arabic benchmarks. Models, training data, benchmarks, and code will all be open-sourced.
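To make "balancing the OOV ratio at every stage" concrete, here is one plausible proxy measurement (our assumption; the paper's exact metric may differ): track the fraction of Arabic words the current vocabulary cannot cover with a single token, and trigger an expansion stage whenever it exceeds a threshold.

```python
def oov_ratio(words, vocab):
    """Fraction of words not covered by a single vocabulary entry.

    With byte-level BPE nothing is strictly OOV, but a word shattered
    into many pieces slows decoding much like an OOV word would, so
    single-token coverage is used as a stand-in here (an assumption,
    not the paper's exact definition).
    """
    return sum(w not in vocab for w in words) / max(len(words), 1)

# Hypothetical usage: expand until the Arabic OOV ratio is tolerable.
sample = "السلام عليكم ورحمة الله وبركاته".split()
vocab = {"السلام", "الله"}             # toy current vocabulary
THRESHOLD = 0.3                        # illustrative trigger level
while oov_ratio(sample, vocab) > THRESHOLD:
    # Stand-in for one expansion stage: the real pipeline would learn
    # new BPE merges; here we simply admit the longest missing word.
    vocab.add(max((w for w in sample if w not in vocab), key=len))

print(f"final OOV ratio: {oov_ratio(sample, vocab):.2f}")
```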
Problem

Research questions and friction points this paper is trying to address.

Democratizing Arabic large language models to address regional development gaps
Reducing degradation of learned knowledge when switching to an Arabic-specific tokenizer
Balancing out-of-vocabulary ratios during Arabic language model training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Progressive vocabulary expansion for Arabic LLMs
Modified BPE algorithm with dynamic vocabulary
Balancing the OOV ratio across training stages (see the sketch below)
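On the model side, absorbing a new batch of subwords mid-training mainly means growing the tokenizer and the embedding matrices together. A minimal sketch using the Hugging Face transformers API (the checkpoint name and subword list are placeholders; AraLLaMA's actual pipeline is not detailed in this summary):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "meta-llama/Llama-2-7b-hf"      # placeholder LLaMA-family checkpoint
tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE)

# Hypothetical batch of Arabic subwords chosen by the expansion schedule.
new_subwords = ["الس", "لام", "كتاب", "مدرس"]
num_added = tokenizer.add_tokens(new_subwords)

# Grow the input/output embeddings to cover the new IDs; the fresh rows
# are randomly initialized and must be trained during the next stage.
if num_added > 0:
    model.resize_token_embeddings(len(tokenizer))
```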
🔎 Similar Papers
No similar papers found.
👥 Authors
Jianqing Zhu
King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
Huang Huang
Shenzhen International Center for Industrial and Applied Mathematics, Shenzhen Research Institute of Big Data
Zhihang Lin
Xiamen University & Shanghai Innovation Institute
Efficient Artificial Intelligence
Juhao Liang
Shenzhen Research Institute of Big Data, Shenzhen, China
Zhengyang Tang
The Chinese University of Hong Kong, Shenzhen, China
Large Language Models · Mathematical Reasoning · Information Retrieval
Khalid Almubarak
King Abdulaziz University, Jeddah, Saudi Arabia
Abdulmohsen Alharthi
King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
Bang An
University of Maryland, College Park
Machine Learning · Natural Language Processing
Juncai He
Assistant Professor, Yau Mathematical Sciences Center, Tsinghua University
Deep Neural Networks · Numerical Analysis · Finite Element Method · Multigrid
Xiangbo Wu
Shenzhen Research Institute of Big Data, Shenzhen, China
Fei Yu
The Chinese University of Hong Kong, Shenzhen, China
Junying Chen
The Chinese University of Hong Kong, Shenzhen, China
Zhuoheng Ma
The Chinese University of Hong Kong, Shenzhen, China
Yuhao Du
The Chinese University of Hong Kong, Shenzhen, China
He Zhang
The Chinese University of Hong Kong, Shenzhen, China
Emad A. Alghamdi
King Abdulaziz University, Jeddah, Saudi Arabia
Lian Zhang
Student of Electrical Engineering and Computer Science, Vanderbilt University
Intelligent Human Machine Systems · Machine Learning · Artificial Intelligence · Affective Computing · Human-Computer Interactions
Ruoyu Sun
The Chinese University of Hong Kong, Shenzhen, China
Haizhou Li
The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen), China; NUS, Singapore
Automatic Speech Recognition · Speaker Recognition · Language Recognition · Voice Conversion · Machine Translation
Benyou Wang
Assistant Professor, The Chinese University of Hong Kong, Shenzhen
large language models · natural language processing · information retrieval · applied machine learning
Jinchao Xu
Professor of Applied Mathematics and Computational Sciences, KAUST
multigrid · domain decomposition · finite element methods · iterative methods · deep neural networks