Llama-3.1-Sherkala-8B-Chat: An Open Large Language Model for Kazakh

📅 2025-03-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the limited knowledge coverage and weak reasoning performance in Kazakh of existing open-source large language models (LLMs). The authors introduce Llama-3.1-Sherkala-8B-Chat (Sherkala-Chat (8B)), an open-weight instruction-tuned LLM designed specifically for Kazakh. Adapted from the LLaMA-3.1-8B architecture, the model is trained on 45.3 billion tokens across Kazakh, English, Russian, and Turkish, followed by instruction tuning and safety alignment. On Kazakh-language benchmarks, Sherkala-Chat (8B) significantly outperforms existing open Kazakh and multilingual models of similar scale while remaining competitive on English tasks. The model weights are publicly released, together with a detailed account of training, fine-tuning, safety alignment, and evaluation, to support research and deployment for this low-resource language.

📝 Abstract
Llama-3.1-Sherkala-8B-Chat, or Sherkala-Chat (8B) for short, is a state-of-the-art instruction-tuned open generative large language model (LLM) designed for Kazakh. Sherkala-Chat (8B) aims to enhance the inclusivity of LLM advancements for Kazakh speakers. Adapted from the LLaMA-3.1-8B model, Sherkala-Chat (8B) is trained on 45.3B tokens across Kazakh, English, Russian, and Turkish. With 8 billion parameters, it demonstrates strong knowledge and reasoning abilities in Kazakh, significantly outperforming existing open Kazakh and multilingual models of similar scale while achieving competitive performance in English. We release Sherkala-Chat (8B) as an open-weight instruction-tuned model and provide a detailed overview of its training, fine-tuning, safety alignment, and evaluation, aiming to advance research and support diverse real-world applications.
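Since Sherkala-Chat (8B) is adapted from LLaMA-3.1-8B, it presumably inherits the Llama-3.1 chat prompt format with `<|start_header_id|>`/`<|eot_id|>` special tokens. A minimal sketch of hand-building such a prompt follows; the exact template for Sherkala-Chat is an assumption based on the base model, not confirmed by this page:

```python
def build_chat_prompt(system: str, user: str) -> str:
    """Build a Llama-3.1-style chat prompt (format assumed from the
    LLaMA-3.1-8B base model that Sherkala-Chat (8B) adapts)."""
    return (
        "<|begin_of_text|>"
        "<|start_header_id|>system<|end_header_id|>\n\n"
        f"{system}<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user}<|eot_id|>"
        # The assistant header is left open so the model continues from here.
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

prompt = build_chat_prompt(
    "You are a helpful Kazakh-speaking assistant.",
    "Қазақстанның астанасы қай қала?",  # "Which city is the capital of Kazakhstan?"
)
print(prompt)
```

In practice, one would load the released weights with a library such as Hugging Face `transformers` and use the tokenizer's built-in `apply_chat_template` rather than hand-building the string, so the template always matches the released model.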
Problem

Research questions and friction points this paper is trying to address.

Existing open-source LLMs offer limited knowledge coverage and weak reasoning in Kazakh
Kazakh speakers are largely excluded from recent LLM advancements
No open model of comparable scale performs strongly on Kazakh while remaining competitive in English
Innovation

Methods, ideas, or system contributions that make the work stand out.

Open-weight instruction-tuned LLM adapted from LLaMA-3.1-8B for Kazakh
Multilingual training on 45.3B tokens across Kazakh, English, Russian, and Turkish
Detailed documentation of training, fine-tuning, safety alignment, and evaluation
👥 Authors
Fajri Koto
Assistant Professor (tenure-track), MBZUAI
Computational Linguistics, Natural Language Processing, Multilingual NLP, Human-centered NLP
Rituraj Joshi
Applied ML Scientist, Cerebras Systems
Machine Learning, Deep Learning, Artificial Intelligence, Algorithms
Nurdaulet Mukhituly
PhD in NLP, MBZUAI
Natural Language Processing, Mechanistic Interpretability, AI Safety
Yuxia Wang
MBZUAI
Natural Language Processing
Zhuohan Xie
MBZUAI
Financial AI, Reasoning, Natural Language Processing, Computational Linguistics, Deep Learning
Rahul Pal
Inception, UAE
Daniil Orel
MBZUAI
Astrophysics, low-resource NLP, IoT
Parvez Mullah
Inception, UAE
Diana Turmakhan
MBZUAI
Low-Resource NLP, Multi-modal NLP
Maiya Goloburda
Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)
Uncertainty Quantification, Trustworthy NLP, LLM Safety, Low-resource NLP
Mohammed Kamran
Inception, UAE
Samujjwal Ghosh
Inception
Large Language Models, Natural Language Processing, Domain Adaptation, Graph Neural Networks
Bokang Jia
Inception, UAE
Jonibek Mansurov
PhD student in NLP, MBZUAI
NLP
Mukhammed Togmanov
NLP Researcher at MBZUAI
LLM and Deep Machine Learning
Debopriyo Banerjee
Postdoctoral Researcher, Mohamed bin Zayed University of Artificial Intelligence
Natural Language Processing, Recommendation Systems, NeuroSymbolic-AI
Nurkhan Laiyk
MBZUAI
NLP
Akhmed Sakip
Mohamed bin Zayed University of Artificial Intelligence, UAE
Xudong Han
Mohamed bin Zayed University of Artificial Intelligence, UAE
Ekaterina Kochmar
Assistant Professor, Natural Language Processing Department, MBZUAI
Natural Language Processing, Machine Learning, Artificial Intelligence in Education
Alham Fikri Aji
MBZUAI, Monash Indonesia
Multilinguality, Low-resource NLP, Language Modeling, Machine Translation
Aaryamonvikram Singh
IFM, MBZUAI
NLP, LLMs
Alok Anil Jadhav
Mohamed bin Zayed University of Artificial Intelligence, UAE
Satheesh Katipomu
Inception, UAE
Samta Kamboj
Inception, UAE
Monojit Choudhury
Professor of Natural Language Processing, MBZUAI
Natural Language Processing, Large Language Models, Ethics of AI, Computational Social Science
Gurpreet Gosal
Cerebras Systems
language modelling, machine learning, optimization
Gokul Ramakrishnan
Cerebras Systems
Biswajit Mishra
Cerebras Systems
Sarath Chandran
Cerebras Systems
Avraham Sheinin
Cerebras Systems
Natalia Vassilieva
Sr. Director of Product, Cerebras Systems
image analysis, information retrieval, information extraction, machine learning, natural language processing
Neha Sengupta
Inception, UAE
Larry Murray
Inception, UAE
Preslav Nakov
Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)
Computational Linguistics, Large Language Models, Fact-checking, Fake News