🤖 AI Summary
This work addresses the limited Kazakh knowledge coverage and weak Kazakh reasoning of existing open-source large language models (LLMs). The authors introduce Llama-3.1-Sherkala-8B-Chat (Sherkala-Chat (8B)), an open-weight instruction-tuned LLM designed specifically for Kazakh. Adapted from the LLaMA-3.1-8B model, Sherkala-Chat (8B) is trained on 45.3 billion tokens spanning Kazakh, English, Russian, and Turkish, followed by instruction tuning and safety alignment. This design strengthens Kazakh linguistic understanding and cultural grounding while drawing on cross-lingual transfer from English, Russian, and Turkish. On Kazakh-language benchmarks, Sherkala-Chat (8B) significantly outperforms open Kazakh and multilingual models of similar scale while remaining competitive on English tasks. The model weights are publicly released, along with a detailed account of training, fine-tuning, safety alignment, and evaluation, providing a foundation for LLM research and deployment in low-resource languages.
📝 Abstract
Llama-3.1-Sherkala-8B-Chat, or Sherkala-Chat (8B) for short, is a state-of-the-art instruction-tuned open generative large language model (LLM) designed for Kazakh. Sherkala-Chat (8B) aims to enhance the inclusivity of LLM advancements for Kazakh speakers. Adapted from the LLaMA-3.1-8B model, Sherkala-Chat (8B) is trained on 45.3B tokens across Kazakh, English, Russian, and Turkish. With 8 billion parameters, it demonstrates strong knowledge and reasoning abilities in Kazakh, significantly outperforming existing open Kazakh and multilingual models of similar scale while achieving competitive performance in English. We release Sherkala-Chat (8B) as an open-weight instruction-tuned model and provide a detailed overview of its training, fine-tuning, safety alignment, and evaluation, aiming to advance research and support diverse real-world applications.
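
Since the model is released as open weights, a minimal usage sketch with Hugging Face `transformers` is shown below. The repository ID `inceptionai/Llama-3.1-Sherkala-8B-Chat` is an assumption inferred from the model name, and the chat-template call assumes the released tokenizer ships a chat template, as Llama-3.1-based chat models typically do; verify both on the Hub before use.

```python
# Minimal sketch of loading the released weights with Hugging Face
# transformers. The repo ID below is assumed from the model name.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "inceptionai/Llama-3.1-Sherkala-8B-Chat"  # assumed repository ID

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # ~16 GB for 8B params in bf16
    device_map="auto",
)

# Instruction-tuned models expect their own prompt formatting, so build
# the prompt via the tokenizer's chat template (assumed to be present).
messages = [
    # "Which city is the capital of Kazakhstan?"
    {"role": "user", "content": "Қазақстанның астанасы қай қала?"}
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=128, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```

Greedy decoding (`do_sample=False`) is used here only to keep the sketch deterministic; sampling parameters can be adjusted for real applications.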