🤖 AI Summary
This work addresses the limited Kazakh knowledge coverage and weak Kazakh reasoning of existing open-source large language models (LLMs). The authors introduce Llama-3.1-Sherkala-8B-Chat (Sherkala-Chat (8B)), an open-weight instruction-tuned LLM designed specifically for Kazakh. Adapted from the LLaMA-3.1-8B model, Sherkala-Chat (8B) is trained on 45.3 billion tokens spanning Kazakh, English, Russian, and Turkish, followed by instruction tuning and safety alignment. This design strengthens Kazakh linguistic understanding and cultural grounding while drawing on cross-lingual transfer from English, Russian, and Turkish. On Kazakh-language benchmarks, Sherkala-Chat (8B) significantly outperforms open Kazakh and multilingual models of similar scale while remaining competitive on English tasks. The model weights are publicly released, along with a detailed account of training, fine-tuning, safety alignment, and evaluation, providing a foundation for LLM research and deployment in low-resource languages.
📝 Abstract
Llama-3.1-Sherkala-8B-Chat, or Sherkala-Chat (8B) for short, is a state-of-the-art instruction-tuned open generative large language model (LLM) designed for Kazakh. Sherkala-Chat (8B) aims to enhance the inclusivity of LLM advancements for Kazakh speakers. Adapted from the LLaMA-3.1-8B model, Sherkala-Chat (8B) is trained on 45.3B tokens across Kazakh, English, Russian, and Turkish. With 8 billion parameters, it demonstrates strong knowledge and reasoning abilities in Kazakh, significantly outperforming existing open Kazakh and multilingual models of similar scale while achieving competitive performance in English. We release Sherkala-Chat (8B) as an open-weight instruction-tuned model and provide a detailed overview of its training, fine-tuning, safety alignment, and evaluation, aiming to advance research and support diverse real-world applications.
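
Since the model is released as open weights, a minimal usage sketch with Hugging Face `transformers` is shown below. The repository ID `inceptionai/Llama-3.1-Sherkala-8B-Chat` is an assumption inferred from the model name, and the chat-template call assumes the released tokenizer ships a chat template, as Llama-3.1-based chat models typically do; verify both on the Hub before use.

```python
# Minimal sketch of loading the released weights with Hugging Face
# transformers. The repo ID below is assumed from the model name.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "inceptionai/Llama-3.1-Sherkala-8B-Chat"  # assumed repository ID

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # ~16 GB for 8B params in bf16
    device_map="auto",
)

# Instruction-tuned models expect their own prompt formatting, so build
# the prompt via the tokenizer's chat template (assumed to be present).
messages = [
    # "Which city is the capital of Kazakhstan?"
    {"role": "user", "content": "Қазақстанның астанасы қай қала?"}
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=128, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```

Greedy decoding (`do_sample=False`) is used here only to keep the sketch deterministic; sampling parameters can be adjusted for real applications.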