🤖 AI Summary
To address the severe underrepresentation of the low-resource language Sinhala in open-source large language models (LLMs), this work introduces the first decoder-only, fully open-source LLM designed specifically for Sinhala. Methodologically, the authors extend Llama-3-8B by augmenting its tokenizer with Sinhala-specific vocabulary, perform continued pretraining on a cleaned corpus of 10 million Sinhala texts, and then apply instruction fine-tuning. The key contributions are: (1) the first end-to-end, fully open-source Sinhala-optimized LLM, SinLlama; and (2) substantial performance gains on downstream tasks, outperforming both the base Llama-3-8B and its instruction-tuned variant across three text classification benchmarks. This work establishes a reproducible technical pipeline and a baseline for developing LLMs for other low-resource languages.
📝 Abstract
Low-resource languages such as Sinhala are often overlooked by open-source Large Language Models (LLMs). In this research, we extend an existing multilingual LLM, Llama-3-8B, to better serve Sinhala. We enhance the LLM tokenizer with Sinhala-specific vocabulary and perform continual pre-training on a cleaned corpus of 10 million Sinhala texts, resulting in the SinLlama model. This is the first decoder-only open-source LLM with explicit Sinhala support. When SinLlama was instruction fine-tuned for three text classification tasks, it outperformed the base and instruct variants of Llama-3-8B by a significant margin.
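The tokenizer-augmentation step described above can be illustrated with a minimal sketch. This is not the authors' actual code: a real pipeline would typically train a SentencePiece model on the Sinhala corpus, merge its pieces into the Llama-3 tokenizer, and resize the model's embedding matrix to match the new vocabulary size. The function name and toy vocabularies below are illustrative assumptions.

```python
# Hypothetical sketch: extend a base tokenizer vocabulary with
# language-specific subword pieces, assigning fresh ids to pieces
# that are not already present.

def augment_vocab(base_vocab, new_pieces):
    """Return a copy of base_vocab with unseen pieces appended.

    base_vocab: dict mapping token string -> integer id
    new_pieces: iterable of candidate subword strings (e.g. from a
                SentencePiece model trained on the Sinhala corpus)
    """
    vocab = dict(base_vocab)
    next_id = max(vocab.values()) + 1 if vocab else 0
    for piece in new_pieces:
        if piece not in vocab:  # skip pieces the base tokenizer already has
            vocab[piece] = next_id
            next_id += 1
    return vocab

# Toy example: 3 base tokens, 2 unique Sinhala pieces (one duplicate).
base = {"<s>": 0, "</s>": 1, "hello": 2}
sinhala_pieces = ["සිංහ", "ල", "සිංහ"]
merged = augment_vocab(base, sinhala_pieces)
print(len(merged))  # 5: two new unique pieces were added
```

After extending the vocabulary this way, the model's token-embedding table must be grown to the new vocabulary size (with the added rows initialized, e.g., randomly or from mean embeddings) before continued pretraining.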