SinLlama - A Large Language Model for Sinhala

📅 2025-08-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the severe underrepresentation of the low-resource language Sinhala in mainstream open-source large language models (LLMs), this work introduces the first decoder-only, fully open-source LLM specifically designed for Sinhala. Methodologically, the authors extend Llama-3-8B by augmenting its tokenizer with native Sinhala vocabulary, perform continued pretraining on 10 million high-quality, curated Sinhala texts, and then apply instruction fine-tuning. The key contributions are: (1) the first end-to-end, fully open-source Sinhala-optimized LLM; and (2) substantial performance gains on downstream tasks, outperforming both the base Llama-3-8B and its instruction-tuned variant across three standard text classification benchmarks. This work establishes a reproducible technical pipeline and a critical baseline for developing LLMs for low-resource languages.

📝 Abstract
Low-resource languages such as Sinhala are often overlooked by open-source Large Language Models (LLMs). In this research, we extend an existing multilingual LLM (Llama-3-8B) to better serve Sinhala. We enhance the tokenizer with Sinhala-specific vocabulary and perform continual pre-training on a cleaned corpus of 10 million Sinhala texts, resulting in the SinLlama model. This is the first decoder-only, open-source LLM with explicit Sinhala support. When SinLlama was instruction fine-tuned for three text classification tasks, it outperformed the base and instruct variants of Llama-3-8B by a significant margin.
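The abstract mentions instruction fine-tuning for text classification. As a minimal sketch of what one such training record could look like, the helper below formats a labeled Sinhala example as an instruction/response pair. The prompt wording, field names, and labels are illustrative assumptions, not the paper's actual template.

```python
def build_instruction_example(text, label, labels):
    """Format one classification record as an instruction/response pair.

    A hypothetical template for instruction fine-tuning on a text
    classification task; the exact prompt used by SinLlama may differ.
    """
    instruction = (
        "Classify the following Sinhala text into one of: "
        + ", ".join(labels)
        + "."
    )
    return {
        "instruction": instruction,  # task description with label set
        "input": text,               # the Sinhala text to classify
        "output": label,             # gold label as the target response
    }
```

For example, `build_instruction_example("...", "news", ["news", "sports"])` yields a record whose `output` is `"news"`; a collection of such records is what a standard instruction-tuning loop would consume.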
Problem

Research questions and friction points this paper is trying to address.

Addressing lack of Sinhala support in open-source LLMs
Enhancing multilingual LLM for low-resource Sinhala language
Improving Sinhala text classification via instruction fine-tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extends Llama-3-8B for Sinhala support
Enhances tokenizer with Sinhala vocabulary
Continual pre-training on 10M Sinhala corpus
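The tokenizer-augmentation step above can be sketched in pure Python: only tokens absent from the base vocabulary receive fresh ids appended at the end, so existing ids (and their embedding rows) stay untouched. This is a simplified illustration, not the paper's implementation; with Hugging Face `transformers`, the analogous calls would be `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`.

```python
def extend_vocab(base_vocab, new_tokens):
    """Extend a token->id vocabulary with new tokens.

    Returns the extended vocabulary and the count of genuinely new
    tokens. Duplicates keep their original ids, mirroring how a
    tokenizer is augmented before continued pretraining.
    """
    vocab = dict(base_vocab)
    added = 0
    for tok in new_tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)  # new id = next free index
            added += 1
    return vocab, added
```

After extension, the model's embedding matrix must grow by `added` rows (randomly initialized), which the continued pretraining on the Sinhala corpus then learns to fill in.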