Towards Open Foundation Language Model and Corpus for Macedonian: A Low-Resource Language

📅 2025-06-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the critical bottlenecks facing low-resource Macedonian, namely the scarcity of high-quality corpora, instruction-tuning data, and localized evaluation frameworks, this work introduces *domestic-yak*, the first open-source foundational large language model (8B parameters) for Macedonian. Methodologically, the authors perform pretraining and supervised fine-tuning on rigorously cleaned, deduplicated, and quality-filtered data, validated through native-speaker review and multi-dimensional evaluation. The contributions are threefold: (1) the largest publicly available Macedonian corpus to date (40 GB, 3.5B words); (2) a culturally adapted instruction dataset comprising 106K samples; and (3) the first Macedonian-specific benchmark suite covering seven diverse NLP tasks. Empirical results show that *domestic-yak* outperforms all same-scale (8B-range) baselines across the benchmarks and reaches performance comparable to models up to ten times larger, while native speakers rate it higher than larger counterparts for grammatical correctness and cultural appropriateness. All data, code, and model weights are fully open-sourced.
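The summary mentions cleaning, deduplication, and quality filtering of the pretraining corpus, but the page does not show the actual pipeline. The snippet below is only a minimal illustrative sketch of what such a step could look like (exact-duplicate removal via hashing plus simple heuristics such as a minimum length and a Cyrillic-character ratio); all function names and thresholds are assumptions, not the authors' implementation.

```python
import hashlib
import re

def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so near-identical copies hash the same."""
    return re.sub(r"\s+", " ", text).strip().lower()

def looks_clean(text: str, min_words: int = 50, min_cyrillic_ratio: float = 0.6) -> bool:
    """Heuristic quality filter: enough words and mostly Macedonian (Cyrillic) letters.
    Thresholds here are illustrative, not the values used in the paper."""
    words = text.split()
    if len(words) < min_words:
        return False
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return False
    cyrillic = sum(1 for c in letters if "\u0400" <= c <= "\u04FF")
    return cyrillic / len(letters) >= min_cyrillic_ratio

def dedup_and_filter(docs):
    """Yield documents that pass the quality filter, dropping exact duplicates."""
    seen = set()
    for doc in docs:
        if not looks_clean(doc):
            continue
        key = hashlib.sha1(normalize(doc).encode("utf-8")).hexdigest()
        if key in seen:
            continue
        seen.add(key)
        yield doc

# Toy usage: the second document is an exact duplicate, the third is too short/noisy.
corpus = [
    "Скопје е главниот град на Северна Македонија. " * 20,
    "Скопје е главниот град на Северна Македонија. " * 20,
    "short noisy line",
]
print(len(list(dedup_and_filter(corpus))))  # -> 1
```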

📝 Abstract
The increase in technological adoption worldwide comes with demands for novel tools to be used by the general population. Large Language Models (LLMs) provide a great opportunity in this respect, but their capabilities remain limited for low-resource languages, restricting applications in countries where such languages are spoken. We create several resources to facilitate the adoption of LLMs and to support research advancements for Macedonian. We collect the largest Macedonian corpus to date, consisting of 40GB of textual data and totaling 3.5B words. To support conversational applications, we collect a 106k-instance instruction dataset, carefully built to be culturally grounded. For evaluation, we construct a Macedonian evaluation suite covering seven benchmarks. Finally, we train domestic-yak, a state-of-the-art 8B-parameter model, on our curated datasets and evaluate it against eight baseline models using the newly constructed benchmark suite. Our model outperforms all existing models in the 8B parameter range across all benchmarks, and achieves performance comparable to models up to 10x larger. Furthermore, a qualitative analysis with native speakers reveals that our model is preferred over larger counterparts, receiving higher ratings for grammatical correctness and cultural appropriateness. All datasets, code, and model weights are openly released, setting a foundation for advancing LLMs in similarly underrepresented languages. These resources are publicly available at github.com/LVSTCK for source code, and at huggingface.co/LVSTCK for pretrained model weights and data.
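Since the abstract points to huggingface.co/LVSTCK for the released weights, the model should be loadable with the standard Hugging Face Transformers API roughly as sketched below. The exact repository id used here is an assumption (check the actual listings on the hub), and the example presumes an instruction-tuned checkpoint with a chat template.

```python
# Minimal sketch of loading the released weights; the repo id is hypothetical.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LVSTCK/domestic-yak-8B-instruct"  # assumed name, verify on huggingface.co/LVSTCK
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Chat-style prompt in Macedonian ("What is the capital of North Macedonia?").
messages = [{"role": "user", "content": "Кој е главниот град на Северна Македонија?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```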
Problem

Research questions and friction points this paper is trying to address.

Developing LLM tools for the low-resource Macedonian language
Creating culturally grounded datasets for conversational applications
Training and evaluating an 8B-parameter model that outperforms same-scale baselines
Innovation

Methods, ideas, or system contributions that make the work stand out.

Collected the largest Macedonian corpus to date (40GB, 3.5B words)
Built a culturally grounded 106k-instance instruction dataset (illustrative record sketch below)
Constructed a Macedonian evaluation suite covering seven benchmarks
Trained domestic-yak, an 8B-parameter model for Macedonian
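The page does not show the record format of the instruction dataset mentioned above; the following is only a hedged sketch of a common chat-style JSONL layout such a culturally grounded sample might use. The content, field names, and the "source" metadata field are illustrative, not taken from the released data.

```python
import json

# Purely illustrative example of one instruction-tuning record; the actual schema
# of the released 106k-instance dataset may differ.
sample = {
    "messages": [
        {"role": "user", "content": "Објасни ми ја традицијата на Водици во Македонија."},
        {"role": "assistant", "content": "Водици (Богојавление) се празнува на 19 јануари ..."},
    ],
    "source": "culturally_grounded_subset",  # hypothetical metadata field
}
print(json.dumps(sample, ensure_ascii=False, indent=2))
```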
Stefan Krsteski
École Polytechnique Fédérale de Lausanne (EPFL), Switzerland
Matea Tashkovska
EPFL
Machine Learning, Natural Language Processing
Borjan Sazdov
Faculty of Electrical Engineering and Information Technologies, UKIM, North Macedonia; Emteq Ltd., Brighton, United Kingdom
Hristijan Gjoreski
Ss. Cyril and Methodius University in Skopje (UKIM), North Macedonia; Emteq Labs, United Kingdom
Artificial Intelligence, Machine Learning, Ambient Intelligence, Wearable Computing, Activity Recognition
Branislav Gerazov
Faculty of Electrical Engineering and Information Technologies, UKIM, North Macedonia