🤖 AI Summary
Existing Greek-language large language models (LLMs) exhibit significant limitations in natural language understanding (NLU), natural language generation (NLG), and code generation, particularly for polytonic Greek, Ancient Greek, and English-Greek bilingual contexts. To address these gaps, we introduce Llama-Krikri-8B, an open-source, Greek-optimized LLM built on Llama 3.1-8B that supports Modern Greek and English and can also handle polytonic text and Ancient Greek. The chat version is produced by a multi-stage post-training pipeline that combines human and synthetic instruction and preference data, applying techniques such as MAGPIE. We also propose three novel public evaluation benchmarks for Greek. Evaluation on existing and proposed benchmarks shows notable improvements over comparable Greek and multilingual LLMs in NLU, NLG, and code generation.
📝 Abstract
We introduce Llama-Krikri-8B, a cutting-edge Large Language Model tailored for the Greek language, built on Meta's Llama 3.1-8B. Llama-Krikri-8B has been extensively trained on high-quality Greek data to ensure superior adaptation to linguistic nuances. With 8 billion parameters, it offers advanced capabilities while maintaining efficient computational performance. Llama-Krikri-8B supports both Modern Greek and English, and is also equipped to handle polytonic text and Ancient Greek. The chat version of Llama-Krikri-8B is produced by a multi-stage post-training pipeline that uses both human and synthetic instruction and preference data and applies techniques such as MAGPIE. In addition, for evaluation, we propose three novel public benchmarks for Greek. Our evaluation on existing as well as the proposed benchmarks shows notable improvements over comparable Greek and multilingual LLMs in natural language understanding, natural language generation, and code generation.