🤖 AI Summary
This work addresses the limited performance of large language models on low-resource languages such as Modern Greek and the challenges associated with deploying large reasoning models. To overcome these issues, the authors propose a hybrid approach combining knowledge distillation and fine-tuning, leveraging a large reasoning model to generate high-quality Greek question-answering data for training the efficient, open-source Maistros 8B model. Key contributions include CulturaQA—the first high-quality Greek question-answering dataset—alongside a memory-efficient multilingual evaluation framework and a successful methodology for transferring reasoning capabilities to a lightweight model. Maistros 8B achieves state-of-the-art performance across nine Greek QA benchmarks, and all models, datasets, and frameworks have been publicly released.
📝 Abstract
Large Language Models (LLMs) have substantially advanced the field of Natural Language Processing (NLP), achieving state-of-the-art performance across a wide range of tasks. These improvements have been attributed, in part, to their emerging reasoning capabilities, which are enabled by large-scale training and increased model capacity. However, existing LLMs can generate erroneous responses when addressing complex queries that fall outside their training distribution, due to limited internal knowledge or the need for multi-step reasoning. To address these limitations, recent work has introduced large reasoning models (LRMs), which incorporate explicit internal reasoning processes to improve response accuracy. Additionally, state-of-the-art LRMs often comprise hundreds of billions of parameters and require several seconds per inference, even on advanced multi-GPU systems. These characteristics limit their practicality for deployment in conventional computing environments. Meanwhile, NLP research on multilingual LLMs continues to prioritize high-resource languages. However, these models exhibit limited performance in under-resourced languages, primarily due to insufficient language- and culture-specific training data. In this paper, we focus on Modern Greek, for which only a limited number of question answering (QA) datasets have been proposed, most of which are intended for model evaluation. To address this research gap in Greek QA, we make the following contributions: (i) CulturaQA, a high-quality LRM-generated and human-curated dataset, for Greek LLM training and evaluation; (ii) a memory-efficient LLM evaluation framework adaptable to diverse languages and QA tasks; (iii) Maistros 8B, a state-of-the-art open-weights Greek LLM developed via knowledge distillation and fine-tuning on CulturaQA; and (iv) a comprehensive evaluation of nine LLMs across nine human-curated Greek QA datasets.