LuxIT: A Luxembourgish Instruction Tuning Dataset from Monolingual Seed Data

📅 2025-10-28
🤖 AI Summary
To address the scarcity of high-quality instruction data for low-resource languages, this paper proposes a reusable monolingual dataset-construction paradigm: native Luxembourgish texts serve as seed material, DeepSeek-R1-0528 generates instruction-response pairs, and multi-round LLM-as-a-judge filtering enforces quality. The result is LuxIT, the first open-source monolingual instruction-tuning dataset for Luxembourgish. Fine-tuning several smaller LLMs on LuxIT and benchmarking them against their base models on Luxembourgish proficiency examinations yields mixed results, with performance varying substantially across models, which underscores the need for model-specific adaptation strategies. The work nonetheless provides a replicable methodological framework and scalable data infrastructure for instruction tuning in low-resource languages.

📝 Abstract
The effectiveness of instruction-tuned Large Language Models (LLMs) is often limited in low-resource linguistic settings due to a lack of high-quality training data. We introduce LuxIT, a novel monolingual instruction tuning dataset for Luxembourgish developed to mitigate this challenge. We synthesize the dataset from a corpus of native Luxembourgish texts, utilizing DeepSeek-R1-0528, chosen for its demonstrated proficiency in Luxembourgish. Following generation, we apply a quality assurance process employing an LLM-as-a-judge approach. To investigate the practical utility of the dataset, we fine-tune several smaller-scale LLMs on LuxIT. Subsequent benchmarking against their base models on Luxembourgish language proficiency examinations, however, yields mixed results, with performance varying significantly across different models. LuxIT represents a critical contribution to Luxembourgish natural language processing and offers a replicable monolingual methodology, though our findings highlight the need for further research to optimize its application.
Problem

Research questions and friction points this paper is trying to address.

Building a high-quality Luxembourgish instruction dataset in a low-resource linguistic setting
Synthesizing monolingual training data with the DeepSeek-R1-0528 model
Evaluating fine-tuned models against their base models on Luxembourgish proficiency examinations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Synthesizes a Luxembourgish instruction dataset from monolingual seed texts
Uses the DeepSeek-R1-0528 model for instruction-response generation
Applies a multi-round LLM-as-a-judge quality assurance process
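The generate-then-filter pipeline described above can be sketched in a few lines. This is a minimal illustration, not the paper's actual implementation: the `teacher` and `judge` callables stand in for DeepSeek-R1-0528 and the LLM-as-a-judge (here replaced by toy stubs so the sketch runs offline), and the score threshold and round count are assumed values.

```python
from dataclasses import dataclass

@dataclass
class Pair:
    instruction: str
    response: str

def generate_pairs(seed_texts, teacher):
    # One instruction-response pair per native seed passage.
    return [Pair(*teacher(text)) for text in seed_texts]

def filter_rounds(pairs, judge, threshold=4, rounds=2):
    # Multi-round LLM-as-a-judge filtering: keep only pairs that
    # score at or above the threshold in every round.
    kept = pairs
    for _ in range(rounds):
        kept = [p for p in kept if judge(p) >= threshold]
    return kept

# Toy stand-ins: the teacher echoes the seed as a summarization task,
# the judge assigns a 1-5 score based on response length.
seeds = ["Lëtzebuerg ass e Land an Europa.", "De Kaffi ass gutt."]
teacher = lambda text: (f"Resumméiert dëse Saz: {text}", text)
judge = lambda pair: 5 if len(pair.response) > 20 else 3

data = generate_pairs(seeds, teacher)
clean = filter_rounds(data, judge)
print(len(data), len(clean))  # → 2 1
```

In the real pipeline both callables would wrap model API calls, and the judge would score criteria such as fluency, faithfulness to the seed text, and instruction quality rather than length.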