SmolKalam: Ensemble Quality-Filtered Translation at Scale for High Quality Arabic Post-Training Data

📅 2025-11-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
High-quality, multi-turn Arabic dialogue data—particularly for reasoning and tool-use instruction tuning—is critically scarce; existing translation approaches fail to meet stringent quality requirements. To address this, we propose a multi-model ensemble translation pipeline integrating large language models (LLMs) with domain-specialized machine translation systems, coupled with a novel quality filtering mechanism explicitly designed to preserve dialogue coherence and logical consistency across turns. Through systematic ablation studies, we rigorously evaluate the impact of distinct translation strategies on decoder-only model performance. The resulting Arabic dialogue dataset, SmolKalam (derived from SmolTalk2), substantially improves translation accuracy, cross-turn consistency, and task adaptability. It represents the first high-quality Arabic dialogue resource supporting complex reasoning and tool interaction, establishing a reproducible, scalable data paradigm for post-training large language models in low-resource languages.

📝 Abstract
Although the community has tackled the acquisition of high-quality Arabic pretraining data, we still lack large-scale, multi-turn Arabic datasets that include reasoning and tool calling. Naive translation can work at pretraining scale, but post-training demands much higher quality, which requires a stricter approach to dataset curation. In this work, we introduce SmolKalam, a translation of SmolTalk2 that uses a multi-model ensemble translation pipeline, applies quality filtering, and examines effective translation techniques for traditional decoder-only models through ablations.
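The paper does not publish its pipeline in this listing, but the described approach (translate each example with several systems, score the candidates, and keep only those that clear a quality bar) can be sketched as follows. All names here (`translate_fns`, `quality_score`, the threshold value) are illustrative assumptions, not the authors' implementation:

```python
# Hypothetical sketch of one ensemble-translation + quality-filtering step.
# `translate_fns` stands in for the LLM and MT systems in the ensemble;
# `quality_score` stands in for the paper's (unspecified) filtering metric.

def ensemble_translate(source, translate_fns, quality_score, threshold=0.8):
    """Translate `source` with every system in the ensemble, pick the
    highest-scoring candidate, and drop the example entirely if even
    the best candidate falls below the quality threshold."""
    candidates = [fn(source) for fn in translate_fns]
    scored = [(quality_score(source, cand), cand) for cand in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best if best_score >= threshold else None

# Toy usage with stub "translators" and a stub scorer:
upper = lambda s: s.upper()
ident = lambda s: s
score = lambda src, cand: 1.0 if cand != src else 0.5
print(ensemble_translate("hello", [upper, ident], score))  # -> HELLO
```

Discarding examples (returning `None`) rather than keeping a low-quality best candidate is the key difference from naive translation: for post-training data, coverage is traded for quality.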
Problem

Research questions and friction points this paper is trying to address.

Lack of large-scale Arabic datasets with reasoning and tool calling
Need higher quality translation methods for post-training data
Developing filtered ensemble translation for Arabic language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-model ensemble translation pipeline
Quality filtering for dataset curation
Examining effective decoder-only model techniques
Sultan Alrashed
King Abdullah University of Science and Technology (KAUST)
Chadi Helwe
King Abdullah University of Science and Technology (KAUST)
Francesco Orabona
Associate Professor, KAUST
Online Learning · Machine Learning · Optimization · Learning Theory