🤖 AI Summary
High-quality, multi-turn Arabic dialogue data, particularly for reasoning and tool-use instruction tuning, is scarce, and existing translation approaches fall short of the quality that post-training demands. To address this, we propose a multi-model ensemble translation pipeline that combines large language models (LLMs) with domain-specialized machine translation systems, coupled with a quality filtering mechanism designed to preserve dialogue coherence and logical consistency across turns. Through systematic ablation studies, we evaluate how different translation strategies affect decoder-only model performance. The resulting Arabic dialogue dataset, SmolKalam (derived from SmolTalk2), improves translation accuracy, cross-turn consistency, and task adaptability over naive translation baselines. To our knowledge, it is the first large-scale, high-quality Arabic dialogue resource covering complex reasoning and tool calling, and it establishes a reproducible, scalable data curation recipe for post-training LLMs in low-resource languages.
📝 Abstract
Although the community has tackled the acquisition of high-quality Arabic pretraining data, we still lack large-scale, multi-turn Arabic datasets that include reasoning and tool calling. Naive translation can work at the pretraining scale, but post-training demands much higher quality, which requires a stricter approach to dataset curation. In this work, we introduce SmolKalam, a translation of SmolTalk2 that uses a multi-model ensemble translation pipeline, applies quality filtering, and, through ablations, examines which translation techniques are effective for traditional decoder-only models.
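The abstract only names the pipeline's stages, so the following is a minimal Python sketch of how an ensemble translate-then-filter loop of this kind could be organized: each turn is translated by several backends, the best-scoring candidate is kept, and any dialogue with a low-quality turn is rejected outright. Every name here (the backends, the `quality_score` stub, the threshold) is a hypothetical stand-in for illustration, not the actual SmolKalam implementation.

```python
# Illustrative sketch of ensemble translation with quality filtering.
# All components are hypothetical stand-ins, not the paper's pipeline.
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str     # candidate Arabic translation of one turn
    source: str   # which backend (LLM or MT system) produced it


def translate_with_backends(turn: str, backends) -> list[Candidate]:
    """Collect one candidate translation per backend."""
    return [Candidate(text=fn(turn), source=name) for name, fn in backends]


def quality_score(src: str, hyp: str, history: list[str]) -> float:
    """Placeholder quality-estimation score that also sees the dialogue
    history, so cross-turn consistency could be rewarded. A real system
    might use a reference-free QE model; here we only penalize empty
    output as a trivial stand-in."""
    return 0.0 if not hyp.strip() else 1.0


def translate_dialogue(turns, backends, threshold=0.8):
    """Translate each turn with every backend, keep the best-scoring
    candidate, and drop the whole dialogue if any turn falls below the
    threshold, so a single bad turn cannot break coherence."""
    history, out = [], []
    for turn in turns:
        candidates = translate_with_backends(turn, backends)
        best = max(candidates,
                   key=lambda c: quality_score(turn, c.text, history))
        if quality_score(turn, best.text, history) < threshold:
            return None  # reject the dialogue rather than keep a bad turn
        out.append(best.text)
        history.append(best.text)
    return out


# Toy usage with dummy "backends" that just tag their input:
backends = [("llm", lambda s: f"[ar-llm] {s}"),
            ("mt", lambda s: f"[ar-mt] {s}")]
print(translate_dialogue(["Hello!", "Call the weather tool."], backends))
```

Filtering at the dialogue level rather than the turn level reflects the coherence goal stated above: replacing or dropping a single turn would leave the surrounding conversation logically inconsistent.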