🤖 AI Summary
High-quality, multi-turn Arabic dialogue data, particularly for reasoning and tool-use instruction tuning, is scarce, and existing translation approaches fall short of the quality that post-training demands. To address this, we propose a multi-model ensemble translation pipeline that combines large language models (LLMs) with domain-specialized machine translation systems, coupled with a quality filtering mechanism designed to preserve dialogue coherence and logical consistency across turns. Through systematic ablation studies, we evaluate how different translation strategies affect decoder-only model performance. The resulting Arabic dialogue dataset, SmolKalam (derived from SmolTalk2), improves translation accuracy, cross-turn consistency, and task adaptability over naive translation baselines. To our knowledge, it is the first large-scale, high-quality Arabic dialogue resource covering complex reasoning and tool calling, and it establishes a reproducible, scalable data curation recipe for post-training LLMs in low-resource languages.
📝 Abstract
Although the community has tackled the acquisition of high-quality Arabic pretraining data, we still lack large-scale, multi-turn Arabic datasets that include reasoning and tool calling. Naive translation can work at the pretraining scale, but post-training demands much higher quality, which requires a stricter approach to dataset curation. In this work, we introduce SmolKalam, a translation of SmolTalk2 that uses a multi-model ensemble translation pipeline, applies quality filtering, and, through ablations, examines which translation techniques are effective for traditional decoder-only models.
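The abstract only names the pipeline's stages, so the following is a minimal Python sketch of how an ensemble translate-then-filter loop of this kind could be organized: each turn is translated by several backends, the best-scoring candidate is kept, and any dialogue with a low-quality turn is rejected outright. Every name here (the backends, the `quality_score` stub, the threshold) is a hypothetical stand-in for illustration, not the actual SmolKalam implementation.

```python
# Illustrative sketch of ensemble translation with quality filtering.
# All components are hypothetical stand-ins, not the paper's pipeline.
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str     # candidate Arabic translation of one turn
    source: str   # which backend (LLM or MT system) produced it


def translate_with_backends(turn: str, backends) -> list[Candidate]:
    """Collect one candidate translation per backend."""
    return [Candidate(text=fn(turn), source=name) for name, fn in backends]


def quality_score(src: str, hyp: str, history: list[str]) -> float:
    """Placeholder quality-estimation score that also sees the dialogue
    history, so cross-turn consistency could be rewarded. A real system
    might use a reference-free QE model; here we only penalize empty
    output as a trivial stand-in."""
    return 0.0 if not hyp.strip() else 1.0


def translate_dialogue(turns, backends, threshold=0.8):
    """Translate each turn with every backend, keep the best-scoring
    candidate, and drop the whole dialogue if any turn falls below the
    threshold, so a single bad turn cannot break coherence."""
    history, out = [], []
    for turn in turns:
        candidates = translate_with_backends(turn, backends)
        best = max(candidates,
                   key=lambda c: quality_score(turn, c.text, history))
        if quality_score(turn, best.text, history) < threshold:
            return None  # reject the dialogue rather than keep a bad turn
        out.append(best.text)
        history.append(best.text)
    return out


# Toy usage with dummy "backends" that just tag their input:
backends = [("llm", lambda s: f"[ar-llm] {s}"),
            ("mt", lambda s: f"[ar-mt] {s}")]
print(translate_dialogue(["Hello!", "Call the weather tool."], backends))
```

Filtering at the dialogue level rather than the turn level reflects the coherence goal stated above: replacing or dropping a single turn would leave the surrounding conversation logically inconsistent.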