🤖 AI Summary
Current open-source LLM post-training datasets suffer from insufficient transparency and a lack of systematic evaluation, obscuring the relationship between data quality and downstream performance. To address this, we conduct the first comparable, quantitative analysis of two major open-source SFT datasets—Tulu-3-SFT-Mix and SmolTalk—using the Magpie framework for fine-grained quality annotation. We characterize each dataset along multiple dimensions: turn structure, task type, and input/response quality. Building on these insights, we propose a lightweight, quality-driven data distillation method to construct TuluTalk, an efficient hybrid dataset that retains only 86% as many samples as either source. Despite its reduced size, TuluTalk matches or exceeds the performance of both source datasets on key benchmarks. All annotated data and the TuluTalk dataset are publicly released to support reproducible research and community advancement.
📝 Abstract
Recent work on large language models (LLMs) has increasingly focused on post-training and alignment, using datasets curated to enhance instruction following, world knowledge, and specialized skills. However, most post-training datasets used in leading open- and closed-source LLMs remain inaccessible to the public, with limited information about their construction. This lack of transparency has motivated the recent development of open-source post-training corpora. While training on these open alternatives can yield performance comparable to that of leading models, systematic comparisons remain rare because conducting them rigorously at scale is computationally expensive. As a result, it remains unclear how specific samples, task types, or curation strategies influence downstream performance. In this work, we conduct the first comprehensive side-by-side analysis of two prominent open post-training datasets: Tulu-3-SFT-Mix and SmolTalk. Using the Magpie framework, we annotate each sample with detailed quality metrics, including turn structure (single-turn vs. multi-turn), task category, input quality, and response quality, and we derive statistics that reveal structural and qualitative similarities and differences between the two datasets. Based on these insights, we design a principled curation recipe that produces a new data mixture, TuluTalk, which contains 14% fewer samples than either source dataset while matching or exceeding their performance on key benchmarks. Our findings offer actionable insights for constructing more effective post-training datasets that improve model performance within practical resource limits. To support future research, we publicly release both the annotated source datasets and our curated TuluTalk mixture.
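To make the quality-driven curation idea concrete, the following is a minimal sketch of filtering annotated samples by quality thresholds. It assumes Magpie-style categorical labels; the field names (`input_quality`, `response_quality`, `task_category`) and the chosen thresholds are illustrative assumptions, not the paper's actual recipe or schema.

```python
# Hypothetical sketch: select samples whose annotated input and response
# quality meet minimum thresholds. Labels follow Magpie-style ordinal
# categories; the schema and cutoffs here are assumptions for illustration.

QUALITY_ORDER = {"very poor": 0, "poor": 1, "average": 2, "good": 3, "excellent": 4}

def keep_sample(sample, min_input="average", min_response="good"):
    """Return True if both quality labels meet the given thresholds."""
    return (
        QUALITY_ORDER[sample["input_quality"]] >= QUALITY_ORDER[min_input]
        and QUALITY_ORDER[sample["response_quality"]] >= QUALITY_ORDER[min_response]
    )

# Toy annotated samples (invented for the example).
annotated = [
    {"input_quality": "good", "response_quality": "excellent", "task_category": "coding"},
    {"input_quality": "poor", "response_quality": "good", "task_category": "math"},
    {"input_quality": "average", "response_quality": "average", "task_category": "chat"},
]

mixture = [s for s in annotated if keep_sample(s)]
```

In this toy run only the first sample survives, since the second fails the input-quality threshold and the third fails the response-quality threshold; a real recipe would likely also balance task categories and turn structure across the retained pool.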