🤖 AI Summary
This work addresses the inefficiency of current approaches to evaluating the utility of reasoning training data, which typically rely on costly trial-and-error fine-tuning because no effective method exists for screening data before training. The authors propose a suite of intrinsic data metrics that predict, before any fine-tuning, how a reasoning dataset will affect downstream performance. Systematic experiments on 8B and 11B models, fine-tuned on semantically distinct variants of a Polish reasoning dataset, reveal that data utility depends strongly on model size: for the smaller model, alignment-focused metrics are the better predictors, whereas the larger model gains from highly redundant, verbose reasoning traces. Building on these insights, the study introduces a scale-aware validation framework for reasoning data. The proposed metrics show strong and statistically significant correlations with downstream performance, enabling efficient selection of high-quality reasoning data without extensive fine-tuning.
📝 Abstract
Validating training data for reasoning models typically requires expensive trial-and-error fine-tuning cycles. In this work, we investigate whether the utility of a reasoning dataset can be reliably predicted prior to training using intrinsic data metrics. We propose a suite of quantitative measures and evaluate their predictive power by fine-tuning 8B and 11B models on semantically distinct variants of a Polish reasoning dataset. Our analysis reveals that these intrinsic metrics demonstrate strong and significant correlations with downstream model performance. Crucially, we find that the predictors of utility are scale-dependent: smaller models rely on alignment-focused metrics to ensure precision, whereas larger models benefit from high redundancy, utilizing verbose traces to solve complex tasks. These findings establish a scale-aware framework for validating reasoning data, enabling practitioners to select effective training sets without the need for exhaustive empirical testing.
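To make the idea of "predictive power" concrete, the sketch below computes a rank correlation between one intrinsic metric and downstream scores across dataset variants. It is only an illustration: the abstract does not say which correlation statistic the authors use, so Spearman's rank correlation is an assumption here, and the metric name and all numbers are hypothetical placeholders rather than results from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Hypothetical placeholders: one value per dataset variant.
    redundancy_metric = [0.12, 0.35, 0.41, 0.58, 0.77]  # assumed intrinsic metric
    benchmark_score = [41.0, 44.5, 43.9, 47.2, 49.8]    # assumed downstream score
    print(f"Spearman rho = {spearman(redundancy_metric, benchmark_score):.3f}")
```

In a scale-aware setting, the same computation would be repeated separately for each model size, since the abstract reports that the metrics that predict utility differ between the 8B and 11B models. A significance test for the correlation is deliberately left out of this sketch.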