More Data or Better Data? A Critical Analysis of Data Selection and Synthesis for Mathematical Reasoning

📅 2025-10-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the trade-off between data quality and quantity in mathematical reasoning tasks. Motivated by the observation that the reasoning performance of large language models depends heavily on data quality, yet this dependence lacks systematic empirical evaluation, we propose a unified training–evaluation pipeline to comparatively assess mainstream open-source datasets and diverse data synthesis methods, including strong-model distillation and interpretability-aware structural enhancement. Experimental results demonstrate that high-quality, structurally enriched synthetic data substantially outperforms naive scale-up: using only one-third of the data volume, it surpasses the full-scale baseline. Moreover, the distillation-plus-structured-annotation strategy yields consistent accuracy gains of 4.2–7.8 percentage points across major mathematical benchmarks (e.g., MATH, AMC). These findings establish a reproducible, cost-effective, and high-yield paradigm for industrial-grade dataset construction in mathematical reasoning.

📝 Abstract
The reasoning capabilities of Large Language Models (LLMs) play a critical role in many downstream tasks, yet depend strongly on the quality of training data. Despite various proposed data construction methods, their practical utility in real-world pipelines remains underexplored. In this work, we conduct a comprehensive analysis of open-source datasets and data synthesis techniques for mathematical reasoning, evaluating them under a unified pipeline designed to mirror training and deployment scenarios. We further distill effective data selection strategies and identify practical methods suitable for industrial applications. Our findings highlight that structuring data in more interpretable formats or distilling from stronger models often outweighs simply scaling up data volume. This study provides actionable guidance for integrating training data to enhance LLM capabilities, supporting both cost-effective data curation and scalable model enhancement. We hope this work will inspire further research on how to balance "more data" versus "better data" for real-world reasoning tasks.
Problem

Research questions and friction points this paper is trying to address.

Analyzing data selection and synthesis methods for mathematical reasoning in LLMs
Evaluating data quality versus quantity impact on model performance
Developing practical data strategies for industrial LLM applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Data selection strategies outperform volume scaling
Interpretable data structuring enhances reasoning capabilities
Model distillation methods improve industrial applicability
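The selection idea behind these bullets can be sketched in a few lines: rank a candidate pool by a structural-quality heuristic and keep only a small top fraction, mirroring the paper's finding that one-third of the data, well chosen, can beat the full-scale baseline. This is an illustrative sketch, not the authors' method; `Sample`, `structure_score`, and `select_top_fraction` are hypothetical names, and the heuristic (counting explicit "Step" lines and a marked final answer) is an assumed stand-in for whatever quality signal a real pipeline would use.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    question: str
    solution: str

def structure_score(sample: Sample) -> float:
    """Toy quality heuristic (assumption, not the paper's metric):
    reward solutions with explicit numbered reasoning steps and a
    marked final answer, as a proxy for interpretable structure."""
    steps = sum(1 for line in sample.solution.splitlines()
                if line.strip().startswith("Step"))
    has_answer = "Answer:" in sample.solution
    return steps + (2.0 if has_answer else 0.0)

def select_top_fraction(samples, fraction=1/3):
    """Keep only the highest-scoring fraction of the pool,
    i.e. prefer quality over raw volume."""
    ranked = sorted(samples, key=structure_score, reverse=True)
    k = max(1, int(len(samples) * fraction))
    return ranked[:k]
```

A real pipeline would replace the heuristic with a stronger-model judge or distillation signal, but the selection skeleton stays the same: score, rank, truncate.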
Yike Zhao
East China Normal University

Simin Guo
University of Chicago

Ziqing Yang
Independent Researcher

Shifan Han
Independent Researcher

Dahua Lin
The Chinese University of Hong Kong
computer vision, machine learning, probabilistic inference, Bayesian nonparametrics

Fei Tan
Associate Professor, East China Normal University
NLP, data mining, network science