FLAMES: Improving LLM Math Reasoning via a Fine-Grained Analysis of the Data Synthesis Pipeline

📅 2025-08-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
The impact of data synthesis strategies on the mathematical reasoning capabilities of large language models remains poorly understood. Method: We propose FLAMES, a systematic evaluation framework that quantitatively analyzes ten synthetic data generation strategies and key design factors, including filtering aggressiveness, difficulty distribution, and diversity control. We further introduce fine-grained filtering, multi-strategy fusion, and controllable difficulty generation, and conduct large-scale ablation studies on Qwen2.5-Math-7B. Contribution/Results: The FLAMES dataset achieves accuracy gains of +15.7, +4.5, +6.5, and +3.1 on OlympiadBench, CollegeMath, GSMPlus, and MATH, respectively. After fine-tuning, our model attains 81.4% on MATH, surpassing the larger Llama3 405B, GPT-4o, and Claude 3.5 Sonnet, and shows significantly improved cross-domain generalization and robustness. We find that preserving broad problem coverage is more effective than aggressively filtering for high-confidence solutions, and that strategies which increase problem complexity yield the best generalization.

📝 Abstract
Recent works improving LLM math reasoning with synthetic data have used unique setups, making comparison of data synthesis strategies impractical. This leaves many unanswered questions about the roles of different factors in the synthetic data pipeline, such as the impact of filtering low-quality problems. To address this gap, we introduce FLAMES, a Framework for LLM Assessment of Math rEasoning Data Synthesis, and perform a systematic study of 10 existing data synthesis strategies and multiple other factors impacting the performance of synthetic math reasoning data. Our FLAMES experiments provide several valuable insights about the optimal balance of difficulty and diversity of synthetic data. First, data agents designed to increase problem complexity lead to best improvements on most math metrics. Second, with a fixed data generation budget, keeping higher problem coverage is more important than keeping only problems with reliable solutions. Third, GSM8K- and MATH-based synthetic data can lead to improvements on competition-level benchmarks, showcasing easy-to-hard generalization. Leveraging insights from our FLAMES experiments, we design two novel data synthesis strategies for improving out-of-domain generalization and robustness. Further, we develop the FLAMES dataset, an effective blend of our novel and existing data synthesis strategies, outperforming public datasets on OlympiadBench (+15.7), CollegeMath (+4.5), GSMPlus (+6.5), and MATH (+3.1). Fine-tuning Qwen2.5-Math-7B on the FLAMES dataset achieves 81.4% on MATH, surpassing larger Llama3 405B, GPT-4o and Claude 3.5 Sonnet.
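The coverage-vs-reliability tradeoff highlighted in the abstract concerns how aggressively synthetic problems are filtered by solution confidence. A common proxy for "reliable solutions" (this is an illustrative sketch, not the paper's exact pipeline; the function and field names are hypothetical) is majority-vote agreement among k sampled solutions per problem, where raising the agreement threshold keeps fewer but more trustworthy problems:

```python
from collections import Counter

def filter_by_solution_agreement(problems, sample_answers, min_agreement=0.6):
    """Keep a synthetic problem only if its sampled solutions agree on an answer.

    problems: list of problem dicts; sample_answers: parallel list where each
    entry holds the final answers extracted from k sampled solutions.
    Raising min_agreement trades problem coverage for solution reliability.
    """
    kept = []
    for prob, answers in zip(problems, sample_answers):
        if not answers:
            continue
        top_answer, count = Counter(answers).most_common(1)[0]
        if count / len(answers) >= min_agreement:
            kept.append({**prob, "answer": top_answer})
    return kept

# Toy example: three synthetic problems with 4 sampled answers each.
data = [{"q": "p1"}, {"q": "p2"}, {"q": "p3"}]
samples = [["7", "7", "7", "9"],   # 75% agreement -> kept
           ["2", "3", "5", "8"],   # 25% agreement -> dropped
           ["1", "1", "4", "1"]]   # 75% agreement -> kept
print(len(filter_by_solution_agreement(data, samples)))  # prints 2
```

Under FLAMES' finding, a strict threshold here would discard the middle (likely harder) problem, and with a fixed generation budget that lost coverage can cost more than the unreliable label would.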
Problem

Research questions and friction points this paper is trying to address.

Analyzing the impact of data synthesis factors on math reasoning
Evaluating the role of problem difficulty and diversity in synthetic data
Improving out-of-domain generalization through optimized data synthesis strategies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Framework for systematic data synthesis analysis
Novel strategies enhancing out-of-domain generalization
Optimized data blend outperforming existing public datasets across math benchmarks