AgenticMath: Enhancing LLM Reasoning via Agentic-based Math Data Generation

πŸ“… 2025-10-22
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Existing mathematical reasoning datasets suffer from high answer error rates and low information density, hindering LLM performance. To address this, we propose the first agent-driven, four-stage mathematical data generation framework: (1) seed-based filtering for high-information-density problem selection, (2) multi-agent rephrasing to enhance diversity, (3) chain-of-thought (CoT)-guided high-quality answer generation, and (4) automated evaluation-driven sample optimization. This end-to-end pipeline integrates multi-agent collaboration, CoT reasoning, and dynamic evaluation, enabling effective supervised fine-tuning of 3B–8B models using only 30–60K samples. Experiments demonstrate consistent superiority over strong baselines trained on 400K–2.3M samples across both in-domain (e.g., MATH, AMC) and cross-domain (e.g., GSM8K, SVAMP) mathematical reasoning benchmarks, significantly improving data quality and generalization capability.
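The four stages above form a filter-rephrase-answer-evaluate pipeline. A minimal sketch of that control flow is below; every function name and scoring heuristic here is a hypothetical stand-in (the paper's actual stages are LLM-agent-based, not length heuristics):

```python
# Illustrative sketch of the four-stage AgenticMath pipeline.
# All names and heuristics are hypothetical placeholders for LLM agents.

def seed_filter(problems, min_len=40):
    """Stage 1: keep problems with high information density (proxy: length)."""
    return [p for p in problems if len(p["question"]) >= min_len]

def rephrase_agents(problem, n_agents=2):
    """Stage 2: multi-agent rephrasing for diversity (stubbed as variant tags)."""
    return [dict(problem, question=f"{problem['question']} (variant {i + 1})")
            for i in range(n_agents)]

def cot_answer(problem):
    """Stage 3: chain-of-thought answer generation (stubbed)."""
    return dict(problem, answer=f"Step-by-step solution for: {problem['question']}")

def evaluate(pairs, keep_top=0.5):
    """Stage 4: score Q-A pairs and retain the best fraction (proxy: answer length)."""
    ranked = sorted(pairs, key=lambda p: len(p["answer"]), reverse=True)
    return ranked[: max(1, int(len(ranked) * keep_top))]

def agenticmath_pipeline(seed_problems):
    filtered = seed_filter(seed_problems)
    rephrased = [v for p in filtered for v in rephrase_agents(p)]
    answered = [cot_answer(p) for p in rephrased]
    return evaluate(answered)
```

The point of the sketch is the composition: each stage consumes the previous stage's output, so low-quality seeds are discarded before any (expensive) rephrasing or answer generation is spent on them.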

πŸ“ Abstract
The creation of high-quality datasets to improve Large Language Model (LLM) reasoning remains a significant challenge, as current methods often generate low-quality or incorrect answers and extract limited information richness from available data sources. To address this, we propose AgenticMath, a novel agentic pipeline for generating high-quality mathematical question-answer pairs to enhance the supervised fine-tuning of LLMs. Our method operates through four stages: (1) a Seed Question Filter that selects questions with high information richness, complexity, and clarity; (2) an Agentic Question Rephrase step that employs a multi-agent system to generate diverse, logically consistent paraphrases; (3) an Answer Augment step that rewrites answers using chain-of-thought reasoning to enhance numerical and logical correctness, without reliance on human-provided labels; and (4) a final Question and Answer Evaluation that retains only the highest-quality pairs. Extensive experiments demonstrate that fine-tuning 3B-8B parameter LLMs on AgenticMath-generated datasets (comprising only 30-60K math samples) achieves competitive or superior performance on diverse in-domain and out-of-domain mathematical reasoning benchmarks compared to baselines trained on much more data (e.g., 400K or 2.3M samples). Our work demonstrates that targeted, high-quality data generation is a more efficient path to improving mathematical reasoning in LLMs than large-scale, low-quality alternatives.
Problem

Research questions and friction points this paper is trying to address.

Generating high-quality math datasets for LLM reasoning enhancement
Addressing low-quality answers and limited data richness in training
Improving mathematical reasoning efficiency through targeted data generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Agentic pipeline generates math question-answer pairs
Multi-agent system rephrases questions for diversity
Chain-of-thought reasoning enhances answer correctness
Xianyang Liu
King’s College London

Yilin Liu
Google
AI/ML, Wearable devices, Motion sensing, Healthcare AI

Shuai Wang
The Hong Kong University of Science and Technology (Guangzhou)

Hao Cheng
Hong Kong Baptist University

Andrew Estornell
ByteDance Research
Large Language Models, Multi-Agent Systems, Algorithmic Fairness

Yuzhi Zhao
Ph.D., City University of Hong Kong; B.Eng., Huazhong University of Science and Technology
Low-level Vision, Computational Photography, LLM, MLLM

Jiaheng Wei
The Hong Kong University of Science and Technology (Guangzhou)