ResearchMath-14K: Scaling Research-Level Mathematics via Agents

📅 2026-05-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses a limitation of current language models in advanced mathematical reasoning: the scarcity of large-scale, research-level problem datasets. The authors use a multi-agent pipeline to construct ResearchMath-14K, a dataset of 14,056 research-level mathematical problems curated from academic sources, and generate 220K reasoning trajectories (ResearchMath-Reasoning) from two open teacher models. These trajectories show recurring avoidance behaviors such as non-attempts and fabricated references; across eight open-weight models, newer generations produce 5.6× more references and 5.0× more fake references per trace. After agentic filtering, the trajectories are used to fine-tune Qwen3 models (4B to 30B parameters), which improve over their base models by 9.2 points on average. The authors describe ResearchMath-14K as the largest collection of research-level mathematical problems to date, and the results suggest that filtered open-problem attempts provide useful supervision even without fully correct reasoning traces.
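The two multipliers reported for newer models can be combined into one derived figure. The sketch below is plain arithmetic, not a result from the paper: it assumes both multipliers are per-trace averages over the same traces, so that the fabricated share of all references scales by their ratio. The function name `fake_share_change` is hypothetical.

```python
def fake_share_change(ref_multiplier: float, fake_multiplier: float) -> float:
    """Factor by which the fabricated share of references changes.

    Assumption: both multipliers are per-trace averages over the same
    traces, so share = (fake refs per trace) / (refs per trace).
    """
    if ref_multiplier <= 0:
        raise ValueError("ref_multiplier must be positive")
    return fake_multiplier / ref_multiplier


# Multipliers quoted in the summary: 5.6x references, 5.0x fake references.
change = fake_share_change(5.6, 5.0)
print(f"fabricated share changes by a factor of {change:.3f}")
```

Under that assumption the factor is below 1: the absolute number of fabricated references per trace rises fivefold, while their share among all references stays roughly the same or dips slightly.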

📝 Abstract

The frontier of mathematics is defined by problems whose solutions are not yet known, yet it remains unclear whether language models can meaningfully engage with such problems without human intervention. A major obstacle is the lack of large-scale research-level math datasets. To this end, we introduce ResearchMath-14k, a set of $14{,}056$ problems curated from academic sources via a multi-agent pipeline, making it the largest collection of research-level mathematical problems to date. We further generate ResearchMath-Reasoning, $220$K teacher trajectories from two open models, where we observe recurring avoidance behaviors such as non-attempts and fabricated references. Interestingly, across eight open-weight models, newer generations produce $5.6\times$ more references and $5.0\times$ more fake references per trace. After agentic filtering of ResearchMath-Reasoning, fine-tuning Qwen3 models from 4B to 30B parameters improves over base models by $9.2$ points on average. This shows that filtered open-problem attempts can provide useful supervision even without fully correct reasoning traces. We make ResearchMath-14k publicly available for future work on research-level mathematical reasoning.
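The abstract does not specify how the agentic filter works. As a rough illustration only, the sketch below drops trajectories that show the two avoidance behaviors named above, non-attempts and fabricated references. The `Trajectory` fields, the `keep` rule, and the `max_fabricated` threshold are all hypothetical; in a real pipeline the flags would presumably come from judge agents rather than being supplied by hand.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    problem_id: str
    text: str
    attempted: bool       # hypothetical verdict: did the trace try to solve the problem?
    fabricated_refs: int  # hypothetical count of references that could not be verified


def keep(t: Trajectory, max_fabricated: int = 0) -> bool:
    """Keep a trace only if it is a genuine attempt with few fabricated references."""
    return t.attempted and t.fabricated_refs <= max_fabricated


def filter_trajectories(trajs: list[Trajectory], max_fabricated: int = 0) -> list[Trajectory]:
    """Return the subset of traces that pass the illustrative filter."""
    return [t for t in trajs if keep(t, max_fabricated)]


# Toy data, invented for this example.
sample = [
    Trajectory("p1", "We bound the sum by ...", attempted=True, fabricated_refs=0),
    Trajectory("p2", "This problem is open; see the literature.", attempted=False, fabricated_refs=0),
    Trajectory("p3", "By a theorem in [12] ...", attempted=True, fabricated_refs=2),
]
kept = filter_trajectories(sample)
print([t.problem_id for t in kept])
```

Note that this filter checks behavior, not correctness, which matches the abstract's claim that the retained traces need not be fully correct to be useful as supervision.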

Problem

Research questions and friction points this paper is trying to address.

research-level mathematics

language models

mathematical reasoning

open problems

dataset scarcity

Innovation

Methods, ideas, or system contributions that make the work stand out.

research-level mathematics

multi-agent pipeline

reasoning trajectories