Nemotron-Math: Efficient Long-Context Distillation of Mathematical Reasoning from Multi-Mode Supervision

📅 2025-12-17
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing mathematical reasoning datasets suffer from limited diversity in reasoning styles, inadequate support for long-horizon trajectory modeling, and insufficient integration of external tools. Method: We construct a 7.5M-sample high-quality mathematical reasoning trajectory dataset spanning high/medium/low reasoning complexity and dual-path supervision (with and without Python tool invocation), curated from AoPS competition problems and StackExchange real-world Q&A. We propose a multi-mode, long-context (up to 128K tokens), tool-augmented supervised learning paradigm and introduce sequence bucketing for efficient long-sequence fine-tuning, accelerating training by 2–3× without accuracy loss. Leveraging gpt-oss-120b, we perform multi-mode generation combined with controlled evaluation and Tool-Integrated Reasoning (TIR). Results: Our approach achieves 100% majority@16 accuracy on AIME 2024/2025, substantially improves robustness and generalization on HLE-Math, and maintains state-of-the-art performance on standard competition benchmarks.
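The sequence bucketing idea can be sketched briefly: training samples are grouped by length so that each batch is padded only to its bucket boundary rather than the full 128K context. The sketch below is a minimal illustration under assumed bucket boundaries, not the authors' implementation.

```python
from collections import defaultdict

# Hypothetical bucket boundaries in tokens; the paper does not
# specify its actual boundaries for 128K-context training.
BUCKETS = [4_096, 16_384, 65_536, 131_072]

def bucket_for(num_tokens: int) -> int:
    """Return the smallest bucket boundary that fits the sequence."""
    for boundary in BUCKETS:
        if num_tokens <= boundary:
            return boundary
    raise ValueError(f"{num_tokens} tokens exceeds the 128K context")

def bucketize(samples):
    """Group samples by bucket so batches of mostly-short traces are
    padded to ~4K tokens instead of 128K, cutting wasted compute."""
    buckets = defaultdict(list)
    for sample in samples:
        buckets[bucket_for(len(sample["input_ids"]))].append(sample)
    return buckets
```

Since most solution traces are far shorter than 128K tokens, avoiding padding every batch to the maximum length is plausibly where the reported 2–3× speedup comes from.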

📝 Abstract
High-quality mathematical reasoning supervision requires diverse reasoning styles, long-form traces, and effective tool integration, capabilities that existing datasets provide only in limited form. Leveraging the multi-mode generation ability of gpt-oss-120b, we introduce Nemotron-Math, a large-scale mathematical reasoning dataset containing 7.5M solution traces across high, medium, and low reasoning modes, each available both with and without Python tool-integrated reasoning (TIR). The dataset integrates 85K curated AoPS problems with 262K community-sourced StackExchange-Math problems, combining structured competition tasks with diverse real-world mathematical queries. We conduct controlled evaluations to assess the dataset quality. Nemotron-Math consistently outperforms the original OpenMathReasoning on matched AoPS problems. Incorporating StackExchange-Math substantially improves robustness and generalization, especially on HLE-Math, while preserving accuracy on math competition benchmarks. To support efficient long-context training, we develop a sequential bucketed strategy that accelerates 128K context-length fine-tuning by 2–3× without significant accuracy loss. Overall, Nemotron-Math enables state-of-the-art performance, including 100% maj@16 accuracy on AIME 2024 and 2025 with Python TIR.
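For context on the headline metric: majority@16 (maj@16) counts a problem as solved when the most common final answer across 16 sampled solutions matches the reference. A minimal sketch, with hypothetical example values:

```python
from collections import Counter

def majority_at_k(answers: list[str], reference: str) -> bool:
    """Majority voting: solved if the most frequent sampled
    final answer equals the reference answer."""
    top_answer, _ = Counter(answers).most_common(1)[0]
    return top_answer == reference

# Example: 16 hypothetical sampled answers for one AIME problem.
samples = ["204"] * 10 + ["210"] * 4 + ["96"] * 2
print(majority_at_k(samples, "204"))  # True
```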
Problem

Research questions and friction points this paper is trying to address.

Existing datasets offer limited diversity in mathematical reasoning styles
Competition-only training data limits robustness on real-world mathematical queries
Long-context (128K) fine-tuning for advanced reasoning is computationally expensive
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-mode supervision from gpt-oss-120b for diverse reasoning traces, with and without Python tool calls (see the TIR sketch after this list)
Sequential bucketed strategy for efficient long-context fine-tuning
Integration of curated competition problems with real-world queries
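The tool-integrated reasoning loop referenced in the first item above can be sketched as a generate-execute cycle. Here `generate`, `run_python`, and the tool-call markup are hypothetical stand-ins, not the paper's actual generation interface.

```python
import re

def generate(context: str) -> str:
    """Hypothetical stand-in for a call to the teacher model
    (e.g. gpt-oss-120b); returns the next chunk of the solution."""
    raise NotImplementedError

def run_python(code: str) -> str:
    """Hypothetical sandboxed executor for model-emitted Python."""
    raise NotImplementedError

def tir_trace(problem: str, max_rounds: int = 8) -> str:
    """Build one tool-integrated reasoning trace: alternate model
    text with executed code until a step contains no tool call."""
    trace = problem
    for _ in range(max_rounds):
        step = generate(trace)
        trace += step
        # Assumed tool-call markup; the real protocol is unspecified.
        call = re.search(r"<python>(.*?)</python>", step, re.DOTALL)
        if call is None:  # no tool call, so the solution is complete
            break
        trace += "\n<output>" + run_python(call.group(1)) + "</output>\n"
    return trace
```

The no-tool path would be the same loop with tool calls disabled, yielding the dataset's paired with/without-TIR traces.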
Wei Du
Shubham Toshniwal
Senior Research Scientist, NVIDIA
Reasoning · Memory · NLP
Branislav Kisacanin
Sadegh Mahdavi
Ivan Moshkov
George Armstrong
Stephen Ge
Edgar Minasyan
Feng Chen
Igor Gitman
Applied Scientist, NVIDIA
Large Language Models · Math Reasoning · Deep Learning