OpenThoughts: Data Recipes for Reasoning Models

📅 2025-06-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
High-quality, publicly available reasoning data is scarce, which hinders training open-source reasoning models from scratch. To address this, we propose a systematic, reproducible recipe for constructing open reasoning data: a data generation pipeline that combines controllable synthetic data generation, multi-stage filtering, and augmentation, validated through more than 1,000 controlled experiments. Using state-of-the-art teacher models (e.g., QwQ-32B), we perform chain-of-thought distillation to produce OpenThinker3, the first fully open, reproducible reasoning dataset and model family to match the performance of closed-source distilled models. The OpenThinker3-7B variant achieves state-of-the-art results on AIME 2025 (53%), LiveCodeBench (51%), and GPQA Diamond (54%). All code, data, and models are publicly released under permissive open-source licenses.
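The distill-and-filter loop described in the summary can be sketched roughly as follows. This is a minimal illustration, not the actual OpenThoughts pipeline: the teacher call, the filter criteria, and all function names here are hypothetical stand-ins (a real run would sample QwQ-32B and apply the paper's multi-stage filters).

```python
def teacher_generate(question: str) -> str:
    """Stand-in for sampling a chain-of-thought trace from a teacher
    model such as QwQ-32B; here it returns a canned trace."""
    return f"<think>reasoning about: {question}</think> answer: 42"

def passes_filters(trace: str) -> bool:
    """Stand-in for multi-stage filtering (e.g. format checks,
    deduplication, length limits, answer verification)."""
    return "<think>" in trace and len(trace) < 4096

def build_sft_dataset(questions: list[str]) -> list[dict]:
    """Distillation loop: sample one teacher trace per question and
    keep only the (prompt, completion) pairs that survive filtering."""
    dataset = []
    for q in questions:
        trace = teacher_generate(q)
        if passes_filters(trace):
            dataset.append({"prompt": q, "completion": trace})
    return dataset

data = build_sft_dataset(["What is 6*7?", "Prove sqrt(2) is irrational."])
```

The resulting list of prompt/completion pairs is what a supervised fine-tuning run would consume; the paper's contribution is systematically optimizing each stage of this loop (question sourcing, teacher choice, filtering) via controlled experiments.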

📝 Abstract
Reasoning models have made rapid progress on many benchmarks involving math, code, and science. Yet there are still many open questions about the best training recipes for reasoning, since state-of-the-art models often rely on proprietary datasets with little to no public information available. To address this, the goal of the OpenThoughts project is to create open-source datasets for training reasoning models. After initial explorations, our OpenThoughts2-1M dataset led to OpenThinker2-32B, the first model trained on public reasoning data to match DeepSeek-R1-Distill-32B on standard reasoning benchmarks such as AIME and LiveCodeBench. We then improve our dataset further by systematically investigating each step of our data generation pipeline with 1,000+ controlled experiments, which led to OpenThoughts3. Scaling the pipeline to 1.2M examples and using QwQ-32B as teacher yields our OpenThinker3-7B model, which achieves state-of-the-art results: 53% on AIME 2025, 51% on LiveCodeBench 06/24-01/25, and 54% on GPQA Diamond. All of our datasets and models are available at https://openthoughts.ai.
Problem

Research questions and friction points this paper is trying to address.

Creating open-source datasets for training reasoning models
Improving reasoning model performance with public data
Systematically optimizing data generation for better benchmarks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Open-source datasets for training reasoning models
Systematic data generation pipeline with controlled experiments
State-of-the-art results with OpenThinker3-7B model
Authors
E. Guha (Stanford University)
Ryan Marten (BespokeLabs.ai)
Sedrick Scott Keh (Toyota Research Institute)
Negin Raoof
G. Smyrnis
Hritik Bansal (University of California, Los Angeles | Indian Institute of Technology Delhi)
Marianna Nezhurina (JSC, LAION)
Jean-Pierre Mercat (Toyota Research Institute)
Trung Vu (BespokeLabs.ai)
Zayne Sprague (University of Texas at Austin)
Ashima Suvarna (University of California, Los Angeles)
Ben Feuer
Liangyu Chen (Stanford University)
Zaid Khan (UNC Chapel Hill)
Eric Frankel (University of Washington)
Sachin Grover (ASU)
Caroline Choi (Stanford University)
Niklas Muennighoff (Stanford University)
Shiye Su (Stanford University)
Wanjia Zhao (Stanford University)
John Yang (Stanford University)
Shreyas Pimpalgaonkar (BespokeLabs.ai)
Kartik Sharma (Georgia Institute of Technology)
Charlie Cheng-Jie Ji (BespokeLabs.ai)
Yichuan Deng (University of Washington)
Sarah Pratt (University of Washington)
V. Ramanujan (University of Washington)
Jon Saad-Falcon (Stanford University)
Jeffrey Li (University of Washington)
Achal Dave (Toyota Research Institute)
Alon Albalak (Lila Sciences)
Kushal Arora (Toyota Research Institute)
Blake Wulfe (Toyota Research Institute)
Chinmay Hegde (New York University)
Greg Durrett (New York University)
Sewoong Oh (University of Washington)
Mohit Bansal (UNC Chapel Hill)
Saadia Gabriel (UCLA)
Aditya Grover (UCLA)
Kai-Wei Chang (UCLA)
Vaishaal Shankar (Apple)
Aaron Gokaslan (Cornell University)
Mike A. Merrill (Stanford University)
Tatsunori Hashimoto (Stanford University)
Yejin Choi (Stanford University / NVIDIA)
J. Jitsev (JSC, LAION)
Reinhard Heckel (Technical University of Munich and Rice University)
M. Sathiamoorthy (BespokeLabs.ai)
Alexandros G. Dimakis (UC Berkeley, UT Austin)
Ludwig Schmidt (Stanford University and Anthropic)