🤖 AI Summary
High-quality public reasoning data is scarce, which makes it difficult to train open-source reasoning models from scratch. To address this, the OpenThoughts project develops a systematic, reproducible paradigm for open reasoning data construction: a data generation pipeline combining controllable synthetic data generation, multi-stage filtering, and augmentation, with each stage validated through 1,000+ controlled experiments. Using QwQ-32B as the teacher for chain-of-thought distillation over the resulting 1.2M-example OpenThoughts3 dataset produces the OpenThinker3 model family; the project's earlier OpenThinker2-32B was the first model trained on public reasoning data to match DeepSeek-R1-Distill-32B on standard reasoning benchmarks. The OpenThinker3-7B variant achieves state-of-the-art results: 53% on AIME 2025, 51% on LiveCodeBench, and 54% on GPQA Diamond. All code, data, and models are publicly released under permissive open-source licenses.
📝 Abstract
Reasoning models have made rapid progress on many benchmarks involving math, code, and science. Yet there are still many open questions about the best training recipes for reasoning, since state-of-the-art models often rely on proprietary datasets with little to no public information available. To address this, the goal of the OpenThoughts project is to create open-source datasets for training reasoning models. After initial explorations, our OpenThoughts2-1M dataset led to OpenThinker2-32B, the first model trained on public reasoning data to match DeepSeek-R1-Distill-32B on standard reasoning benchmarks such as AIME and LiveCodeBench. We then improved our dataset further by systematically investigating each step of our data generation pipeline with 1,000+ controlled experiments, which led to OpenThoughts3. Scaling the pipeline to 1.2M examples and using QwQ-32B as the teacher yields our OpenThinker3-7B model, which achieves state-of-the-art results: 53% on AIME 2025, 51% on LiveCodeBench 06/24-01/25, and 54% on GPQA Diamond. All of our datasets and models are available at https://openthoughts.ai.
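The distillation pipeline described above — sourcing questions, sampling chain-of-thought traces from a teacher model, then filtering the results — can be sketched as follows. This is a minimal illustration, not the paper's implementation: `call_teacher` is a hypothetical stand-in for querying QwQ-32B, and the filters shown (exact-match deduplication, a length floor) are only the simplest stages of the multi-stage filtering the project actually studied.

```python
from dataclasses import dataclass

@dataclass
class Example:
    question: str
    cot: str      # chain-of-thought reasoning trace
    answer: str   # final answer parsed from the trace

def call_teacher(question: str) -> Example:
    # Placeholder teacher: a real pipeline would sample a long reasoning
    # trace from a model such as QwQ-32B and parse out the final answer.
    return Example(question, cot=f"Let's reason step by step about: {question}", answer="42")

def deduplicate(examples: list[Example]) -> list[Example]:
    # Keep the first occurrence of each question (exact-match dedup;
    # stronger semantic filtering would come on top of this).
    seen, out = set(), []
    for ex in examples:
        if ex.question not in seen:
            seen.add(ex.question)
            out.append(ex)
    return out

def filter_by_length(examples: list[Example], min_cot_chars: int = 10) -> list[Example]:
    # Drop degenerate traces too short to contain real reasoning.
    return [ex for ex in examples if len(ex.cot) >= min_cot_chars]

def build_dataset(questions: list[str]) -> list[Example]:
    # Generate with the teacher, then apply the filtering stages in order.
    examples = [call_teacher(q) for q in questions]
    return filter_by_length(deduplicate(examples))

if __name__ == "__main__":
    qs = ["What is 6*7?", "What is 6*7?", "Prove that 1+1=2."]
    data = build_dataset(qs)
    print(len(data))  # duplicate question removed → 2
```

In the real pipeline, scaling this loop to 1.2M examples and varying each stage (question sources, teacher choice, filter thresholds) is what the 1,000+ controlled experiments evaluated.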