🤖 AI Summary
Short-chain-of-thought (short-CoT) LLMs face a cold-start challenge when scaling inference-time reasoning, and current practice relies heavily on existing large reasoning models (e.g., R1) for high-quality chain-of-thought (CoT) data. Method: We introduce the first high-quality, 100K-sample long-CoT dataset annotated entirely with short-CoT LLMs, removing the dependence on existing LRMs. Our novel “controllable long-thought induction” pipeline integrates multi-stage prompting, strategy distillation, and an adjustable thought budget at inference time to inject o1-style reasoning strategies into short-CoT models, augmented by human verification and automated quality assessment. Contribution/Results: The dataset reaches quality comparable to, or slightly below, R1; models initialized on it achieve 2–3× larger reinforcement learning gains under RLVR and show significantly improved general reasoning capabilities. This work establishes a foundational data resource and technical paradigm for cold-start training and efficient fine-tuning of open-source large reasoning models (LRMs).
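To make the pipeline idea concrete, here is a minimal sketch of multi-stage prompting with an adjustable thought budget, assuming a generic `generate(prompt, max_tokens)` call to any short-CoT LLM. The strategy prompts, the `induce_long_cot` helper, and the budget accounting are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of "controllable long-thought induction": repeatedly
# prompt a short-CoT LLM to extend its rationale with o1-style strategies
# until a thought budget is exhausted, then force a final answer.
from typing import Callable

# Illustrative strategy prompts (the paper's actual strategies may differ).
STRATEGY_PROMPTS = [
    "Decompose the problem into smaller subgoals and solve them in order.",
    "Check each intermediate step; if one looks wrong, backtrack and retry.",
    "Before answering, verify the result against the original question.",
]

def induce_long_cot(
    question: str,
    generate: Callable[[str, int], str],  # any LLM call: (prompt, max_tokens) -> text
    thought_budget: int = 2048,           # controllability knob over chain length
    stage_tokens: int = 512,
) -> str:
    """Multi-stage prompting: expand the rationale stage by stage until
    the thought budget is spent, mitigating overthinking by construction."""
    rationale, spent = "", 0
    for strategy in STRATEGY_PROMPTS:
        if spent >= thought_budget:
            break  # budget exhausted: stop thinking
        prompt = (
            f"Question: {question}\n"
            f"Reasoning so far:\n{rationale}\n"
            f"Continue reasoning. Strategy: {strategy}"
        )
        step = generate(prompt, min(stage_tokens, thought_budget - spent))
        rationale += "\n" + step
        spent += len(step.split())  # crude token proxy, fine for a sketch
    final = generate(
        f"Question: {question}\nReasoning:\n{rationale}\nNow give the final answer.",
        256,
    )
    return rationale + "\n\nFinal answer: " + final
```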
📝 Abstract
With the release of R1, a publicly available large reasoning model (LRM), researchers commonly build new LRMs by training language models on R1's long chain-of-thought (CoT) reasoning traces. While prior works show that LRMs' capabilities can be reproduced through direct distillation, the continued reliance on existing models (e.g., R1) remains a critical limitation in advancing the field. As a first step toward independent LRM development, this paper explores the possibility of constructing a long CoT dataset with LLMs that are not trained for inference-time scaling. To this end, we present the Long CoT Collection, a dataset of 100K CoT rationales annotated using existing short CoT LLMs. We develop a pipeline that induces o1's novel reasoning strategies into short CoT LLMs, enabling them to think longer and introducing controllability over the thought budget to better manage the overthinking problem. Our extensive analyses validate that our dataset achieves quality comparable to--or slightly below--R1. Furthermore, our experiments demonstrate that training on our dataset not only strengthens general reasoning skills, but also provides a strong foundation for reinforcement learning--models initialized on our data achieve 2-3x larger gains with RLVR.
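Since the reported gains come from RLVR (reinforcement learning with verifiable rewards), a minimal sketch of the binary verifiable reward such training typically optimizes may help; the `\boxed{}` answer extraction and exact-match rule below are assumptions for illustration, not the paper's evaluator.

```python
# Minimal sketch of a verifiable reward for RLVR, assuming final answers
# are wrapped in \boxed{...} at the end of a long CoT rationale.
import re

def extract_answer(completion: str) -> str | None:
    """Pull the last boxed answer out of a model completion."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", completion)
    return matches[-1].strip() if matches else None

def verifiable_reward(completion: str, gold: str) -> float:
    """Binary reward: 1.0 iff the final answer exactly matches the reference."""
    pred = extract_answer(completion)
    return 1.0 if pred is not None and pred == gold.strip() else 0.0

# Usage: a policy first fine-tuned on the Long CoT Collection is then
# optimized against this reward with an RL algorithm such as PPO or GRPO.
print(verifiable_reward(r"... therefore \boxed{42}", "42"))  # 1.0
```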