RedStar: Does Scaling Long-CoT Data Unlock Better Slow-Reasoning Systems?

📅 2025-01-20
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study investigates whether large-scale long chain-of-thought (Long-CoT) data can elicit “slow thinking” capabilities in large language models. To this end, we propose RedStar, the first framework to scale Long-CoT data to one million samples, and empirically demonstrate that even a small number of high-difficulty examples suffices to substantially activate deep reasoning. Methodologically, RedStar introduces an RL-scale training paradigm that integrates multi-scale supervised fine-tuning, large-scale reinforcement learning, and joint code–math modeling. Experiments show that RedStar-code-math reaches 81.6% accuracy on MATH-Hard (+15.4 percentage points) and solves 46.7% of AIME problems, using only 21K samples. RedStar-Geo outperforms QvQ-Preview on multimodal geometric reasoning benchmarks including GeoQA. Collectively, these results validate a synergistic pathway in which data scale and data quality jointly drive gains in slow-thinking capability.
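The pipeline sketched above (Long-CoT supervised fine-tuning followed by large-scale RL) is straightforward to prototype. Below is a minimal, hypothetical sketch of the SFT stage using Hugging Face transformers; the model name, data path, field names, and hyperparameters are illustrative assumptions, not the paper's actual configuration.

```python
# Minimal sketch of the Long-CoT SFT stage. All names below
# (model, file, fields, hyperparameters) are assumptions for
# illustration, not RedStar's actual setup.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

MODEL = "Qwen/Qwen2.5-7B-Instruct"  # any strong base/instruct model

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

# Each record is assumed to hold a problem and a long reasoning trace.
data = load_dataset("json", data_files="long_cot_samples.jsonl")["train"]

def format_and_tokenize(example):
    # Concatenate prompt and Long-CoT solution into one training sequence;
    # long max_length matters, since slow-thinking traces are long.
    text = f"Problem: {example['problem']}\nSolution: {example['long_cot']}"
    return tokenizer(text, truncation=True, max_length=8192)

data = data.map(format_and_tokenize, remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="redstar-sft-sketch",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        learning_rate=1e-5,
        num_train_epochs=2,
        bf16=True,
    ),
    train_dataset=data,
    # mlm=False gives standard next-token (causal LM) labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```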

📝 Abstract
Can scaling transform reasoning? In this work, we explore the untapped potential of scaling Long Chain-of-Thought (Long-CoT) data to 1000k samples, pioneering the development of a slow-thinking model, RedStar. Through extensive experiments with various LLMs of different sizes, we uncover the ingredients of specialization and scale for Long-CoT training. Surprisingly, even smaller models show significant performance gains with limited data, revealing the sample efficiency of Long-CoT and the critical role of sample difficulty in the learning process. Our findings demonstrate that Long-CoT reasoning can be effectively triggered with just a few thousand examples, while larger models achieve unparalleled improvements. We also introduce reinforcement learning (RL)-scale training as a promising direction for advancing slow-thinking systems. RedStar shines across domains: on the MATH-Hard benchmark, RedStar-code-math boosts performance from 66.2% to 81.6%, and on the American Invitational Mathematics Examination (AIME), it solves 46.7% of problems using only a 21k mixed code-math dataset. In multimodal tasks like GeoQA and MathVista-GEO, RedStar-Geo achieves competitive results with minimal Long-CoT data, outperforming other slow-thinking systems like QvQ-Preview. Compared to QwQ, RedStar strikes the perfect balance between reasoning and generalizability. Our work highlights that, with careful tuning, scaling Long-CoT can unlock extraordinary reasoning capabilities, even with a limited dataset, and set a new standard for slow-thinking models across diverse challenges. Our data and models are released at https://huggingface.co/RedStar-Reasoning.
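The abstract credits RL-scale training for much of the gain on math benchmarks. A common way to drive such training is a verifiable outcome reward on the final answer; the sketch below is a hypothetical version of that idea (the \boxed{...} answer convention and the function names are assumptions, not the paper's documented reward design).

```python
# Hypothetical outcome-based reward for an RL-scale stage: score a
# sampled Long-CoT completion 1.0 if its final answer matches the
# reference, else 0.0. The \boxed{...} convention is an assumption.
import re

def extract_final_answer(completion: str) -> str | None:
    """Return the last \\boxed{...} expression in a completion, if any."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return matches[-1].strip() if matches else None

def outcome_reward(completion: str, reference: str) -> float:
    """Binary correctness reward usable as an RL training signal."""
    answer = extract_final_answer(completion)
    return 1.0 if answer is not None and answer == reference.strip() else 0.0

# Example: a correct AIME-style integer answer earns full reward.
assert outcome_reward("... so the answer is \\boxed{204}.", "204") == 1.0
assert outcome_reward("... I think it is \\boxed{210}.", "204") == 0.0
```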
Problem

Research questions and friction points this paper is trying to address.

Deep Learning
Cognitive Systems
Data Scaling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Long-chain Thinking
Reinforcement Learning
Adaptability and Generalization
Authors
Haotian Xu
Xiaohongshu Inc
Xing Wu
Xiaohongshu Inc
Weinong Wang
Xi'an Jiaotong University
LLM / VLLM / RL
Zhongzhi Li
Institute of Automation, Chinese Academy of Sciences
LLM / NLP / Math Reasoning
Da Zheng
Amazon
High-performance computing / Data-intensive computing / Large-scale machine learning / Graph neural networks
Boyuan Chen
Institute for Artificial Intelligence, Peking University
Yi Hu
Institute for Artificial Intelligence, Peking University
Shijia Kang
Peking University
LLMs
Jiaming Ji
Institute for Artificial Intelligence, Peking University
Yingying Zhang
East China Normal University
Zhijiang Guo
HKUST (GZ) | HKUST
Natural Language Processing / Machine Learning / Large Language Models
Yaodong Yang
Institute for Artificial Intelligence, Peking University
Muhan Zhang
Peking University
Machine Learning / Graph Neural Networks / Large Language Models
Debing Zhang
Xiaohongshu
Machine Learning / Computer Vision / Deep Learning