🤖 AI Summary
This study investigates whether large-scale long-chain-of-thought (Long-CoT) data can elicit “slow thinking” capabilities in large language models. To this end, we propose RedStar, the first framework to scale Long-CoT data to one million samples, and empirically demonstrate that even a small number of high-difficulty examples suffices to substantially activate deep reasoning. Methodologically, RedStar introduces an RL-scale training paradigm integrating multi-scale supervised fine-tuning, large-scale reinforcement learning, and joint code-mathematical modeling. Experiments show that RedStar-code-math achieves 81.6% accuracy on MATH-Hard (+15.4 percentage points) and solves 46.7% of AIME problems, using only 21K samples. RedStar-Geo outperforms QvQ-Preview on multimodal geometric reasoning benchmarks including GeoQA. Collectively, these results validate a synergistic pathway in which data scale and quality jointly drive improvements in slow-thinking capability.
📝 Abstract
Can scaling transform reasoning? In this work, we explore the untapped potential of scaling Long Chain-of-Thought (Long-CoT) data to 1000k samples, pioneering the development of a slow-thinking model, RedStar. Through extensive experiments with various LLMs of different sizes, we uncover the ingredients for specialization and scale in Long-CoT training. Surprisingly, even smaller models show significant performance gains with limited data, revealing the sample efficiency of Long-CoT and the critical role of sample difficulty in the learning process. Our findings demonstrate that Long-CoT reasoning can be effectively triggered with just a few thousand examples, while larger models achieve unparalleled improvements. We also introduce reinforcement learning (RL)-scale training as a promising direction for advancing slow-thinking systems. RedStar shines across domains: on the MATH-Hard benchmark, RedStar-code-math boosts performance from 66.2% to 81.6%, and on the American Invitational Mathematics Examination (AIME), it solves 46.7% of problems using only a 21k mixed code-math dataset. In multimodal tasks like GeoQA and MathVista-GEO, RedStar-Geo achieves competitive results with minimal Long-CoT data, outperforming other slow-thinking systems like QvQ-Preview. Compared to QwQ, RedStar strikes a strong balance between reasoning and generalizability. Our work highlights that, with careful tuning, scaling Long-CoT can unlock extraordinary reasoning capabilities, even with limited data, and sets a new standard for slow-thinking models across diverse challenges. Our data and models are released at https://huggingface.co/RedStar-Reasoning.