🤖 AI Summary
This study investigates whether large-scale long-chain-of-thought (Long-CoT) data can elicit “slow thinking” capabilities in large language models. To this end, we propose RedStar, the first framework to scale Long-CoT data to one million samples, and empirically demonstrate that even a small number of high-difficulty examples suffices to substantially activate deep reasoning. Methodologically, RedStar introduces an RL-scale training paradigm integrating multi-scale supervised fine-tuning, large-scale reinforcement learning, and joint code-mathematical modeling. Experiments show that RedStar-code-math achieves 81.6% accuracy on MATH-Hard (+15.4 percentage points) and solves 46.7% of AIME problems, using only 21K samples. RedStar-Geo outperforms QvQ-Preview on multimodal geometric reasoning benchmarks including GeoQA. Collectively, these results validate a synergistic pathway in which data scale and quality jointly drive improvements in slow-thinking capability.
📝 Abstract
Can scaling transform reasoning? In this work, we explore the untapped potential of scaling Long Chain-of-Thought (Long-CoT) data to 1000k samples, pioneering the development of a slow-thinking model, RedStar. Through extensive experiments with various LLMs of different sizes, we uncover the ingredients for specialization and scale in Long-CoT training. Surprisingly, even smaller models show significant performance gains with limited data, revealing the sample efficiency of Long-CoT and the critical role of sample difficulty in the learning process. Our findings demonstrate that Long-CoT reasoning can be effectively triggered with just a few thousand examples, while larger models achieve unparalleled improvements. We also introduce reinforcement learning (RL)-scale training as a promising direction for advancing slow-thinking systems. RedStar shines across domains: on the MATH-Hard benchmark, RedStar-code-math boosts performance from 66.2% to 81.6%, and on the American Invitational Mathematics Examination (AIME), it solves 46.7% of problems using only a 21k mixed code-math dataset. In multimodal tasks like GeoQA and MathVista-GEO, RedStar-Geo achieves competitive results with minimal Long-CoT data, outperforming other slow-thinking systems like QvQ-Preview. Compared to QwQ, RedStar strikes a strong balance between reasoning and generalizability. Our work highlights that, with careful tuning, scaling Long-CoT can unlock extraordinary reasoning capabilities, even with limited data, and sets a new standard for slow-thinking models across diverse challenges. Our data and models are released at https://huggingface.co/RedStar-Reasoning.