🤖 AI Summary
Conventional wisdom holds that small language models (SLMs) are inherently limited in mathematical reasoning capability due to their restricted parameter count. Method: This paper introduces VibeThinker-1.5B, a 1.5-billion-parameter model grounded in the Spectrum-to-Signal Principle (SSP), which optimizes reasoning through two synergistic mechanisms: (i) two-stage diversity-exploring distillation for data-level knowledge transfer, and (ii) maximum-entropy-guided policy optimization for strategy-level refinement. Contribution/Results: VibeThinker-1.5B achieves strong reasoning performance for its size, surpassing the 400× larger DeepSeek R1 on the math benchmarks AIME24/25 and HMMT25, and edging out Magistral Medium on LiveCodeBench v6 with a score of 51.1. Crucially, it attains this with only $7,800 in total training cost, drastically reducing computational requirements and pointing toward a paradigm of efficient, high-performance reasoning in compact models.
📝 Abstract
Challenging the prevailing consensus that small models inherently lack robust reasoning, this report introduces VibeThinker-1.5B, a 1.5B-parameter dense model developed via our Spectrum-to-Signal Principle (SSP). This stands in contrast to the dominant strategy of scaling model parameters to enhance capability, exemplified by models such as DeepSeek R1 (671B) and Kimi k2 (>1T). The SSP framework first employs Two-Stage Diversity-Exploring Distillation (SFT) to generate a broad spectrum of solutions, then applies MaxEnt-Guided Policy Optimization (RL) to amplify the correct signal. With a total training cost of only $7,800, VibeThinker-1.5B demonstrates reasoning capabilities superior to closed-source models such as Magistral Medium and Claude Opus 4, and on par with open-source models such as GPT OSS-20B Medium. Remarkably, it surpasses the 400× larger DeepSeek R1 on three math benchmarks: AIME24 (80.3 vs. 79.8), AIME25 (74.4 vs. 70.0), and HMMT25 (50.4 vs. 41.7). This is a substantial improvement over its base model's scores of 6.7, 4.3, and 0.6, respectively. On LiveCodeBench V6 it scores 51.1, outperforming Magistral Medium's 50.3 and its base model's 0.0. These findings demonstrate that small models can achieve reasoning capabilities comparable to large models, drastically reducing training and inference costs and thereby democratizing advanced AI research.
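The abstract only names the two SSP stages; as a rough illustration of the second stage (not the paper's actual formulation, and with all function and variable names invented here), a maximum-entropy-guided policy objective can be sketched as a reward-weighted log-likelihood plus an entropy bonus that preserves the solution diversity produced by the SFT stage:

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def maxent_pg_objective(logits, actions, rewards, entropy_coef=0.01):
    """Toy policy-gradient objective with a maximum-entropy bonus.

    The entropy term rewards the policy for keeping its action
    distribution broad (the "spectrum"), while the reward-weighted
    log-likelihood amplifies correct answers (the "signal").
    Illustrative only; not VibeThinker's actual training code.
    """
    probs = softmax(logits)                       # (batch, n_actions)
    log_probs = np.log(probs + 1e-12)
    # Log-probability of the sampled action for each example.
    chosen = log_probs[np.arange(len(actions)), actions]
    # Shannon entropy of each per-example action distribution.
    entropy = -(probs * log_probs).sum(axis=-1)
    # Objective to maximize: reward-weighted likelihood + entropy bonus.
    return (rewards * chosen + entropy_coef * entropy).mean()

# Toy usage: two examples, three candidate actions each.
logits = np.array([[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]])
actions = np.array([0, 1])          # sampled actions
rewards = np.array([1.0, -0.5])     # e.g. a verifier pass/fail signal
obj = maxent_pg_objective(logits, actions, rewards)
```

With `entropy_coef=0` this reduces to a plain reward-weighted policy gradient; raising the coefficient trades some likelihood concentration for broader exploration.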