Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't

📅 2025-03-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study examines how to improve mathematical reasoning in a small language model (DeepSeek-R1-Distill-Qwen-1.5B, 1.5B parameters) under tight resource constraints. To work within limited compute and limited high-quality training data, the authors (1) adapt Group Relative Policy Optimization (GRPO) to a minimal hardware budget of 4 NVIDIA A40 GPUs for at most 24 hours; (2) curate a compact, high-quality mathematical reasoning dataset of only 7,000 samples, bringing the training cost to $42; and (3) run three experiments to study model behavior and performance. Reported results include AMC23 accuracy rising from 63% to 80% and AIME24 accuracy reaching 46.7%, surpassing o1-preview. Prolonged training, however, exposed optimization instability and length constraints. Code and datasets are publicly released to support reproducibility and further research.

📝 Abstract
Enhancing the reasoning capabilities of large language models (LLMs) typically relies on massive computational resources and extensive datasets, limiting accessibility for resource-constrained settings. Our study investigates the potential of reinforcement learning (RL) to improve reasoning in small LLMs, focusing on a 1.5-billion-parameter model, DeepSeek-R1-Distill-Qwen-1.5B, under strict constraints: training on 4 NVIDIA A40 GPUs (48 GB VRAM each) within 24 hours. Adapting the Group Relative Policy Optimization (GRPO) algorithm and curating a compact, high-quality mathematical reasoning dataset, we conducted three experiments to explore model behavior and performance. Our results demonstrate rapid reasoning gains - e.g., AMC23 accuracy rising from 63% to 80% and AIME24 reaching 46.7%, surpassing o1-preview - using only 7,000 samples and a $42 training cost, compared to thousands of dollars for baseline models. However, challenges such as optimization instability and length constraints emerged with prolonged training. These findings highlight the efficacy of RL-based fine-tuning for small LLMs, offering a cost-effective alternative to large-scale approaches. We release our code and datasets as open-source resources, providing insights into trade-offs and laying a foundation for scalable, reasoning-capable LLMs in resource-limited environments. All are available at https://github.com/knoveleng/open-rs.
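
The abstract names GRPO but does not spell out how it scores sampled answers. As a rough illustration, the core idea in the commonly published form of GRPO is to sample several responses to the same prompt and normalize each response's reward against the mean and standard deviation of its own group, so no separate value model is needed. The sketch below assumes that standard formulation; the function name, the `eps` term, the use of the population standard deviation, and the example rewards are illustrative assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled response relative to its own group.

    rewards: rewards for several responses sampled from the SAME prompt.
    eps: small constant (an assumption here) that avoids division by zero
         when every response in the group earns the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; a sample std is also plausible
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one math problem
# (1.0 = correct final answer, 0.0 = incorrect).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Correct answers receive positive advantages and incorrect ones negative, and a group in which all answers earn the same reward contributes no learning signal. The full algorithm also involves a clipped policy-ratio objective and a KL penalty, which this sketch omits.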

Problem

Research questions and friction points this paper is trying to address.

Improving reasoning in small LLMs using reinforcement learning
Exploring cost-effective training under strict computational constraints
Addressing optimization instability and length constraints in training

Innovation

Methods, ideas, or system contributions that make the work stand out.

RL fine-tuning improves reasoning in a 1.5B-parameter LLM (AMC23: 63% to 80%; AIME24: 46.7%)
GRPO adapted to a budget of 4 A40 GPUs and 24 hours of training
Compact 7,000-sample dataset keeps the training cost to $42
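
The hardware budget and the reported cost can be related with simple arithmetic. The sketch below assumes, purely for illustration, that the $42 figure covers the full 24-hour budget on all 4 GPUs; the abstract does not state how the cost was computed, so the implied hourly rate is a back-of-the-envelope estimate rather than a reported number.

```python
# Figures stated in the abstract.
num_gpus = 4
max_hours = 24
reported_cost_usd = 42

# Upper bound on compute used, if the whole time budget was spent.
gpu_hours = num_gpus * max_hours

# Implied price per GPU-hour under the assumption above (not a reported value).
implied_rate = reported_cost_usd / gpu_hours

print(f"{gpu_hours} GPU-hours, about ${implied_rate:.4f} per GPU-hour")
```

If training finished in less than 24 hours, the implied hourly rate would be correspondingly higher.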
🔎 Similar Papers