🤖 AI Summary
Why do some language models show sustained self-improvement under reinforcement learning (RL) while others quickly plateau? This work identifies a model's initial reasoning behaviors, rather than the correctness of its answers, as the critical determinant. Specifically, four cognitive behaviors (verification, backtracking, subgoal setting, and backward chaining) prove decisive for both test-time reasoning and RL efficacy. We prime models with examples from a controllable behavioral dataset that isolates and modulates these reasoning patterns, and we apply continued pretraining on OpenWebMath data filtered to amplify them. Experiments show that Llama-3.2-3B, primed in this way, matches and in some cases exceeds the self-improvement of Qwen-2.5-3B on the Countdown task. Crucially, even incorrect solutions that contain proper reasoning patterns drive RL learning effectively, indicating that the reasoning process matters more than the outcome for model self-improvement.
📝 Abstract
Test-time inference has emerged as a powerful paradigm for enabling language models to "think" longer and more carefully about complex challenges, much like skilled human experts. While reinforcement learning (RL) can drive self-improvement in language models on verifiable tasks, some models exhibit substantial gains while others quickly plateau. For instance, we find that Qwen-2.5-3B far exceeds Llama-3.2-3B under identical RL training for the game of Countdown. This discrepancy raises a critical question: what intrinsic properties enable effective self-improvement? We introduce a framework to investigate this question by analyzing four key cognitive behaviors (verification, backtracking, subgoal setting, and backward chaining) that both expert human problem solvers and successful language models employ. Our study reveals that Qwen naturally exhibits these reasoning behaviors, whereas Llama initially lacks them. In systematic experimentation with controlled behavioral datasets, we find that priming Llama with examples containing these reasoning behaviors enables substantial improvements during RL, matching or exceeding Qwen's performance. Importantly, the presence of reasoning behaviors, rather than correctness of answers, proves to be the critical factor: models primed with incorrect solutions containing proper reasoning patterns achieve comparable performance to those trained on correct solutions. Finally, leveraging continued pretraining with OpenWebMath data, filtered to amplify reasoning behaviors, enables the Llama model to match Qwen's self-improvement trajectory. Our findings establish a fundamental relationship between initial reasoning behaviors and the capacity for improvement, explaining why some language models effectively utilize additional computation while others plateau.
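To make "verifiable task" concrete, the following is a minimal sketch of an outcome check for a Countdown-style problem. The abstract does not state the exact rules or reward used, so everything here is an assumption for illustration: the function name `countdown_reward` is hypothetical, a solution is taken to be a single arithmetic expression over `+`, `-`, `*`, `/`, each provided number may be used at most once, and the reward is a binary 1.0 or 0.0.

```python
import ast
from collections import Counter
from fractions import Fraction

# Binary operators allowed in a candidate solution (an assumed rule set).
_OPS = {
    ast.Add: lambda a, b: a + b,
    ast.Sub: lambda a, b: a - b,
    ast.Mult: lambda a, b: a * b,
    ast.Div: lambda a, b: a / b,
}


def _evaluate(node, used):
    """Evaluate a parsed expression exactly, recording each integer it uses."""
    if isinstance(node, ast.Constant) and type(node.value) is int:
        used.append(node.value)
        return Fraction(node.value)
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        left = _evaluate(node.left, used)
        right = _evaluate(node.right, used)
        return _OPS[type(node.op)](left, right)
    raise ValueError("unsupported expression")


def countdown_reward(expression, numbers, target):
    """Hypothetical binary reward: 1.0 if the expression hits the target
    using each provided number at most once, otherwise 0.0."""
    try:
        tree = ast.parse(expression, mode="eval")
        used = []
        value = _evaluate(tree.body, used)
    except (SyntaxError, ValueError, ZeroDivisionError):
        return 0.0
    # Any number used more often than it was provided makes the answer invalid.
    if Counter(used) - Counter(numbers):
        return 0.0
    return 1.0 if value == target else 0.0
```

For example, `countdown_reward("50 - 25 + 3", [3, 7, 25, 50], 28)` returns `1.0`, while an expression that reuses a number or misses the target returns `0.0`. A check of this kind scores only the final answer, which is why the finding that incorrect but well-structured solutions still help during priming is notable.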