🤖 AI Summary
This work addresses the limitations of large language models in complex reasoning, where performance is often hindered by logical inconsistencies and disorganized reasoning structures. Existing reinforcement learning approaches typically rely solely on the correctness of final answers, neglecting the quality of intermediate reasoning steps. To address this, the paper introduces StaRPO, a stability-augmented reinforcement learning framework that decomposes reasoning stability into two lightweight, computable metrics: the Autocorrelation Function (ACF), which quantifies local coherence between consecutive reasoning steps, and Path Efficiency (PE), which measures global goal-directedness. These metrics serve as process-aware rewards integrated into policy optimization. Experiments on four reasoning benchmarks show that the proposed method consistently outperforms current baselines, improving both answer accuracy and logical stability, with ACF and PE shown to correlate strongly with logical errors.
📝 Abstract
Reinforcement learning (RL) is effective at enhancing the accuracy of large language models on complex reasoning tasks. However, existing RL policy optimization frameworks rely on final-answer correctness as the feedback signal and rarely capture the internal logical structure of the reasoning process. Consequently, models may generate responses that are fluent and semantically relevant but logically inconsistent, structurally erratic, or redundant. To address this, we propose StaRPO, a stability-augmented reinforcement learning framework that explicitly incorporates reasoning stability into the optimization objective. StaRPO decomposes stability into two lightweight, computable metrics: the Autocorrelation Function (ACF), which evaluates local step-to-step coherence, and Path Efficiency (PE), which evaluates the global goal-directedness of the reasoning trajectory. These stability rewards are combined with task rewards to provide complementary, process-aware feedback. We validate the ACF and PE rewards by showing their correlation with logical errors on two backbone models. Experiments on four reasoning benchmarks show that StaRPO consistently outperforms the compared baselines and enhances both final-answer accuracy and logical stability.
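The abstract does not give the exact formulas for ACF and PE, so the following is only an illustrative sketch of how such stability rewards *could* be computed and combined with a task reward. It assumes each reasoning step has been mapped to an embedding vector; the function names (`local_coherence`, `path_efficiency`, `starpo_reward`) and the weights `alpha` and `beta` are hypothetical, not taken from the paper.

```python
import numpy as np

def _cos(a, b):
    # Cosine similarity between two embedding vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def local_coherence(step_embs):
    # ACF-style proxy: mean lag-1 similarity between consecutive
    # reasoning-step embeddings (higher = smoother local transitions).
    sims = [_cos(step_embs[i], step_embs[i + 1])
            for i in range(len(step_embs) - 1)]
    return float(np.mean(sims))

def path_efficiency(step_embs):
    # Ratio of net displacement (first step -> final step) to total
    # path length; 1.0 corresponds to a perfectly goal-directed,
    # detour-free trajectory.
    net = np.linalg.norm(step_embs[-1] - step_embs[0])
    total = sum(np.linalg.norm(step_embs[i + 1] - step_embs[i])
                for i in range(len(step_embs) - 1))
    return float(net / (total + 1e-12))

def starpo_reward(task_reward, step_embs, alpha=0.3, beta=0.3):
    # Hypothetical combined reward: final-answer correctness plus the
    # two stability terms, with illustrative weights alpha and beta.
    return (task_reward
            + alpha * local_coherence(step_embs)
            + beta * path_efficiency(step_embs))
```

A trajectory whose step embeddings move in a straight line toward the answer would score 1.0 on both stability terms, while a meandering or self-contradictory chain would score lower, reducing the total reward fed to policy optimization.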