🤖 AI Summary
To address the limited recursive reasoning capability of large language models (LLMs) in long-chain-of-thought (CoT) generation and their reliance on expert-annotated data, this paper proposes Alignment via Refinement (AvR). AvR introduces a novel refinement process modeling framework coupled with a refinement-aware differentiable reward optimization mechanism. It enables iterative critic-improve cycles to autonomously generate high-quality, extended CoT sequences at test time, dynamically scaling reasoning steps as needed. Crucially, AvR requires no human annotations—using only 3K synthetic samples, it boosts the win rate of LLaMA-3-8B-Instruct on AlpacaEval 2.0 by over 20 percentage points, substantially outperforming conventional preference optimization methods. This demonstrates strong generalization in recursive reasoning under few-shot settings.
📝 Abstract
The OpenAI o1-series models have demonstrated that leveraging long-form Chain of Thought (CoT) can substantially enhance performance. However, the recursive thinking capabilities of Large Language Models (LLMs) remain limited, particularly in the absence of expert-curated data for distillation. In this paper, we propose **AvR**: **Alignment via Refinement**, a novel method aimed at unlocking the potential of LLMs for recursive reasoning through long-form CoT. AvR introduces a refinement process that integrates criticism and improvement actions, guided by differentiable learning techniques to optimize **refinement-aware rewards**. As a result, the synthesized multi-round data can be organized as a long refinement thought, further enabling test-time scaling. Experimental results show that AvR significantly outperforms conventional preference optimization methods. Notably, with only 3k synthetic samples, our method boosts the performance of the LLaMA-3-8B-Instruct model by over 20% in win rate on AlpacaEval 2.0. Our code is available at GitHub (https://github.com/Banner-Z/AvR.git).
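The critic-improve cycle described above can be sketched as a simple loop: criticize the current response, revise it, and accumulate the trace as a long refinement thought. This is a minimal illustration only; the `critic` and `improve` functions below are trivial stand-ins for what would be LLM calls in AvR, and all names are hypothetical, not taken from the paper's code.

```python
from typing import Optional


def critic(response: str) -> Optional[str]:
    """Toy critic: return a criticism, or None if the response is acceptable.
    In AvR this role would be played by a model-generated critique."""
    if "step" not in response:
        return "Add explicit reasoning steps."
    return None


def improve(response: str, criticism: str) -> str:
    """Toy improver: revise the response according to the criticism.
    In AvR this would be a model-generated revision."""
    return response + " [revised, step added: " + criticism + "]"


def refine(initial_response: str, max_rounds: int = 3):
    """Iteratively criticize and improve until the critic is satisfied
    or the round budget is exhausted; the collected trace corresponds to
    the 'long refinement thought' used for test-time scaling."""
    response = initial_response
    trace = [initial_response]
    for _ in range(max_rounds):
        criticism = critic(response)
        if criticism is None:  # critic accepts the response; stop refining
            break
        response = improve(response, criticism)
        trace.append(response)
    return response, trace
```

Raising `max_rounds` lengthens the refinement thought, which is one way to read the paper's claim that reasoning can be dynamically scaled at test time.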