🤖 AI Summary
To address the limited recursive reasoning capability of large language models (LLMs) in long-chain-of-thought (CoT) generation and their reliance on expert-annotated data, this paper proposes Alignment via Refinement (AvR). AvR introduces a novel refinement process modeling framework coupled with a refinement-aware differentiable reward optimization mechanism. It enables iterative critic-improve cycles to autonomously generate high-quality, extended CoT sequences at test time, dynamically scaling reasoning steps as needed. Crucially, AvR requires no human annotations—using only 3K synthetic samples, it boosts the win rate of LLaMA-3-8B-Instruct on AlpacaEval 2.0 by over 20 percentage points, substantially outperforming conventional preference optimization methods. This demonstrates strong generalization in recursive reasoning under few-shot settings.
📝 Abstract
The OpenAI o1-series models have demonstrated that leveraging long-form Chain of Thought (CoT) can substantially enhance performance. However, the recursive thinking capabilities of Large Language Models (LLMs) remain limited, particularly in the absence of expert-curated data for distillation. In this paper, we propose **AvR**: **Alignment via Refinement**, a novel method aimed at unlocking the potential of LLMs for recursive reasoning through long-form CoT. AvR introduces a refinement process that integrates criticism and improvement actions, guided by differentiable learning techniques to optimize **refinement-aware rewards**. As a result, the synthesized multi-round data can be organized as a long refinement thought, further enabling test-time scaling. Experimental results show that AvR significantly outperforms conventional preference optimization methods. Notably, with only 3k synthetic samples, our method boosts the performance of the LLaMA-3-8B-Instruct model by over 20% in win rate on AlpacaEval 2.0. Our code is available at GitHub (https://github.com/Banner-Z/AvR.git).
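The critic-improve cycle described above can be sketched as a simple loop: criticize the current response, revise it, and accumulate the trace as a long refinement thought. This is a minimal illustration only; the `critic` and `improve` functions below are trivial stand-ins for what would be LLM calls in AvR, and all names are hypothetical, not taken from the paper's code.

```python
from typing import Optional


def critic(response: str) -> Optional[str]:
    """Toy critic: return a criticism, or None if the response is acceptable.
    In AvR this role would be played by a model-generated critique."""
    if "step" not in response:
        return "Add explicit reasoning steps."
    return None


def improve(response: str, criticism: str) -> str:
    """Toy improver: revise the response according to the criticism.
    In AvR this would be a model-generated revision."""
    return response + " [revised, step added: " + criticism + "]"


def refine(initial_response: str, max_rounds: int = 3):
    """Iteratively criticize and improve until the critic is satisfied
    or the round budget is exhausted; the collected trace corresponds to
    the 'long refinement thought' used for test-time scaling."""
    response = initial_response
    trace = [initial_response]
    for _ in range(max_rounds):
        criticism = critic(response)
        if criticism is None:  # critic accepts the response; stop refining
            break
        response = improve(response, criticism)
        trace.append(response)
    return response, trace
```

Raising `max_rounds` lengthens the refinement thought, which is one way to read the paper's claim that reasoning can be dynamically scaled at test time.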