Trust Region Preference Approximation: A simple and stable reinforcement learning algorithm for LLM reasoning

📅 2025-04-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Reward-based optimization methods for large language models (LLMs) in reasoning tasks are vulnerable to reward hacking, while preference-based approaches (e.g., DPO) underperform proximal policy optimization (PPO) in reasoning capability and stability. Method: The paper proposes Trust Region Preference Approximation (TRPA), a reward-free framework within the RLHF paradigm that eliminates explicit reward modeling. TRPA introduces a rule-guided hierarchical preference construction mechanism and jointly optimizes policy updates under trust-region constraints and preference loss minimization. Contribution/Results: TRPA theoretically guarantees monotonic policy improvement while balancing human alignment and strong reasoning performance. Empirical evaluation across diverse reasoning benchmarks shows TRPA matches PPO's performance, significantly outperforms Online DPO, achieves greater training stability, and eliminates reward hacking.

📝 Abstract
Recently, Large Language Models (LLMs) have rapidly evolved, approaching Artificial General Intelligence (AGI) while benefiting from large-scale reinforcement learning to enhance Human Alignment (HA) and Reasoning. Recent reward-based optimization algorithms, such as Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO), have achieved significant performance on reasoning tasks, whereas preference-based optimization algorithms such as Direct Preference Optimization (DPO) significantly improve the performance of LLMs on human alignment. However, despite the strong performance of reward-based optimization methods in alignment tasks, they remain vulnerable to reward hacking. Furthermore, preference-based algorithms (such as Online DPO) have not yet matched the performance of reward-based optimization algorithms (like PPO) on reasoning tasks, making their exploration in this area a worthwhile pursuit. Motivated by these challenges, we propose the Trust Region Preference Approximation (TRPA) algorithm, which integrates rule-based optimization with preference-based optimization for reasoning tasks. As a preference-based algorithm, TRPA naturally eliminates the reward hacking issue. TRPA constructs preference levels using predefined rules, forms corresponding preference pairs, and leverages a novel optimization algorithm for RL training with a theoretical monotonic improvement guarantee. Experimental results demonstrate that TRPA not only achieves competitive performance on reasoning tasks but also exhibits robust stability. The code for this paper is released and being updated at https://github.com/XueruiSu/Trust-Region-Preference-Approximation.git.
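The pipeline the abstract describes (rule-based preference levels → preference pairs → trust-region-constrained preference loss) can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the rule functions, response fields, and the DPO-style loss with a reference-policy log-ratio (acting as an implicit trust-region penalty) are all assumptions made for clarity.

```python
import math

def preference_level(response):
    """Assign a preference level from predefined rules (higher is better).
    Illustrative rules: a correct answer outranks formatting alone."""
    level = 0
    if response["answer_correct"]:
        level += 2
    if response["format_valid"]:
        level += 1
    return level

def build_preference_pairs(responses):
    """Form (chosen, rejected) pairs from responses with distinct levels."""
    pairs = []
    for i, a in enumerate(responses):
        for b in responses[i + 1:]:
            la, lb = preference_level(a), preference_level(b)
            if la > lb:
                pairs.append((a, b))
            elif lb > la:
                pairs.append((b, a))
    return pairs

def trust_region_preference_loss(logp_chosen, logp_rejected,
                                 ref_logp_chosen, ref_logp_rejected,
                                 beta=0.1):
    """DPO-style pairwise loss: -log sigmoid of the beta-scaled margin of
    log-ratios against a frozen reference policy. The reference-policy
    terms keep updates close to the old policy, playing the role of a
    trust-region constraint."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

In this sketch, training would score each sampled response with the rules, build pairs across levels, and minimize the pairwise loss; widening the margin between chosen and rejected responses lowers the loss.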
Problem

Research questions and friction points this paper is trying to address.

Addresses reward hacking in LLM reinforcement learning
Improves preference-based optimization for reasoning tasks
Combines rule-based and preference-based optimization methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates rule-based and preference-based optimization
Uses predefined rules for preference levels
Ensures monotonic improvement with novel RL algorithm