Policy Optimization Prefers The Path of Least Resistance

📅 2025-10-22
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Prior work assumes chain-of-thought (CoT) reasoning must strictly adhere to a “reason-then-answer” format; this study instead investigates how policy optimization (PO) behaves under open-structured CoT, where reasoning and answer generation may interleave freely. Method: Through controlled experiments, reward decomposition analysis, and KL-regularized PO, we systematically examine strategy evolution across multiple models and algorithms. Contribution/Results: We demonstrate that PO inherently favors the reward-acquisition path of least resistance, causing explicit reasoning to collapse into direct answer generation, even when the complex CoT format receives a 4× reward bonus. This is the first work to reveal PO’s intrinsic simplification bias under open CoT structures, exposing a fundamental reward-hacking challenge for alignment training. Our findings provide both theoretical grounding and empirical evidence for designing robust reasoning-guidance mechanisms in large language models.
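
To make the reward setup concrete, here is a minimal sketch of a format-based reward decomposition in which the full <think><answer> format earns a 4× bonus over an answer-only completion. The tag names and the 4:1 weighting follow the summary above; the function and weight values are a hypothetical illustration, not the authors' implementation.

```python
import re

# Hypothetical format-reward decomposition (weights assumed, mirroring the 4x bonus).
W_THINK_ANSWER = 4.0  # full <think>...</think><answer>...</answer> format
W_ANSWER_ONLY = 1.0   # bare <answer>...</answer> shortcut

def format_reward(completion: str) -> float:
    """Return the format component of the reward for a single completion."""
    has_think = re.search(r"<think>.*?</think>", completion, re.DOTALL) is not None
    has_answer = re.search(r"<answer>.*?</answer>", completion, re.DOTALL) is not None
    if has_think and has_answer:
        return W_THINK_ANSWER  # 4x bonus for explicit reasoning plus an answer
    if has_answer:
        return W_ANSWER_ONLY   # the low-resistance shortcut
    return 0.0                 # malformed output earns nothing
```

Even with this 4× gap, the paper reports that PO collapses onto the answer-only shortcut, the path of least resistance.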

📝 Abstract
Policy optimization (PO) algorithms are used to refine Large Language Models for complex, multi-step reasoning. Current state-of-the-art pipelines enforce a strict think-then-answer format to elicit chain-of-thought (CoT); however, the behavior of PO when these rigid constraints are relaxed into an open-ended CoT structure remains an under-studied question. We investigate this gap with an extensive suite of controlled experiments and identify a consistent principle: policy optimization follows the path of least resistance. When afforded the flexibility to interleave reasoning and response, policy optimization consistently learns to discard explicit reasoning, causing the policy to degenerate to a direct <answer>-only format. This outcome holds across various models and algorithms. We find that this collapse in format persists even when the complex <think><answer> format is assigned up to 4× larger reward weights. We formalize this principle through a series of controlled reward decomposition experiments, demonstrating a clear hierarchy: PO systematically optimizes for the simplest reward component first, a preference that holds even when faced with mutually exclusive choices or strong incentives for more complex behaviors. Finally, we show that successful convergence on the high-reward shortcut is not a low-effort drift but is driven by an optimization process that requires the KL-regularized policy to have sufficient freedom to shift significantly from its initial prior. Our findings reveal that granting policies the freedom to diverge is a double-edged sword: while necessary for discovering high-reward shortcuts, it also creates a powerful incentive to game the simplest aspects of the reward function, posing a critical reward-hacking challenge for alignment.
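
For reference, the KL-regularized policy optimization objective the abstract refers to is conventionally written as follows; the notation (reference policy π_ref, KL coefficient β) is the standard RLHF-style formulation and is assumed here rather than taken verbatim from the paper.

```latex
\max_{\pi_\theta} \;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\!\left[ r(x, y) \right]
\;-\; \beta \, \mathbb{D}_{\mathrm{KL}}\!\left[ \pi_\theta(\cdot \mid x) \,\middle\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \right]
```

Under this objective, a small β leaves the policy free to move far from π_ref; the abstract's closing claim is that exactly this freedom is what lets the policy both discover and exploit the answer-only shortcut.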
Problem

Research questions and friction points this paper is trying to address.

Policy optimization discards reasoning when given open-ended chain-of-thought flexibility
Models prefer simple answer-only formats despite higher rewards for complex reasoning
Optimization systematically targets the easiest reward components, causing reward hacking
Innovation

Methods, ideas, or system contributions that make the work stand out.

Policy optimization prefers simplest reward paths
Open-ended CoT structure causes collapse into an answer-only format
KL-regularized policies need sufficient freedom to shift from the prior before converging on the shortcut (see the sketch below)
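
The third point can be pictured with the per-token KL penalty commonly used to shape rewards in RLHF-style policy optimization: the coefficient beta controls how much freedom the policy has to drift away from its reference prior. The shaping recipe below is the standard one and is assumed, not the paper's exact implementation.

```python
import torch

@torch.no_grad()
def kl_shaped_rewards(logprobs: torch.Tensor,
                      ref_logprobs: torch.Tensor,
                      task_rewards: torch.Tensor,
                      beta: float = 0.05) -> torch.Tensor:
    """Standard RLHF-style shaping: task reward minus a per-token KL penalty.

    logprobs, ref_logprobs: (batch, seq_len) token log-probs under the current
        policy and the frozen reference policy.
    task_rewards: (batch,) scalar reward per completion (e.g. the format reward).
    beta: KL coefficient; a smaller beta grants the policy more freedom to
        diverge from the reference, which the paper argues is what allows
        convergence on the high-reward, answer-only shortcut.
    """
    per_token_kl = logprobs - ref_logprobs   # sample estimate of log(pi / pi_ref)
    shaped = -beta * per_token_kl            # penalize divergence at every token
    shaped[:, -1] += task_rewards            # add the scalar reward at the final token
    return shaped
```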
Debdeep Sanyal
Undergraduate Student
Large Language Models · Reasoning · Unlearning · Planning · Reinforcement Learning
Aakash Sen Sharma
InvideoAI
Natural Language Processing · Diffusion Models · Responsible AI · AI Safety
Dhruv Kumar
BITS Pilani, Rajasthan, India
Saurabh Deshpande
Birla AI Labs, Germany
Murari Mandal
Kalinga Institute of Industrial Technology, Bhubaneswar, India