COPO: Consistency-Aware Policy Optimization

📅 2025-08-06
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
In rule-based reward RLHF, consistency across multiple sampled responses causes the advantage function to degenerate and gradients to vanish, severely hindering training efficiency and the performance of large language models on complex reasoning tasks. To address this, we propose Consistency-Aware Policy Optimization (COPO), a novel framework with three core contributions: (1) a global reward grounded in outcome consistency, preserving effective learning signals for high-consistency samples; (2) an entropy-driven soft blending mechanism that dynamically balances local exploration and global convergence; and (3) regularized advantage estimation integrated with policy gradient optimization. Extensive experiments on multiple mathematical reasoning benchmarks demonstrate significant improvements over strong baselines, validating COPO's robustness and generalizability across diverse reasoning tasks. The implementation is publicly available.

📝 Abstract
Reinforcement learning has significantly enhanced the reasoning capabilities of Large Language Models (LLMs) in complex problem-solving tasks. Recently, the introduction of DeepSeek R1 has inspired a surge of interest in leveraging rule-based rewards as a low-cost alternative for computing advantage functions and guiding policy optimization. However, a common challenge observed across many replication and extension efforts is that when multiple sampled responses under a single prompt converge to identical outcomes, whether correct or incorrect, the group-based advantage degenerates to zero. This leads to vanishing gradients and renders the corresponding samples ineffective for learning, ultimately limiting training efficiency and downstream performance. To address this issue, we propose a consistency-aware policy optimization framework that introduces a structured global reward based on outcome consistency. The global loss built on this reward ensures that, even when model outputs show high intra-group consistency, training still receives meaningful learning signals, encouraging the generation of correct and self-consistent reasoning paths from a global perspective. Furthermore, we incorporate an entropy-based soft blending mechanism that adaptively balances local advantage estimation with global optimization, enabling dynamic transitions between exploration and convergence throughout training. Our method introduces several key innovations in both reward design and optimization strategy. We validate its effectiveness through substantial performance gains on multiple mathematical reasoning benchmarks, highlighting the proposed framework's robustness and general applicability. Code of this work has been released at https://github.com/hijih/copo-code.git.
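The degeneration described in the abstract is easy to see concretely. The sketch below (illustrative only, not the paper's released code) computes a GRPO-style advantage by standardizing rule-based rewards within a sampled group: when all responses share the same outcome, every advantage collapses to zero and the group contributes no gradient.

```python
def group_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: reward standardized within the sample group."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    # eps keeps the division defined when all rewards are identical,
    # but the numerator (r - mean) is then zero anyway.
    return [(r - mean) / (std + eps) for r in rewards]

# Mixed outcomes (some correct, some not) give an informative signal.
mixed = group_advantages([1.0, 0.0, 1.0, 0.0])
# Fully consistent outcomes (all correct, or all wrong) give zero advantage,
# hence a vanishing policy gradient for the whole group.
consistent = group_advantages([1.0, 1.0, 1.0, 1.0])
```

This is precisely the failure mode the structured global reward is meant to compensate for: the samples a model should reinforce most (consistently correct reasoning) are exactly the ones that standard group normalization silences.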
Problem

Research questions and friction points this paper is trying to address.

Addresses vanishing gradients in group-based advantage functions
Ensures meaningful learning signals with high intra-group consistency
Balances local and global optimization via adaptive blending
Innovation

Methods, ideas, or system contributions that make the work stand out.

Consistency-aware policy optimization framework
Entropy-based soft blending mechanism
Structured global reward design
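The innovations above can be sketched together. The snippet below is a hedged illustration of an entropy-driven soft blend between a local group-based advantage and a global consistency reward; the specific weighting (normalized outcome entropy as the mixing coefficient) is an assumption for illustration, not the paper's exact formulation.

```python
import math

def outcome_entropy(outcomes):
    """Shannon entropy (bits) of the empirical outcome distribution."""
    n = len(outcomes)
    h = 0.0
    for o in set(outcomes):
        p = outcomes.count(o) / n
        h -= p * math.log2(p)
    return h

def blended_signal(local_adv, global_reward, outcomes):
    # High entropy (diverse outcomes) -> trust the local group advantage;
    # low entropy (consistent outcomes) -> fall back to the global reward,
    # so fully consistent groups still receive a learning signal.
    w = outcome_entropy(outcomes)  # in [0, 1] for binary correct/incorrect
    return [w * a + (1.0 - w) * global_reward for a in local_adv]
```

Under this scheme a fully consistent group (entropy 0) trains purely on the global consistency reward, while a maximally diverse group (entropy 1 for binary outcomes) trains purely on the local advantage, giving the exploration-to-convergence transition the summary describes.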
👥 Authors
Jinghang Han | College of Intelligent Robotics and Advanced Manufacturing, Fudan University
Jiawei Chen | College of Intelligent Robotics and Advanced Manufacturing, Fudan University; LiAuto Inc.
Hang Shao | Tencent Gvoice; Master, Shanghai Jiao Tong University
Hao Ma | LiAuto Inc.
Mingcheng Li | Fudan University
Xintian Shen | LiAuto Inc.
Lihao Zheng | LiAuto Inc.
Wei Chen | LiAuto Inc.
Tao Wei | LiAuto Inc.
Lihua Zhang | Wuhan University