ETR: Outcome-Guided Elastic Trust Regions for Policy Optimization

📅 2026-01-07
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses a structural limitation of the GRPO algorithm: its static trust region cannot adapt to the heterogeneous signal quality inherent in outcome-based learning, which leads to underutilization of high-value signals, insufficient noise suppression, and rapid policy entropy collapse. To overcome these issues, the authors propose an Elastic Trust Region (ETR) mechanism featuring a two-level adaptive strategy: at the micro level, clipping boundaries are dynamically adjusted based on advantage magnitudes; at the macro level, update budgets are allocated according to intra-group variance. This approach aligns policy update intensity with signal confidence without requiring an additional critic network. Evaluated on the AIME and MATH benchmarks, ETR significantly outperforms GRPO, achieving higher accuracy while mitigating policy entropy degradation and preserving long-term exploration capability.

📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an important paradigm for unlocking reasoning capabilities in large language models, exemplified by the success of OpenAI o1 and DeepSeek-R1. Currently, Group Relative Policy Optimization (GRPO) stands as the dominant algorithm in this domain due to its stable training and critic-free efficiency. However, we argue that GRPO suffers from a structural limitation: it imposes a uniform, static trust region constraint across all samples. This design implicitly assumes signal homogeneity, a premise misaligned with the heterogeneous nature of outcome-driven learning, where advantage magnitudes and variances fluctuate significantly. Consequently, static constraints fail to fully exploit high-quality signals while insufficiently suppressing noise, often precipitating rapid entropy collapse. To address this, we propose Elastic Trust Regions (ETR), a dynamic mechanism that aligns optimization constraints with signal quality. ETR constructs a signal-aware landscape through dual-level elasticity: at the micro level, it scales clipping boundaries based on advantage magnitude to accelerate learning from high-confidence paths; at the macro level, it leverages group variance to implicitly allocate larger update budgets to tasks in the optimal learning zone. Extensive experiments on AIME and MATH benchmarks demonstrate that ETR consistently outperforms GRPO, achieving superior accuracy while effectively mitigating policy entropy degradation to ensure sustained exploration.
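The dual-level elasticity described in the abstract can be sketched as a per-sample clip range inside a GRPO/PPO-style clipped surrogate. This is a minimal illustration only: the function names, the linear widening in `alpha`, and the variance-based shrinkage in `beta` are assumptions for exposition, not the paper's actual formulas.

```python
import math

def etr_clip_range(adv, group_var, base_eps=0.2, alpha=0.1, beta=0.5):
    """Hypothetical elastic clip range (coefficients are illustrative).

    Micro level: widen the trust region for large-|advantage| samples,
    so high-confidence signals can move the policy further.
    Macro level: shrink the update budget for high-variance groups,
    damping noisy outcome signals.
    """
    micro = base_eps * (1.0 + alpha * abs(adv))
    macro = 1.0 / (1.0 + beta * group_var)
    return micro * macro

def etr_surrogate(logp_new, logp_old, advs, group_var):
    """Clipped surrogate objective with a per-sample elastic clip range."""
    total = 0.0
    for ln, lo, a in zip(logp_new, logp_old, advs):
        ratio = math.exp(ln - lo)              # importance ratio pi_new / pi_old
        eps = etr_clip_range(a, group_var)     # sample-specific trust region
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
        total += min(ratio * a, clipped * a)   # pessimistic (clipped) objective
    return total / len(advs)
```

In standard GRPO the clip range `eps` would be a single constant; the sketch above only differs in making it a function of the advantage magnitude and the group's reward variance, matching the micro/macro split the abstract describes.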
Problem

Research questions and friction points this paper is trying to address.

Reinforcement Learning with Verifiable Rewards
trust region
policy optimization
entropy collapse
signal heterogeneity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Elastic Trust Regions
Reinforcement Learning with Verifiable Rewards
Dynamic Trust Region
Policy Optimization
Entropy Collapse Mitigation
Shijie Zhang
Qwen Applications Business Group, Alibaba Group
Kevin Zhang
Peking University
Zheyuan Gu
Institute of Information Engineering, Chinese Academy of Sciences
Xiang Guo
Yale
Rujun Guo
Qwen Applications Business Group, Alibaba Group
Shaoyu Liu
Qwen Applications Business Group, Alibaba Group
Guanjun Jiang
Qwen Applications Business Group, Alibaba Group
Xiaozhao Wang
Qwen Applications Business Group, Alibaba Group