GDEPO: Group Dual-dynamic and Equal-right Advantage Policy Optimization with Enhanced Training Data Utilization for Sample-Constrained Reinforcement Learning

📅 2026-01-11
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses inefficiencies in existing reinforcement learning methods for automated theorem proving, such as GRPO, where composite rewards that conflict with the binary feedback of formal verifiers, combined with static sampling strategies, lead to entire batches of invalid samples being discarded and poor data efficiency. To overcome these limitations, the authors propose GDEPO, which integrates dynamic resampling, sign-magnitude decoupled advantage estimation (the sign of the advantage is set by verifier correctness, while its magnitude is modulated by auxiliary rewards), and dynamically augmented gradient iterations. Together these mechanisms enable zero-waste data utilization and more stable policy optimization under sample-constrained conditions. Experimental results show that GDEPO significantly outperforms prior methods on MiniF2F-test, MathOlympiadBench, and PutnamBench, with ablation studies confirming both the individual efficacy and the synergy of its core components.
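The sign-magnitude decoupling described above can be sketched as follows. This is a hypothetical illustration, not the paper's code: the function name `equal_right_advantage`, the `[0.5, 1.0]` magnitude range, and the auxiliary-reward rescaling are all assumptions chosen so that auxiliary rewards can never flip the verifier's verdict.

```python
from typing import List

def equal_right_advantage(verified: List[bool],
                          aux_reward: List[float]) -> List[float]:
    """Per-sample advantages whose sign is fixed by the formal verifier."""
    assert len(verified) == len(aux_reward)
    # Sign: +1 if the formal verifier accepted the proof, else -1.
    signs = [1.0 if ok else -1.0 for ok in verified]
    # Magnitude: auxiliary rewards rescaled into [0.5, 1.0], a strictly
    # positive range, so they modulate step size but cannot change sign.
    lo, hi = min(aux_reward), max(aux_reward)
    span = (hi - lo) or 1.0
    mags = [0.5 + 0.5 * (r - lo) / span for r in aux_reward]
    return [s * m for s, m in zip(signs, mags)]

# Toy group of three candidate proofs: two verify, one fails.
adv = equal_right_advantage([True, False, True], [0.2, 0.9, 0.5])
# Every verified sample receives a positive advantage, every failed
# sample a negative one, regardless of its auxiliary reward.
```

Note how the failed proof keeps a negative advantage even though it has the highest auxiliary reward (0.9), which is exactly the conflict with relative advantage estimation that the summary describes.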

📝 Abstract
Automated Theorem Proving (ATP) represents a fundamental challenge in Artificial Intelligence (AI), requiring the construction of machine-verifiable proofs in formal languages such as Lean to evaluate AI reasoning capabilities. Reinforcement learning (RL), particularly the high-performance Group Relative Policy Optimization (GRPO) algorithm, has emerged as a mainstream approach for this task. However, in ATP scenarios, GRPO faces two critical issues: when composite rewards are used, its relative advantage estimation may conflict with the binary feedback from the formal verifier; meanwhile, its static sampling strategy may discard entire batches of data if no valid proof is found, resulting in zero contribution to model updates and significant data waste. To address these limitations, we propose Group Dual-dynamic and Equal-right-advantage Policy Optimization (GDEPO), a method incorporating three core mechanisms: 1) dynamic additional sampling, which resamples invalid batches until a valid proof is discovered; 2) equal-right advantage, decoupling the sign of the advantage function (based on correctness) from its magnitude (modulated by auxiliary rewards) to ensure stable and correct policy updates; and 3) dynamic additional iterations, applying extra gradient steps to initially failed but eventually successful samples to accelerate learning on challenging cases. Experiments conducted on three datasets of varying difficulty (MiniF2F-test, MathOlympiadBench, PutnamBench) confirm the effectiveness of GDEPO, while ablation studies validate the necessity of its synergistic components. The proposed method enhances data utilization and optimization efficiency, offering a novel training paradigm for ATP.
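Mechanism 1 (dynamic additional sampling) can be sketched as a resampling loop: instead of discarding a batch in which no candidate proof verifies, as a static sampler would, keep drawing additional batches until at least one valid proof appears or a resample budget runs out. The function names, the budget parameter, and the toy sampler/verifier below are all illustrative assumptions, not the paper's implementation.

```python
def dynamic_additional_sampling(sample_batch, verify, batch_size=4,
                                max_resamples=8):
    """Accumulate batches until one contains a verified proof.

    Returns (collected_samples, attempt_index) on success, or
    (collected_samples, None) if the resample budget is exhausted.
    """
    collected = []
    for attempt in range(1 + max_resamples):
        batch = sample_batch(batch_size)
        collected.extend(batch)
        if any(verify(p) for p in batch):
            return collected, attempt  # batch is usable for a policy update
    return collected, None  # budget spent; nothing verified

# Toy demo: stand-in "policy" emits numbered candidates; pretend that
# only candidates numbered 10 and above pass the formal checker.
draws = iter(range(100))
sample = lambda n: [next(draws) for _ in range(n)]
verify = lambda p: p >= 10
data, tries = dynamic_additional_sampling(sample, verify)
```

In the demo, the first two batches (candidates 0 through 7) contain no valid proof, so the loop resamples; the third batch includes candidates 10 and 11, terminating the loop with all twelve drawn samples available for the update rather than discarded.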
Problem

Research questions and friction points this paper is trying to address.

Automated Theorem Proving
Reinforcement Learning
Sample Efficiency
Policy Optimization
Data Utilization
Innovation

Methods, ideas, or system contributions that make the work stand out.

dynamic additional sampling
equal-right advantage
dynamic additional iterations
sample-constrained reinforcement learning
automated theorem proving
👥 Authors
Zhengqing Yan
State Key Laboratory of Engines, Tianjin University, 300354 Tianjin, China
Xinyan Liu
AI Business Center, Xinyan Group, 100080 Beijing, China
Yi Zhang
University of Electronic Science and Technology of China
Fan Guo
Los Alamos National Laboratory
Yao Liu
AI Business Center, Xinyan Group, 100080 Beijing, China
Junchen Wan
AI Business Center, Xinyan Group, 100080 Beijing, China
Kang Song
State Key Laboratory of Engines, Tianjin University, 300354 Tianjin, China