Truncated Proximal Policy Optimization

📅 2025-06-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the inefficiency and poor hardware utilization of Proximal Policy Optimization (PPO) when training large language models (LLMs) for long chain-of-thought (CoT) reasoning—caused by fully synchronous on-policy updates and lengthy sequence generation—this paper proposes Truncated Proximal Policy Optimization (T-PPO). Methodologically, it introduces: (1) Extended Generalized Advantage Estimation (EGAE), enabling robust advantage computation even for incomplete responses; (2) decoupled, independent optimization of the policy and value networks to eliminate redundant computation; and (3) prompt- and token-level selective filtering mechanisms. Crucially, T-PPO enables stable policy updates under response truncation—a first in PPO-based LLM alignment. Empirically, on the AIME 2024 benchmark, T-PPO accelerates training of a 32B-parameter model by up to 2.5× while achieving convergence performance superior to existing PPO variants.
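The core EGAE idea described above can be illustrated with a rough sketch (a hypothetical simplification, not the paper's actual implementation): when a response is truncated before completion, the GAE recursion bootstraps from the value estimate at the cut-off token rather than treating the sequence as terminated.

```python
import numpy as np

def truncated_gae(rewards, values, bootstrap_value, gamma=1.0, lam=0.95, done=False):
    """Sketch of GAE over a possibly truncated response.

    rewards, values: per-token rewards and value estimates for the
    generated (possibly incomplete) response.
    If the response was cut off (done=False), bootstrap from the value
    estimate at the truncation point instead of assuming a terminal state.
    """
    T = len(rewards)
    advantages = np.zeros(T)
    # Value after the last generated token: 0 if the response finished,
    # otherwise the critic's estimate at the truncation boundary.
    next_value = 0.0 if done else bootstrap_value
    gae = 0.0
    for t in reversed(range(T)):
        # Standard GAE recursion: delta_t = r_t + gamma * V_{t+1} - V_t
        delta = rewards[t] + gamma * next_value - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
        next_value = values[t]
    returns = advantages + np.asarray(values)
    return advantages, returns
```

With `done=False`, the bootstrap term lets advantages be computed for windows of a still-unfinished response, which is what allows policy updates to proceed without waiting for full rollouts.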

📝 Abstract
Recently, test-time scaling Large Language Models (LLMs) have demonstrated exceptional reasoning capabilities across scientific and professional tasks by generating long chains-of-thought (CoT). As a crucial component for developing these reasoning models, reinforcement learning (RL), exemplified by Proximal Policy Optimization (PPO) and its variants, allows models to learn through trial and error. However, PPO can be time-consuming due to its inherent on-policy nature, which is further exacerbated by increasing response lengths. In this work, we propose Truncated Proximal Policy Optimization (T-PPO), a novel extension to PPO that improves training efficiency by streamlining policy updates and length-restricted response generation. T-PPO mitigates the issue of low hardware utilization, an inherent drawback of fully synchronized long-generation procedures, where resources often sit idle while waiting for complete rollouts. Our contributions are twofold. First, we propose Extended Generalized Advantage Estimation (EGAE), which derives advantage estimates from incomplete responses while maintaining the integrity of policy learning. Second, we devise a computationally optimized mechanism that allows for the independent optimization of the policy and value models. By selectively filtering prompt and truncated tokens, this mechanism reduces redundant computation and accelerates training without sacrificing convergence performance. We demonstrate the effectiveness and efficiency of T-PPO on AIME 2024 with a 32B base model. The experimental results show that T-PPO improves the training efficiency of reasoning LLMs by up to 2.5x and outperforms its existing competitors.
Problem

Research questions and friction points this paper is trying to address.

Improves training efficiency of large language models
Reduces idle hardware time during long rollouts
Optimizes policy and value models separately
Innovation

Methods, ideas, or system contributions that make the work stand out.

Truncated PPO enhances training efficiency
Extended Generalized Advantage Estimation method
Independent policy and value model optimization
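The token-level selective filtering listed above can be sketched as a masked clipped-PPO objective (a minimal illustration under assumed semantics; the paper's exact filtering rules are not reproduced here): prompt tokens and any filtered-out tokens receive mask 0, so they contribute neither gradient nor averaging weight to the policy loss.

```python
import numpy as np

def masked_ppo_loss(logp_new, logp_old, advantages, mask, clip_eps=0.2):
    """Clipped PPO objective averaged over unmasked tokens only.

    mask: 1.0 for response tokens that should contribute to the policy
    loss, 0.0 for prompt tokens (and any selectively filtered tokens),
    so masked positions add nothing to the per-token average.
    """
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    adv = np.asarray(advantages)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    # Negated pessimistic bound: minimizing this loss maximizes the
    # standard clipped surrogate objective.
    per_token = -np.minimum(unclipped, clipped)
    m = np.asarray(mask, dtype=float)
    return float((per_token * m).sum() / max(m.sum(), 1.0))
```

In a real training loop the value model would be updated separately (and, per the summary, asynchronously) on its own objective, which is the decoupling that removes redundant joint computation.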