Constrained Group Relative Policy Optimization

📅 2026-02-05
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge that actor-only policy learning methods such as GRPO struggle to handle explicit behavioral constraints. The authors propose Constrained GRPO, which models constraints as indicator cost functions and incorporates them via Lagrangian relaxation. Because normalizing reward and cost advantages separately lets mismatched component-wise standard deviations distort the Lagrange multipliers and cause constraint violations, the method instead scalarizes reward and constraint signals into a single advantage before normalization, preserving the intended trade-off. Experimental results on grid-world and robotic tasks demonstrate that the proposed method not only improves task success rates but also significantly enhances constraint satisfaction.
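The scalarization idea described above can be sketched in a few lines. This is a minimal illustration based on the summary, not the paper's code; the function names, the `eps` stabilizer, and the exact group statistics are assumptions. The key point is the order of operations: combining reward and cost *before* group normalization keeps the Lagrange multiplier's trade-off intact, whereas normalizing each component by its own standard deviation implicitly rescales the multiplier.

```python
import numpy as np

def scalarized_group_advantage(rewards, costs, lam, eps=1e-8):
    """Combine reward and indicator cost BEFORE group normalization,
    so the Lagrange multiplier's trade-off survives standardization."""
    s = rewards - lam * costs            # scalarized return per rollout
    return (s - s.mean()) / (s.std() + eps)

def naive_group_advantage(rewards, costs, lam, eps=1e-8):
    """Naive per-component normalization: each term is divided by its
    own std, which effectively rescales lam by sigma_r / sigma_c and
    corrupts the Lagrangian signal."""
    a_r = (rewards - rewards.mean()) / (rewards.std() + eps)
    a_c = (costs - costs.mean()) / (costs.std() + eps)
    return a_r - lam * a_c
```

On the same group of rollouts, the two constructions generally produce different advantages whenever the reward and cost standard deviations differ, which is exactly the pathology the paper attributes to the naive multi-component treatment.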

📝 Abstract
While Group Relative Policy Optimization (GRPO) has emerged as a scalable framework for critic-free policy learning, extending it to settings with explicit behavioral constraints remains underexplored. We introduce Constrained GRPO, a Lagrangian-based extension of GRPO for constrained policy optimization. Constraints are specified via indicator cost functions, enabling direct optimization of violation rates through a Lagrangian relaxation. We show that a naive multi-component treatment in advantage estimation can break constrained learning: mismatched component-wise standard deviations distort the relative importance of the different objective terms, which in turn corrupts the Lagrangian signal and prevents meaningful constraint enforcement. We formally derive this effect to motivate our scalarized advantage construction that preserves the intended trade-off between reward and constraint terms. Experiments in a toy gridworld confirm the predicted optimization pathology and demonstrate that scalarizing advantages restores stable constraint control. In addition, we evaluate Constrained GRPO on robotics tasks, where it improves constraint satisfaction while increasing task success, establishing a simple and effective recipe for constrained policy optimization in embodied AI domains that increasingly rely on large multimodal foundation models.
Problem

Research questions and friction points this paper is trying to address.

Constrained Policy Optimization
Group Relative Policy Optimization
Behavioral Constraints
Lagrangian Relaxation
Embodied AI
Innovation

Methods, ideas, or system contributions that make the work stand out.

Constrained Policy Optimization
Group Relative Policy Optimization
Lagrangian Relaxation
Scalarized Advantage
Critic-Free RL