🤖 AI Summary
Traditional reinforcement learning (RL) rests on expected utility theory, which makes it a poor model of the risk preferences humans actually exhibit, since these systematically deviate from rational expectation-maximization. This paper integrates Cumulative Prospect Theory (CPT) into the RL framework and derives the first policy gradient theorem under CPT objectives, departing from the classical expectation-based formulation. Building on this theorem, the authors propose a model-free, scalable CPT policy gradient algorithm that requires no model of the environment dynamics and handles high-dimensional state spaces. In simulation, it substantially outperforms existing zeroth-order CPT-RL methods. Key contributions: (1) establishing theoretical foundations for CPT-RL and deriving its policy gradient update rule; (2) presenting the first general-purpose, differentiable, and scalable CPT-based policy optimization algorithm; and (3) taking a step toward behaviorally aligned, trustworthy RL that balances theoretical rigor with practical deployability.
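To make the CPT objective concrete, here is a minimal sketch of an empirical CPT-value estimator for a sample of returns, assuming the standard Tversky–Kahneman utilities and probability weighting function with their commonly cited parameters (α = β = 0.88, λ = 2.25, γ⁺ = 0.61, γ⁻ = 0.69). The function names and this particular quantile-based estimator are illustrative; the paper's exact parameterization may differ.

```python
import numpy as np

def tk_weight(p, gamma):
    """Tversky-Kahneman inverse-S probability weighting function."""
    return p**gamma / (p**gamma + (1 - p)**gamma) ** (1 / gamma)

def cpt_value(samples, alpha=0.88, beta=0.88, lam=2.25,
              gamma_gain=0.61, gamma_loss=0.69):
    """Empirical CPT value of a sample of returns.

    Gains and losses are transformed by separate power utilities
    (with loss aversion lam) and reweighted by rank-dependent
    increments of the weighting function applied to tail probabilities.
    """
    x = np.sort(np.asarray(samples, dtype=float))
    n = len(x)
    ranks = np.arange(1, n + 1)
    # Gains: decision weights from the upper-tail probabilities.
    w_hi = (tk_weight((n + 1 - ranks) / n, gamma_gain)
            - tk_weight((n - ranks) / n, gamma_gain))
    gains = np.where(x > 0, np.maximum(x, 0) ** alpha, 0.0) * w_hi
    # Losses: decision weights from the lower-tail probabilities,
    # scaled by the loss-aversion coefficient lam.
    w_lo = (tk_weight(ranks / n, gamma_loss)
            - tk_weight((ranks - 1) / n, gamma_loss))
    losses = np.where(x < 0, lam * (-np.minimum(x, 0)) ** beta, 0.0) * w_lo
    return float(gains.sum() - losses.sum())
```

With these parameters, a symmetric ±10 gamble gets a negative CPT value (loss aversion), while sure gains stay positive — exactly the behavioral asymmetry that an expected-return objective cannot express.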
📝 Abstract
Classical reinforcement learning (RL) typically assumes rational decision-making based on expected utility theory. However, this model has been shown to be empirically inconsistent with actual human preferences, as evidenced in psychology and behavioral economics. Cumulative Prospect Theory (CPT) provides a more nuanced model of human decision-making, capturing diverse attitudes and perceptions toward risk, gains, and losses. While prior work has integrated CPT with RL to solve a CPT policy optimization problem, the understanding and practical impact of this formulation remain limited. We revisit the CPT-RL framework, offering new theoretical insights into the nature of optimal policies. We further derive a novel policy gradient theorem for CPT objectives, generalizing the foundational result in standard RL. Building on this theorem, we design a model-free policy gradient algorithm for solving the CPT-RL problem and demonstrate its performance through simulations. Notably, our algorithm scales better to larger state spaces compared to existing zeroth-order methods. This work advances the integration of behavioral decision-making into RL.
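For context, the foundational result the abstract refers to is the classical policy gradient theorem, and the quantity that replaces the expected return under CPT is the standard Tversky–Kahneman functional. The notation below is generic textbook notation, not necessarily the paper's:

```latex
% Classical policy gradient theorem (REINFORCE form):
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\Big[
      \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t
    \Big]

% CPT functional of a random return X, replacing \mathbb{E}[X]:
C(X) = \int_0^{\infty} w^{+}\!\big(\mathbb{P}(u^{+}(X) > t)\big)\,dt
     - \int_0^{\infty} w^{-}\!\big(\mathbb{P}(u^{-}(X) > t)\big)\,dt
```

Here $u^{+}, u^{-}$ are utility functions on gains and losses and $w^{+}, w^{-}$ are probability weighting functions. Because $C(X)$ is a nonlinear functional of the whole return distribution rather than a linear expectation, the interchange of gradient and expectation used in the classical proof no longer applies directly, which is why a dedicated CPT policy gradient theorem is needed.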