🤖 AI Summary
Traditional reinforcement learning (RL) rests on expected utility theory, which makes it a poor model of the risk preferences humans actually exhibit, since these systematically deviate from rational expectation-maximization. This paper integrates Cumulative Prospect Theory (CPT) into the RL framework and derives the first policy gradient theorem under CPT objectives, departing from the classical expectation-based formulation. Building on this theorem, the authors propose a model-free, scalable CPT policy gradient algorithm that requires no model of the environment dynamics and handles high-dimensional state spaces. In simulation, it substantially outperforms existing zeroth-order CPT-RL methods. Key contributions: (1) establishing theoretical foundations for CPT-RL and deriving its policy gradient update rule; (2) presenting the first general-purpose, differentiable, and scalable CPT-based policy optimization algorithm; and (3) taking a step toward behaviorally aligned, trustworthy RL that balances theoretical rigor with practical deployability.
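To make the CPT objective concrete, here is a minimal sketch of an empirical CPT-value estimator for a sample of returns, assuming the standard Tversky–Kahneman utilities and probability weighting function with their commonly cited parameters (α = β = 0.88, λ = 2.25, γ⁺ = 0.61, γ⁻ = 0.69). The function names and this particular quantile-based estimator are illustrative; the paper's exact parameterization may differ.

```python
import numpy as np

def tk_weight(p, gamma):
    """Tversky-Kahneman inverse-S probability weighting function."""
    return p**gamma / (p**gamma + (1 - p)**gamma) ** (1 / gamma)

def cpt_value(samples, alpha=0.88, beta=0.88, lam=2.25,
              gamma_gain=0.61, gamma_loss=0.69):
    """Empirical CPT value of a sample of returns.

    Gains and losses are transformed by separate power utilities
    (with loss aversion lam) and reweighted by rank-dependent
    increments of the weighting function applied to tail probabilities.
    """
    x = np.sort(np.asarray(samples, dtype=float))
    n = len(x)
    ranks = np.arange(1, n + 1)
    # Gains: decision weights from the upper-tail probabilities.
    w_hi = (tk_weight((n + 1 - ranks) / n, gamma_gain)
            - tk_weight((n - ranks) / n, gamma_gain))
    gains = np.where(x > 0, np.maximum(x, 0) ** alpha, 0.0) * w_hi
    # Losses: decision weights from the lower-tail probabilities,
    # scaled by the loss-aversion coefficient lam.
    w_lo = (tk_weight(ranks / n, gamma_loss)
            - tk_weight((ranks - 1) / n, gamma_loss))
    losses = np.where(x < 0, lam * (-np.minimum(x, 0)) ** beta, 0.0) * w_lo
    return float(gains.sum() - losses.sum())
```

With these parameters, a symmetric ±10 gamble gets a negative CPT value (loss aversion), while sure gains stay positive — exactly the behavioral asymmetry that an expected-return objective cannot express.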
📝 Abstract
Classical reinforcement learning (RL) typically assumes rational decision-making based on expected utility theory. However, this model has been shown to be empirically inconsistent with actual human preferences, as evidenced in psychology and behavioral economics. Cumulative Prospect Theory (CPT) provides a more nuanced model of human decision-making, capturing diverse attitudes and perceptions toward risk, gains, and losses. While prior work has integrated CPT with RL to solve a CPT policy optimization problem, the understanding and practical impact of this formulation remain limited. We revisit the CPT-RL framework, offering new theoretical insights into the nature of optimal policies. We further derive a novel policy gradient theorem for CPT objectives, generalizing the foundational result in standard RL. Building on this theorem, we design a model-free policy gradient algorithm for solving the CPT-RL problem and demonstrate its performance through simulations. Notably, our algorithm scales better to larger state spaces compared to existing zeroth-order methods. This work advances the integration of behavioral decision-making into RL.
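For context, the foundational result the abstract refers to is the classical policy gradient theorem, and the quantity that replaces the expected return under CPT is the standard Tversky–Kahneman functional. The notation below is generic textbook notation, not necessarily the paper's:

```latex
% Classical policy gradient theorem (REINFORCE form):
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\Big[
      \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t
    \Big]

% CPT functional of a random return X, replacing \mathbb{E}[X]:
C(X) = \int_0^{\infty} w^{+}\!\big(\mathbb{P}(u^{+}(X) > t)\big)\,dt
     - \int_0^{\infty} w^{-}\!\big(\mathbb{P}(u^{-}(X) > t)\big)\,dt
```

Here $u^{+}, u^{-}$ are utility functions on gains and losses and $w^{+}, w^{-}$ are probability weighting functions. Because $C(X)$ is a nonlinear functional of the whole return distribution rather than a linear expectation, the interchange of gradient and expectation used in the classical proof no longer applies directly, which is why a dedicated CPT policy gradient theorem is needed.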