🤖 AI Summary
In reinforcement learning, designing reward functions that capture complex objectives is difficult and often leads to poor behavioral alignment. This paper proposes a paradigm that achieves flexible behavioral alignment without modifying the original reward function: instead, one selects an appropriate recursive reward aggregation operator. Taking an algebraic perspective, the authors generalize Markov decision processes into a unified framework supporting diverse aggregation operators (e.g., discounted max, Sharpe ratio), under which the Bellman equations adapt naturally while remaining compatible with both value-based and actor-critic methods, in deterministic and stochastic environments alike. Theoretically, the framework preserves policy iteration convergence guarantees under mild conditions. Empirically, the approach effectively optimizes heterogeneous complex objectives, including risk-sensitive control, extremum-oriented planning, and long-horizon robustness, demonstrating broad applicability across benchmark domains.
📝 Abstract
In reinforcement learning (RL), aligning agent behavior with specific objectives typically requires careful design of the reward function, which can be challenging when the desired objectives are complex. In this work, we propose an alternative approach for flexible behavior alignment that eliminates the need to modify the reward function by selecting appropriate reward aggregation functions. By introducing an algebraic perspective on Markov decision processes (MDPs), we show that the Bellman equations naturally emerge from the recursive generation and aggregation of rewards, allowing for the generalization of the standard discounted sum to other recursive aggregations, such as discounted max and Sharpe ratio. Our approach applies to both deterministic and stochastic settings and integrates seamlessly with value-based and actor-critic algorithms. Experimental results demonstrate that our approach effectively optimizes diverse objectives, highlighting its versatility and potential for real-world applications.
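To make the core idea concrete, here is a minimal sketch (not the paper's implementation) of value iteration on a tiny deterministic MDP where the Bellman backup's reward-aggregation step is a pluggable operator: `discounted_sum` recovers the standard Bellman equation, while `discounted_max` instead optimizes the discounted maximum reward along a trajectory. The MDP layout, state/action names, and function names are invented for illustration only.

```python
GAMMA = 0.9

# Toy deterministic MDP: transitions[s][a] = (reward, next_state); state 2 absorbs.
TRANSITIONS = {
    0: {"left": (1.0, 0), "right": (0.0, 1)},
    1: {"left": (0.0, 0), "right": (5.0, 2)},
    2: {"stay": (0.0, 2)},
}

def discounted_sum(r, v_next):
    # Standard Bellman aggregation: total discounted return.
    return r + GAMMA * v_next

def discounted_max(r, v_next):
    # Alternative aggregation: best single (discounted) reward on the trajectory.
    return max(r, GAMMA * v_next)

def value_iteration(aggregate, iters=200):
    """Generalized Bellman backup: only the aggregation operator changes."""
    v = {s: 0.0 for s in TRANSITIONS}
    for _ in range(iters):
        v = {s: max(aggregate(r, v[s2]) for r, s2 in acts.values())
             for s, acts in TRANSITIONS.items()}
    return v

v_sum = value_iteration(discounted_sum)   # optimal discounted-sum values
v_max = value_iteration(discounted_max)   # optimal discounted-max values
```

Swapping the operator changes the optimal behavior without touching the rewards: under `discounted_sum` the agent prefers repeatedly collecting the small reward at state 0, whereas under `discounted_max` it heads for the one-off large reward at state 1.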