🤖 AI Summary
This work addresses a fundamental limitation in multi-objective reinforcement learning (MORL): scalar rewards cannot represent every preference structure. We propose a formal axiomatic framework for lexicographic Markov decision processes (Lex-MDPs), grounded in Hausner’s lexicographic utility theory and an axiomatization of preferences under a memorylessness assumption. First, we identify a simple, practical condition under which preferences are *not* scalar-reward-expressible. Second, we fully characterize the structure of lexicographic reward functions in the two-dimensional and general $d$-dimensional settings. Third, we show that optimal policies in Lex-MDPs are guaranteed to exist, can be taken to be deterministic, and are computable via value iteration; these properties fail to hold in constrained MDPs (CMDPs). Collectively, these results delineate the theoretical boundaries of scalar reward modeling and provide a rigorous, interpretable foundation for reward design in safety-critical MORL applications.
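To make the value-iteration claim concrete, below is a minimal sketch of vector-valued value iteration with lexicographic action selection on a randomly generated toy MDP. The toy MDP, the variable names, and the per-state lexicographic argmax are illustrative assumptions; the paper's actual algorithm and its optimality guarantees may differ.

```python
# Minimal sketch (assumed, not the paper's algorithm): value iteration with
# 2-dimensional rewards compared lexicographically on a random toy MDP.
import numpy as np

n_states, n_actions, d = 3, 2, 2   # toy sizes; d = reward dimension
gamma = 0.9

rng = np.random.default_rng(0)
# P[s, a] is a distribution over next states; R[s, a] is a d-dimensional reward.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.normal(size=(n_states, n_actions, d))

V = np.zeros((n_states, d))        # vector-valued state values
for _ in range(200):
    # Q[s, a] = R[s, a] + gamma * E_{s'}[V[s']], computed componentwise.
    Q = R + gamma * np.einsum("san,nk->sak", P, V)
    # Lexicographic argmax over actions: np.lexsort sorts by its last key first,
    # so component 0 dominates and component 1 only breaks ties.
    greedy = np.array([np.lexsort((Q[s, :, 1], Q[s, :, 0]))[-1] for s in range(n_states)])
    V = Q[np.arange(n_states), greedy]

print("greedy lexicographic policy:", greedy)
print("value vectors:\n", V)
```

In this sketch the first reward component acts as the strict priority (e.g., a safety objective) and the second is consulted only to break ties, which is the intuition behind lexicographic reward design.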
📝 Abstract
Recent work has formalized the reward hypothesis through the lens of expected utility theory by interpreting reward as utility. Hausner's foundational work showed that dropping the continuity axiom leads to a generalization of expected utility theory in which utilities are lexicographically ordered vectors of arbitrary dimension. In this paper, we extend this result by identifying a simple and practical condition under which preferences cannot be represented by scalar rewards, necessitating a 2-dimensional reward function. We provide a full characterization of such reward functions, as well as of the general d-dimensional case, in Markov Decision Processes (MDPs) under a memorylessness assumption on preferences. Furthermore, we show that optimal policies in this setting retain many desirable properties of their scalar-reward counterparts, while in the Constrained MDP (CMDP) setting -- another common multi-objective setting -- they do not.
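For reference, one standard way to write the lexicographic order on $d$-dimensional utility vectors (our notation, not necessarily the paper's) is

$$
u \succ_{\mathrm{lex}} v \;\iff\; \exists\, k \in \{1,\dots,d\} \ \text{such that } u_k > v_k \ \text{and } u_i = v_i \ \text{for all } i < k,
$$

so higher-priority components are compared first and later components matter only when all earlier ones are tied. In the 2-dimensional case highlighted above, this reduces to comparing first components and breaking ties with the second.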