🤖 AI Summary
This work addresses a fundamental limitation in multi-objective reinforcement learning (MORL): scalar rewards cannot represent every preference structure. We propose a formal axiomatic framework for lexicographic Markov decision processes (Lex-MDPs), grounded in Hausner’s lexicographic utility theory and an axiomatization of preferences under a memorylessness assumption. First, we identify a simple, practical condition under which preferences are *not* scalar-reward-expressible. Second, we fully characterize the structure of lexicographic reward functions in the two-dimensional and general $d$-dimensional settings. Third, we show that optimal policies in Lex-MDPs are guaranteed to exist, can be taken to be deterministic, and are computable via value iteration; these properties fail to hold in constrained MDPs (CMDPs). Collectively, these results delineate the theoretical boundaries of scalar reward modeling and provide a rigorous, interpretable foundation for reward design in safety-critical MORL applications.
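To make the value-iteration claim concrete, below is a minimal sketch of vector-valued value iteration with lexicographic action selection on a randomly generated toy MDP. The toy MDP, the variable names, and the per-state lexicographic argmax are illustrative assumptions; the paper's actual algorithm and its optimality guarantees may differ.

```python
# Minimal sketch (assumed, not the paper's algorithm): value iteration with
# 2-dimensional rewards compared lexicographically on a random toy MDP.
import numpy as np

n_states, n_actions, d = 3, 2, 2   # toy sizes; d = reward dimension
gamma = 0.9

rng = np.random.default_rng(0)
# P[s, a] is a distribution over next states; R[s, a] is a d-dimensional reward.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.normal(size=(n_states, n_actions, d))

V = np.zeros((n_states, d))        # vector-valued state values
for _ in range(200):
    # Q[s, a] = R[s, a] + gamma * E_{s'}[V[s']], computed componentwise.
    Q = R + gamma * np.einsum("san,nk->sak", P, V)
    # Lexicographic argmax over actions: np.lexsort sorts by its last key first,
    # so component 0 dominates and component 1 only breaks ties.
    greedy = np.array([np.lexsort((Q[s, :, 1], Q[s, :, 0]))[-1] for s in range(n_states)])
    V = Q[np.arange(n_states), greedy]

print("greedy lexicographic policy:", greedy)
print("value vectors:\n", V)
```

In this sketch the first reward component acts as the strict priority (e.g., a safety objective) and the second is consulted only to break ties, which is the intuition behind lexicographic reward design.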
📝 Abstract
Recent work has formalized the reward hypothesis through the lens of expected utility theory by interpreting reward as utility. Hausner's foundational work showed that dropping the continuity axiom leads to a generalization of expected utility theory in which utilities are lexicographically ordered vectors of arbitrary dimension. In this paper, we extend this result by identifying a simple and practical condition under which preferences cannot be represented by scalar rewards, necessitating a 2-dimensional reward function. We provide a full characterization of such reward functions, as well as of the general d-dimensional case, in Markov Decision Processes (MDPs) under a memorylessness assumption on preferences. Furthermore, we show that optimal policies in this setting retain many desirable properties of their scalar-reward counterparts, while in the Constrained MDP (CMDP) setting -- another common multi-objective setting -- they do not.
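For reference, one standard way to write the lexicographic order on $d$-dimensional utility vectors (our notation, not necessarily the paper's) is

$$
u \succ_{\mathrm{lex}} v \;\iff\; \exists\, k \in \{1,\dots,d\} \ \text{such that } u_k > v_k \ \text{and } u_i = v_i \ \text{for all } i < k,
$$

so higher-priority components are compared first and later components matter only when all earlier ones are tied. In the 2-dimensional case highlighted above, this reduces to comparing first components and breaking ties with the second.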