Beyond Scalar Rewards: An Axiomatic Framework for Lexicographic MDPs

📅 2025-05-17
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses a fundamental limitation in multi-objective reinforcement learning (MORL): scalar rewards cannot fully represent complex preference structures. We propose a formal axiomatic framework for lexicographic Markov decision processes (Lex-MDPs), grounded in Hausner's lexicographic utility theory and a preference axiomatization under memoryless assumptions. First, we derive practical necessary and sufficient conditions for identifying when preferences are *not* scalar-reward-expressible. Second, we fully characterize the structure of lexicographic reward functions in the two-dimensional and general $d$-dimensional settings. Third, we prove that Lex-MDPs admit optimal policies that exist, are deterministic, and are computable via value iteration—properties that fail to hold in constrained MDPs (CMDPs). Collectively, these results delineate the theoretical boundaries of scalar reward modeling and provide a rigorous, interpretable foundation for reward design in safety-critical MORL applications.

📝 Abstract
Recent work has formalized the reward hypothesis through the lens of expected utility theory, by interpreting reward as utility. Hausner's foundational work showed that dropping the continuity axiom leads to a generalization of expected utility theory where utilities are lexicographically ordered vectors of arbitrary dimension. In this paper, we extend this result by identifying a simple and practical condition under which preferences cannot be represented by scalar rewards, necessitating a 2-dimensional reward function. We provide a full characterization of such reward functions, as well as the general d-dimensional case, in Markov Decision Processes (MDPs) under a memorylessness assumption on preferences. Furthermore, we show that optimal policies in this setting retain many desirable properties of their scalar-reward counterparts, while in the Constrained MDP (CMDP) setting -- another common multiobjective setting -- they do not.
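The abstract's claim that Lex-MDP optimal policies are computable via value iteration can be illustrated with a minimal sketch: backward induction with vector-valued rewards, where the max over actions uses lexicographic comparison so the first reward component strictly dominates the second. The toy MDP and all names below are illustrative assumptions, not taken from the paper.

```python
def lex_value_iteration(states, actions, P, R, horizon):
    """Finite-horizon backward induction with 2-dimensional rewards.

    Q(s, a) = R(s, a) + E_{s'}[V(s')], computed componentwise;
    the max over actions compares vectors lexicographically
    (Python tuples already order this way), so the second
    component only breaks ties in the first.
    """
    V = {s: (0.0, 0.0) for s in states}
    for _ in range(horizon):
        V_new = {}
        for s in states:
            q_values = []
            for a in actions:
                # Expected successor value, component by component.
                exp0 = sum(p * V[s2][0] for s2, p in P[s][a].items())
                exp1 = sum(p * V[s2][1] for s2, p in P[s][a].items())
                q_values.append((R[s][a][0] + exp0, R[s][a][1] + exp1))
            V_new[s] = max(q_values)  # lexicographic max over actions
        V = V_new
    return V

# Toy 1-state MDP: "safe" is lexicographically preferred even though
# "risky" earns more on the secondary objective.
states = ["s"]
actions = ["safe", "risky"]
P = {"s": {"safe": {"s": 1.0}, "risky": {"s": 1.0}}}
R = {"s": {"safe": (1.0, 0.0), "risky": (0.0, 5.0)}}
V = lex_value_iteration(states, actions, P, R, horizon=3)
print(V["s"])  # (3.0, 0.0): the primary objective dominates at every step
```

Note the design choice this sketch inherits from the paper's setting: preferences are compared after taking expectations componentwise, which is exactly what Hausner-style lexicographic utilities license.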
Problem

Research questions and friction points this paper is trying to address.

When can preferences over policies not be represented by any scalar reward function?
What structure must multi-dimensional (lexicographic) reward functions take in MDPs?
Do optimal policies under lexicographic rewards retain the guarantees of scalar-reward MDPs, and how do they compare to CMDPs?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extends expected utility theory to MDPs with lexicographically ordered vector rewards
Fully characterizes 2-dimensional and general d-dimensional lexicographic reward functions
Proves optimal policies in Lex-MDPs exist, are deterministic, and are computable via value iteration