🤖 AI Summary
This work addresses policy optimization for Markov decision processes (MDPs) with unknown transition probabilities. We introduce a geometric normalization perspective: a family of value-function transformations that preserve the advantage of every action under any policy, thereby establishing a reward-balancing framework for computing optimal policies. Iterating these transformations yields a class of sampling-based reward-balancing algorithms. Theoretically, the approach improves upon state-of-the-art sample-complexity bounds without requiring prior knowledge of the model. Empirically, it significantly improves convergence speed and policy robustness while allowing a near-optimal policy to be extracted directly, without auxiliary re-scaling. Our core contribution is the first characterization of MDPs as advantage-invariant geometric structures, which improves both sample efficiency and theoretical guarantees.
📝 Abstract
We present a new geometric interpretation of Markov Decision Processes (MDPs) with a natural normalization procedure that allows us to adjust the value function at each state without altering the advantage of any action with respect to any policy. This advantage-preserving transformation of the MDP motivates a class of algorithms, which we call Reward Balancing, that solve MDPs by iterating these transformations until an approximately optimal policy can be trivially found. We provide a convergence analysis of several algorithms in this class, in particular showing that for MDPs with unknown transition probabilities we can improve upon state-of-the-art sample complexity results.
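To make the advantage-preserving idea concrete, the sketch below illustrates the closest classical analogue: shifting rewards by a per-state potential, which changes every Q-value by a state-dependent constant and therefore leaves all action advantages unchanged. This is a minimal numerical illustration of the general principle, not the paper's specific Reward Balancing construction; the toy MDP (`P`, `R`, `gamma`) and the potential `phi` are made-up values for demonstration.

```python
import numpy as np

# Toy 2-state, 2-action MDP (hypothetical numbers, illustration only).
gamma = 0.9
P = np.array([  # P[s, a, s']: transition probabilities
    [[0.8, 0.2], [0.1, 0.9]],
    [[0.5, 0.5], [0.9, 0.1]],
])
R = np.array([  # R[s, a]: expected immediate reward
    [1.0, 0.0],
    [0.5, 2.0],
])

def q_values(P, R, gamma, iters=2000):
    """Q-value iteration for the optimal Q-function."""
    Q = np.zeros_like(R)
    for _ in range(iters):
        V = Q.max(axis=1)          # greedy value at each state
        Q = R + gamma * (P @ V)    # Bellman optimality backup
    return Q

def advantages(Q):
    # Advantage of each action relative to the best action in its state.
    return Q - Q.max(axis=1, keepdims=True)

# Per-state potential shift: r'(s,a) = r(s,a) + gamma * E[phi(s')] - phi(s).
# This changes Q*(s,a) by exactly -phi(s), so advantages are untouched.
phi = np.array([3.0, -1.0])  # arbitrary per-state adjustment
R_shaped = R + gamma * (P @ phi) - phi[:, None]

A = advantages(q_values(P, R, gamma))
A_shaped = advantages(q_values(P, R_shaped, gamma))
print(np.allclose(A, A_shaped))  # advantages are preserved
```

Once every action's advantage is zero (or nearly so) at every state under such transformations, a greedy policy can be read off directly, which is the "trivially found" endpoint the abstract describes.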