AI Summary
This work addresses the lack of a rigorous metric foundation for measuring state similarity across multiple MDPs. It proposes the Generalized Bisimulation Metric (GBSM), presented as the first metric framework for arbitrary pairs of MDPs that is proven to satisfy three properties: symmetry, an inter-MDP triangle inequality, and a distance bound on identical spaces. The analysis draws on metric space theory, Lipschitz analysis of value function perturbations, and probabilistic convergence tools. Building on these properties, the authors derive explicit bounds for policy transfer, state aggregation, and sampling-based estimation across MDPs that are strictly tighter than those obtained from the standard bisimulation metric (BSM), along with a closed-form sample complexity for estimation that improves on existing asymptotic results. Numerical results validate the theoretical findings and demonstrate the effectiveness of GBSM in multi-MDP scenarios.
Abstract
The bisimulation metric (BSM) is a powerful tool for analyzing state similarity within a Markov decision process (MDP): states that are closer under BSM have more similar optimal value functions. While BSM has been used successfully in reinforcement learning (RL) for tasks such as state representation learning and policy exploration, applying it to state similarity between multiple MDPs remains challenging. Prior work has attempted to extend BSM to pairs of MDPs, but the lack of well-established mathematical properties has limited further theoretical analysis across MDPs. In this work, we formally establish a generalized bisimulation metric (GBSM) for measuring state similarity between arbitrary pairs of MDPs and rigorously prove that it satisfies three fundamental metric properties: symmetry, an inter-MDP triangle inequality, and a distance bound on identical spaces. Leveraging these properties, we theoretically analyze policy transfer, state aggregation, and sampling-based estimation across MDPs, obtaining explicit bounds that are strictly tighter than existing ones derived from the standard BSM. Additionally, GBSM provides a closed-form sample complexity for estimation, improving upon existing asymptotic results based on BSM. Numerical results validate our theoretical findings and demonstrate the effectiveness of GBSM in multi-MDP scenarios.
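To make the opening claim concrete, the sketch below illustrates the standard single-MDP BSM, not the GBSM construction itself. It assumes a toy deterministic MDP, in which case the transition term of the usual BSM recursion reduces to the distance between successor states, and it uses reward weight 1 and transition weight equal to the discount factor. The arrays `R` and `nxt`, the value of `gamma`, and both function names are hypothetical and chosen only for illustration.

```python
import numpy as np

def bisim_metric_deterministic(R, nxt, gamma, iters=500):
    """Fixed-point iteration for the standard BSM on a deterministic MDP:
    d(s, t) = max_a |R[s, a] - R[t, a]| + gamma * d(nxt[s, a], nxt[t, a]).
    (Assumption: deterministic transitions, so no Wasserstein solve is needed.)
    """
    n, m = R.shape
    d = np.zeros((n, n))
    for _ in range(iters):
        new = np.zeros((n, n))
        for a in range(m):
            # Pairwise reward gap under action a.
            reward_gap = np.abs(R[:, a][:, None] - R[:, a][None, :])
            # Distance between the successor states of every state pair.
            next_gap = d[nxt[:, a][:, None], nxt[:, a][None, :]]
            new = np.maximum(new, reward_gap + gamma * next_gap)
        d = new
    return d

def optimal_values(R, nxt, gamma, iters=500):
    """Plain value iteration for the same deterministic MDP."""
    V = np.zeros(R.shape[0])
    for _ in range(iters):
        V = np.max(R + gamma * V[nxt], axis=1)
    return V

# Hypothetical toy MDP: 3 states, 2 actions.
R = np.array([[1.0, 0.0],
              [0.9, 0.0],
              [0.0, 0.5]])      # R[s, a]: reward
nxt = np.array([[0, 2],
                [1, 2],
                [2, 0]])        # nxt[s, a]: successor state
gamma = 0.9

d = bisim_metric_deterministic(R, nxt, gamma)
V = optimal_values(R, nxt, gamma)
value_gap = np.abs(V[:, None] - V[None, :])

# The BSM upper-bounds the gap in optimal values for every state pair.
print(np.all(value_gap <= d + 1e-8))
```

With this weighting, the gap in optimal values between any two states is bounded above by their BSM distance, which is the single-MDP property that the abstract's inter-MDP analysis generalizes.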