Vector-Valued Distributional Reinforcement Learning Policy Evaluation: A Hilbert Space Embedding Approach

πŸ“… 2026-01-26
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the challenge of off-policy evaluation in distributional reinforcement learning within high-dimensional continuous state-action spaces, where traditional approaches struggle due to the computational intractability of Wasserstein distance. To overcome this limitation, we propose the KE-DRL framework, which introduces kernel mean embeddings into distributional reinforcement learning for the first time. By replacing Wasserstein distance with an integral probability metric in a reproducing kernel Hilbert space, our method significantly improves computational efficiency. Leveraging the MatΓ©rn kernel, we establish contraction properties of the distributional Bellman operator and provide uniform convergence guarantees under assumptions of kernel Lipschitz continuity and boundedness. Empirical results demonstrate that KE-DRL enables robust off-policy evaluation and accurate recovery of value distributions in complex decision-making and risk-sensitive tasks.

Technology Category

Application Category

πŸ“ Abstract
We propose an (offline) multi-dimensional distributional reinforcement learning framework (KE-DRL) that leverages Hilbert space mappings to estimate the kernel mean embedding of the multi-dimensional value distribution under a proposed target policy. In our setting, the state-action variables are multi-dimensional and continuous. By mapping probability measures into a reproducing kernel Hilbert space via kernel mean embeddings, our method replaces Wasserstein metrics with an integral probability metric. This enables efficient estimation in multi-dimensional state-action spaces and reward settings, where direct computation of Wasserstein distances is computationally challenging. Theoretically, we establish contraction properties of the distributional Bellman operator under our proposed metric involving the Matern family of kernels and provide uniform convergence guarantees. Simulations and empirical results demonstrate robust off-policy evaluation and recovery of the kernel mean embedding under mild assumptions, namely, Lipschitz continuity and boundedness of the kernels, highlighting the potential of embedding-based approaches in complex real-world decision-making scenarios and risk evaluation.
Problem

Research questions and friction points this paper is trying to address.

distributional reinforcement learning
off-policy evaluation
multi-dimensional value distribution
Hilbert space embedding
Wasserstein distance
Innovation

Methods, ideas, or system contributions that make the work stand out.

kernel mean embedding
distributional reinforcement learning
Hilbert space embedding
integral probability metric
off-policy evaluation
πŸ”Ž Similar Papers