Solving Finite-Horizon MDPs via Low-Rank Tensors

📅 2025-01-17
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the curse of dimensionality and poor sample efficiency in learning high-dimensional nonstationary value functions for finite-horizon Markov decision processes (MDPs), this paper proposes a low-rank tensor modeling framework. We formulate the value function as a time-varying low-rank tensor to explicitly capture its structured temporal dependencies. Methodologically, we design a block-coordinate gradient descent (BCGD) algorithm with theoretical convergence guarantees, enabling value estimation from sampled trajectories without knowledge of the underlying dynamics; the approach jointly integrates low-rank tensor decomposition with constrained Bellman equation optimization. Experiments on synthetic benchmarks and real-world resource allocation tasks demonstrate that our method reduces computational overhead by one to two orders of magnitude while preserving near-optimality of the induced policy, thereby validating its scalability and practical utility.

Technology Category

Application Category

📝 Abstract
We study the problem of learning optimal policies in finite-horizon Markov Decision Processes (MDPs) using low-rank reinforcement learning (RL) methods. In finite-horizon MDPs, the policies, and therefore the value functions (VFs) are not stationary. This aggravates the challenges of high-dimensional MDPs, as they suffer from the curse of dimensionality and high sample complexity. To address these issues, we propose modeling the VFs of finite-horizon MDPs as low-rank tensors, enabling a scalable representation that renders the problem of learning optimal policies tractable. We introduce an optimization-based framework for solving the Bellman equations with low-rank constraints, along with block-coordinate descent (BCD) and block-coordinate gradient descent (BCGD) algorithms, both with theoretical convergence guarantees. For scenarios where the system dynamics are unknown, we adapt the proposed BCGD method to estimate the VFs using sampled trajectories. Numerical experiments further demonstrate that the proposed framework reduces computational demands in controlled synthetic scenarios and more realistic resource allocation problems.
Problem

Research questions and friction points this paper is trying to address.

Time-constrained decision making
Dynamic value optimization
Efficient learning methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Time-constrained Decision-making
Matrix Simplification
Efficient Resource Allocation
🔎 Similar Papers
No similar papers found.