Solving Finite-Horizon MDPs via Low-Rank Tensors

📅 2025-01-17

📈 Citations: 0

✨ Influential: 0

career value

201K/year

🤖 AI Summary

To address the curse of dimensionality and poor sample efficiency in learning high-dimensional nonstationary value functions for finite-horizon Markov decision processes (MDPs), this paper proposes a low-rank tensor modeling framework. We formulate the value function as a time-varying low-rank tensor to explicitly capture its structured temporal dependencies. Methodologically, we design a block-coordinate gradient descent (BCGD) algorithm with theoretical convergence guarantees, enabling value estimation from sampled trajectories without knowledge of the underlying dynamics; the approach jointly integrates low-rank tensor decomposition with constrained Bellman equation optimization. Experiments on synthetic benchmarks and real-world resource allocation tasks demonstrate that our method reduces computational overhead by one to two orders of magnitude while preserving near-optimality of the induced policy, thereby validating its scalability and practical utility.

Technology Category

Application Category

📝 Abstract

We study the problem of learning optimal policies in finite-horizon Markov Decision Processes (MDPs) using low-rank reinforcement learning (RL) methods. In finite-horizon MDPs, the policies, and therefore the value functions (VFs) are not stationary. This aggravates the challenges of high-dimensional MDPs, as they suffer from the curse of dimensionality and high sample complexity. To address these issues, we propose modeling the VFs of finite-horizon MDPs as low-rank tensors, enabling a scalable representation that renders the problem of learning optimal policies tractable. We introduce an optimization-based framework for solving the Bellman equations with low-rank constraints, along with block-coordinate descent (BCD) and block-coordinate gradient descent (BCGD) algorithms, both with theoretical convergence guarantees. For scenarios where the system dynamics are unknown, we adapt the proposed BCGD method to estimate the VFs using sampled trajectories. Numerical experiments further demonstrate that the proposed framework reduces computational demands in controlled synthetic scenarios and more realistic resource allocation problems.

Problem

Research questions and friction points this paper is trying to address.

Time-constrained decision making

Dynamic value optimization

Efficient learning methods

Innovation

Methods, ideas, or system contributions that make the work stand out.

Time-constrained Decision-making

Matrix Simplification

Efficient Resource Allocation

🔎 Similar Papers

No similar papers found.