🤖 AI Summary
To address the high computational cost of dynamic programming (DP) in large state-action spaces and in problems with long-term dependencies, this paper proposes BellNet, a framework that unrolls truncated policy iteration into a cascade of learnable nonlinear graph filters, thereby recasting value function iteration from a graph signal processing perspective. Grounded in Markov decision processes, BellNet is fully differentiable and trained end-to-end by minimizing the Bellman error from random value function initializations. Its key properties are learnability, cross-task transferability, and explicit control over computational complexity at inference time, leading to a substantial reduction in required iterations. Preliminary experiments in a grid-like environment show that BellNet reaches near-optimal policies in a small fraction of the iterations required by classical DP algorithms.
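To make the "unrolled, truncated iteration" idea concrete, here is a minimal sketch of the classical computation BellNet unrolls: each "layer" applies one Bellman optimality backup, and truncating at K layers gives the explicit handle on inference complexity mentioned above. The toy MDP (sizes, rewards) is invented for illustration and is not from the paper, and the backups here use fixed (non-learned) parameters.

```python
import numpy as np

# Toy MDP (assumed for illustration): nS states, nA actions.
rng = np.random.default_rng(0)
nS, nA, gamma = 4, 2, 0.9

# P[a] is the nS x nS transition matrix under action a; r[a] the rewards.
P = rng.random((nA, nS, nS))
P /= P.sum(axis=2, keepdims=True)      # make each row a probability distribution
r = rng.random((nA, nS))

def bellman_backup(V):
    # One Bellman optimality backup: max over actions of r_a + gamma * P_a @ V.
    return np.max(r + gamma * P @ V, axis=0)

V = rng.standard_normal(nS)            # random value function initialization
for _ in range(100):                   # K unrolled "layers" (truncated iteration)
    V = bellman_backup(V)

# Bellman error: residual of the fixed-point equation V = T(V).
bellman_error = np.linalg.norm(bellman_backup(V) - V)
print(bellman_error)
```

BellNet replaces the fixed backups with learnable parameters and trains them so that far fewer layers suffice; the residual printed here is the quantity (Bellman error) that such training would minimize.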
📝 Abstract
Dynamic programming (DP) is a fundamental tool used across many engineering fields. The main goal of DP is to solve Bellman's optimality equations for a given Markov decision process (MDP). Standard methods like policy iteration exploit the fixed-point nature of these equations to solve them iteratively. However, these algorithms can be computationally expensive when the state-action space is large or when the problem involves long-term dependencies. Here we propose a new approach that unrolls and truncates policy iterations into a learnable parametric model dubbed BellNet, which we train to minimize the so-termed Bellman error from random value function initializations. Viewing the transition probability matrix of the MDP as the adjacency of a weighted directed graph, we draw insights from graph signal processing to interpret (and compactly re-parameterize) BellNet as a cascade of nonlinear graph filters. This fresh look facilitates a concise, transferable, and unifying representation of policy and value iteration, with an explicit handle on complexity during inference. Preliminary experiments conducted in a grid-like environment demonstrate that BellNet can effectively approximate optimal policies in a fraction of the iterations required by classical methods.
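The graph signal processing view in the abstract can be illustrated with a small numerical sketch (toy example; all names and sizes are mine, not the paper's). Treating the transition matrix P_pi of a fixed policy as the adjacency matrix of a weighted directed graph, policy evaluation V = (I - gamma * P_pi)^{-1} r_pi is a graph filter: a (truncated) polynomial in the graph shift operator P_pi applied to the reward signal r_pi.

```python
import numpy as np

rng = np.random.default_rng(1)
n, gamma, K = 5, 0.8, 60

# Row-stochastic P_pi: the adjacency matrix of a weighted directed graph.
P_pi = rng.random((n, n))
P_pi /= P_pi.sum(axis=1, keepdims=True)
r_pi = rng.random(n)                    # reward signal on the graph's nodes

# Exact policy evaluation (closed form).
V_exact = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

# Graph-filter view: truncated Neumann series sum_k (gamma * P_pi)^k r_pi,
# i.e., a degree-(K-1) polynomial graph filter applied to r_pi.
V_filter = np.zeros(n)
x = r_pi.copy()
for _ in range(K):
    V_filter += x
    x = gamma * P_pi @ x

truncation_error = np.max(np.abs(V_filter - V_exact))
print(truncation_error)
```

The filter here has fixed coefficients gamma^k; BellNet's reparameterization makes such filter coefficients learnable, which is what allows a short cascade to approximate what classical iteration achieves only after many steps.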