Relating Reinforcement Learning to Dynamic Programming-Based Planning

📅 2026-03-08
📈 Citations: 0 · Influential citations: 0
🤖 AI Summary
This work bridges the gap between reinforcement learning and dynamic programming-based planning in terms of objective formulation, modeling assumptions, and optimization criteria. By developing a derandomized reinforcement learning framework, it establishes both theoretical and empirical connections to value iteration and Dijkstra's algorithm, unifying the cost-minimization and reward-maximization paradigms. The core contributions are: identifying conditions under which the two objective formulations are equivalent, establishing the equivalence of single-shot goal-termination tasks and infinite-horizon episodic learning, precisely characterizing the conditions under which discounting misaligns the optimized objective with goal achievement, and proposing an optimization objective centered on true cost rather than arbitrary tuning parameters. The approach is validated in both deterministic and stochastic environments, with performance compared via planning-oriented evaluation metrics.
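
The claim that discounting can misalign the optimized objective with goal achievement admits a compact worked example. The construction below is an illustration under assumed rewards, not a result quoted from the paper:

```latex
% Hedged illustration with assumed rewards, not the paper's construction:
% every non-goal step yields reward $-1$; the only transition into the goal
% yields reward $-100$. With discount factor $\gamma = 0.9$:
V_{\mathrm{wander}} = \sum_{k=0}^{\infty} \gamma^{k}(-1)
  = -\frac{1}{1-\gamma} = -10,
\qquad
V_{\mathrm{goal}} = -100 .
% Since $V_{\mathrm{wander}} > V_{\mathrm{goal}}$, the discounted-optimal
% policy wanders forever, even though the true (undiscounted) cost of
% reaching the goal is finite.
```

Without discounting (γ = 1), wandering accumulates unbounded cost and the goal path wins, which is the regime where cost minimization and reward maximization with r = -l coincide.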

📝 Abstract
This paper bridges some of the gap between optimal planning and reinforcement learning (RL), both of which share roots in dynamic programming applied to sequential decision making or optimal control. Whereas planning typically favors deterministic models, goal termination, and cost minimization, RL tends to favor stochastic models, infinite-horizon discounting, and reward maximization in addition to learning-related parameters such as the learning rate and greediness factor. A derandomized version of RL is developed, analyzed, and implemented to yield performance comparisons with value iteration and Dijkstra's algorithm using simple planning models. Next, mathematical analysis shows: 1) conditions under which cost minimization and reward maximization are equivalent, 2) conditions for equivalence of single-shot goal termination and infinite-horizon episodic learning, and 3) conditions under which discounting causes goal achievement to fail. The paper then advocates for defining and optimizing true cost, rather than inserting arbitrary parameters to guide operations. Performance studies are then extended to the stochastic case, using planning-oriented criteria and comparing value iteration to RL with learning rates and greediness factors.
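
As a concrete companion to the deterministic comparison described in the abstract, the sketch below solves a toy shortest-path model with both undiscounted value iteration and a goal-rooted Dijkstra search. The graph, edge costs, and function names are assumptions made for illustration; the paper's own models and implementation may differ:

```python
import heapq

# Illustrative deterministic planning model; node names and edge costs are
# made-up assumptions for this sketch, not the paper's benchmark.
GRAPH = {
    "s": {"a": 1.0, "b": 4.0},
    "a": {"b": 2.0, "goal": 5.0},
    "b": {"goal": 1.0},
    "goal": {},  # absorbing goal state
}

def value_iteration(graph, goal):
    """Undiscounted cost-to-go via Bellman backups: V(x) = min_x' [c(x, x') + V(x')]."""
    V = {x: (0.0 if x == goal else float("inf")) for x in graph}
    changed = True
    while changed:  # sweep until no backup improves any state
        changed = False
        for x, edges in graph.items():
            if x == goal or not edges:
                continue
            best = min(cost + V[y] for y, cost in edges.items())
            if best < V[x]:
                V[x] = best
                changed = True
    return V

def dijkstra_cost_to_go(graph, goal):
    """Cost-to-go for every node by running Dijkstra outward from the goal."""
    rev = {x: {} for x in graph}  # reverse edges so the goal becomes the source
    for x, edges in graph.items():
        for y, cost in edges.items():
            rev[y][x] = cost
    dist = {x: float("inf") for x in graph}
    dist[goal] = 0.0
    frontier = [(0.0, goal)]
    while frontier:
        d, x = heapq.heappop(frontier)
        if d > dist[x]:
            continue  # stale queue entry
        for y, cost in rev[x].items():
            if d + cost < dist[y]:
                dist[y] = d + cost
                heapq.heappush(frontier, (d + cost, y))
    return dist

if __name__ == "__main__":
    # Both solvers agree on deterministic models: V(s) = 4.0 via s -> a -> b -> goal.
    print(value_iteration(GRAPH, "goal"))
    print(dijkstra_cost_to_go(GRAPH, "goal"))
```

On deterministic models with nonnegative edge costs the two cost-to-go tables coincide; Dijkstra's algorithm can be read as ordering the Bellman backups so that each state is finalized exactly once.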
Problem

Research questions and friction points this paper is trying to address.

Reinforcement Learning
Dynamic Programming
Optimal Planning
Cost Minimization
Reward Maximization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Derandomized Reinforcement Learning
Dynamic Programming
True Cost Optimization
Equivalence Analysis
Value Iteration
Filip V. Georgiev
Center for Applied Computing, Faculty of Information Technology and Electrical Engineering, University of Oulu, Finland
Kalle G. Timperi
PhD, postdoctoral researcher at the University of Oulu
Complex systems, random dynamical systems, theory of computation
Başak Sakçak
Center for Applied Computing, Faculty of Information Technology and Electrical Engineering, University of Oulu, Finland; Dept. of Advanced Computing Sciences, Maastricht University, the Netherlands
Steven M. LaValle
Professor of Robotics and Virtual Reality, University of Oulu, Finland
Robotics, virtual reality, sensor fusion, motion planning, control theory