Completion vs Optimality: Policy Gradient in Long-Horizon Cumulative-Damage Problems

📅 2026-05-26
🤖 AI Summary
This study addresses two failure modes of policy-gradient methods in long-horizon decision problems with cumulative damage, where locally attractive actions lead to globally adverse outcomes. The work decomposes failure into two orthogonal dimensions: “completion” (whether the agent reaches the terminal horizon rather than exiting early via an implicit terminal constraint) and “optimality” (whether the policy matches the dynamic-programming reference, conditional on completion). Under PPO with a linear soft penalty, horizon access alone reduces the completion rate, whereas combining action-space restriction with horizon access achieves completion but leaves an optimality gap. The authors derive four testable predictions and evaluate them in two separately calibrated environments that share the same abstract structure: a 49-step bricklayer career and a 20-season NBA power-forward career. All four predictions replicate qualitatively. The horizon-invariance prediction holds at three of the four tested horizons, and the exception at H=15 is consistent with the theoretical boundary H* (H* ∈ [6, 14] under the NBA parameters).
📝 Abstract
Long-horizon decision problems with cumulative damage couple locally attractive actions to globally adverse outcomes. We identify two orthogonal failure modes for policy-gradient methods on this class and propose a decomposition that separates them: completion (reaching the terminal horizon rather than exiting via an implicit terminal constraint) and optimality (matching the dynamic-programming reference given completion). Under PPO with a linear soft penalty, granting horizon access alone reduces the completion rate, because the penalty's equilibrium drives the dominant-activity share to zero. Action-space restriction combined with horizon access achieves completion but leaves an optimality gap ($\Delta M_{\text{final}} = 0.271$) that we trace to first-phase greedy commitment at the damage origin. We derive four testable predictions and evaluate them in two separately calibrated environments that share the same abstract structure but differ in domain, horizon, activity set, and calibration data: a 49-step bricklayer career and a 20-season NBA power-forward career. All four predictions replicate qualitatively. The horizon-invariance prediction is met at three of four tested horizons, with the exception at $H = 15$ consistent with the $H^*$ boundary ($H^* \in [6, 14]$ under the NBA parameters).
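The completion/optimality decomposition described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Episode` record, the `decompose` helper, the definition of the gap as the dynamic-programming reference minus the mean final value of completed episodes, and the toy numbers. None of these are taken from the paper's code or results; only the 49-step horizon echoes the bricklayer setting.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class Episode:
    """Hypothetical per-episode record (not the paper's data format)."""
    steps_survived: int  # steps taken before the episode ended
    final_value: float   # outcome metric at episode end (illustrative stand-in)


def decompose(
    episodes: Sequence[Episode],
    horizon: int,
    dp_reference: float,
) -> Tuple[float, Optional[float]]:
    """Split evaluation into completion and optimality.

    Completion: fraction of episodes that reach the terminal horizon.
    Optimality gap (assumed definition): dp_reference minus the mean
    final value, computed only over completed episodes. Returns None
    for the gap when no episode completes, since it is then undefined.
    """
    if not episodes:
        return 0.0, None
    completed = [ep for ep in episodes if ep.steps_survived >= horizon]
    completion_rate = len(completed) / len(episodes)
    if not completed:
        return completion_rate, None
    mean_value = sum(ep.final_value for ep in completed) / len(completed)
    return completion_rate, dp_reference - mean_value


# Toy data (invented): three episodes reach step 49, one exits early.
episodes = [
    Episode(steps_survived=49, final_value=0.5),
    Episode(steps_survived=49, final_value=0.75),
    Episode(steps_survived=20, final_value=0.25),
    Episode(steps_survived=49, final_value=1.0),
]
rate, gap = decompose(episodes, horizon=49, dp_reference=1.0)
print(rate, gap)  # prints "0.75 0.25"
```

Conditioning the gap on completed episodes keeps the two quantities separate: an early exit lowers the completion rate without contaminating the optimality estimate, which is the point of treating the failure modes as orthogonal.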
Problem

Research questions and friction points this paper is trying to address.

long-horizon
cumulative damage
policy gradient
completion
optimality
Innovation

Methods, ideas, or system contributions that make the work stand out.

policy gradient
cumulative damage
completion vs optimality
long-horizon decision making
dynamic programming