🤖 AI Summary
This paper addresses realistic reinforcement learning scenarios in which trajectory termination times are both stochastic and policy-dependent, challenging the conventional assumptions of fixed or infinite horizons. Method: We develop a dual modeling framework integrating trajectory- and state-space perspectives, grounded in stochastic process theory and optimal control principles, and propose a Monte Carlo-based gradient estimation scheme. Contribution/Results: (1) We rigorously derive policy gradient theorems for policy-dependent stochastic horizons, largely for the first time, covering both stochastic and deterministic policies; (2) we unify the theoretical foundations of policy optimization across finite-, infinite-, and stochastic-horizon settings; and (3) numerical experiments show that the proposed gradient estimator can significantly improve convergence speed and training stability compared to classical methods, including those assuming fixed or discounted infinite horizons, thereby enabling more robust and efficient learning in real-world sequential decision-making problems with uncertain termination.
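For orientation, the schematic below (the notation is illustrative and not taken from the paper) writes the objective under a policy-dependent stopping time τ together with the classical likelihood-ratio gradient form for such variable-length trajectories; the paper's theorems concern how the randomness of τ must be handled rigorously in expressions of this kind.

```latex
% Schematic only: illustrative notation, not the paper's theorem.
% \tau is a stopping time that depends on the policy \pi_\theta through the visited states.
\[
  J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\sum_{t=0}^{\tau-1} \gamma^{t}\, r(s_t, a_t)\right],
  \qquad
  \nabla_\theta J(\theta) \approx
  \mathbb{E}_{\pi_\theta}\!\left[\left(\sum_{t=0}^{\tau-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right)
  \left(\sum_{t=0}^{\tau-1} \gamma^{t}\, r(s_t, a_t)\right)\right]
  \quad \text{(classical likelihood-ratio form).}
\]
```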
📝 Abstract
We extend the standard reinforcement learning framework to random time horizons. While the classical setting typically assumes trajectories with finite, deterministic runtimes or with infinite runtimes, we argue that many real-world applications naturally exhibit random (potentially trajectory-dependent) stopping times. Since these stopping times typically depend on the policy, their randomness affects the policy gradient formulas, which we derive rigorously in this work, largely for the first time, for both stochastic and deterministic policies. We present two complementary perspectives, one trajectory-based and one state-space-based, and establish connections to optimal control theory. Our numerical experiments demonstrate that using the proposed formulas can significantly improve optimization convergence compared to traditional approaches.
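As a concrete (hypothetical) illustration of the Monte Carlo setting, the sketch below applies a standard REINFORCE-style estimator to a toy chain environment whose episode length is a random, policy-dependent stopping time. The environment, the tabular softmax policy, and all hyperparameters are assumptions made for this example; it does not implement the paper's corrected stochastic-horizon gradient formulas, only the classical baseline such formulas would be compared against.

```python
# Illustrative sketch only: a classical REINFORCE-style Monte Carlo gradient estimator
# on a toy chain environment whose episode length is a random, policy-dependent
# stopping time (absorption at the right-most state). Environment, policy
# parametrization, and hyperparameters are assumptions for this example.
import numpy as np

rng = np.random.default_rng(0)

N_STATES, N_ACTIONS = 5, 2        # states 0..4; state 4 is absorbing/terminal
GAMMA = 0.99

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def step(s, a):
    """Toy dynamics: action 1 usually moves right (toward termination), action 0 rarely does."""
    p_right = 0.8 if a == 1 else 0.1
    s_next = s + 1 if rng.random() < p_right else s
    done = (s_next == N_STATES - 1)
    reward = 1.0 if done else -0.01   # small per-step cost, bonus on termination
    return s_next, reward, done

def sample_trajectory(theta, max_len=500):
    """Roll out until the (policy-dependent) stopping time, i.e. absorption."""
    s, traj = 0, []
    for _ in range(max_len):
        a = rng.choice(N_ACTIONS, p=softmax(theta[s]))
        s_next, r, done = step(s, a)
        traj.append((s, a, r))
        s = s_next
        if done:
            break
    return traj

def mc_policy_gradient(theta, n_traj=64):
    """Monte Carlo (REINFORCE) gradient estimate over variable-length episodes."""
    grad = np.zeros_like(theta)
    for _ in range(n_traj):
        traj = sample_trajectory(theta)
        # discounted return-to-go at every step of this random-length episode
        G, returns = 0.0, []
        for (_, _, r) in reversed(traj):
            G = r + GAMMA * G
            returns.append(G)
        returns.reverse()
        for t, ((s, a, _), G_t) in enumerate(zip(traj, returns)):
            dlogpi = -softmax(theta[s])        # gradient of log pi(a|s) for a softmax policy
            dlogpi[a] += 1.0
            grad[s] += (GAMMA ** t) * dlogpi * G_t
    return grad / n_traj

# Usage: a few steps of gradient ascent on the toy problem.
theta = np.zeros((N_STATES, N_ACTIONS))
for _ in range(200):
    theta += 0.1 * mc_policy_gradient(theta)
```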