🤖 AI Summary
This paper addresses realistic reinforcement learning scenarios in which trajectory termination times are both stochastic and policy-dependent, challenging the conventional assumptions of fixed or infinite horizons. Method: We develop a dual modeling framework integrating trajectory- and state-space perspectives, grounded in stochastic process theory and optimal control principles, and propose a Monte Carlo-based gradient estimation scheme. Contribution/Results: (1) We rigorously derive policy gradient theorems for policy-dependent stochastic horizons, largely for the first time, covering both stochastic and deterministic policies; (2) we unify the theoretical foundations of policy optimization across finite-, infinite-, and stochastic-horizon settings; and (3) numerical experiments show that the proposed gradient estimator can significantly improve convergence speed and training stability compared to classical methods, including those assuming fixed or discounted infinite horizons, thereby enabling more robust and efficient learning in real-world sequential decision-making problems with uncertain termination.
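For orientation, the schematic below (the notation is illustrative and not taken from the paper) writes the objective under a policy-dependent stopping time τ together with the classical likelihood-ratio gradient form for such variable-length trajectories; the paper's theorems concern how the randomness of τ must be handled rigorously in expressions of this kind.

```latex
% Schematic only: illustrative notation, not the paper's theorem.
% \tau is a stopping time that depends on the policy \pi_\theta through the visited states.
\[
  J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\sum_{t=0}^{\tau-1} \gamma^{t}\, r(s_t, a_t)\right],
  \qquad
  \nabla_\theta J(\theta) \approx
  \mathbb{E}_{\pi_\theta}\!\left[\left(\sum_{t=0}^{\tau-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right)
  \left(\sum_{t=0}^{\tau-1} \gamma^{t}\, r(s_t, a_t)\right)\right]
  \quad \text{(classical likelihood-ratio form).}
\]
```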
📝 Abstract
We extend the standard reinforcement learning framework to random time horizons. While the classical setting typically assumes trajectories with finite, deterministic runtimes or with infinite runtimes, we argue that many real-world applications naturally exhibit random (potentially trajectory-dependent) stopping times. Since these stopping times typically depend on the policy, their randomness affects the policy gradient formulas, which we derive rigorously in this work, largely for the first time, for both stochastic and deterministic policies. We present two complementary perspectives, one trajectory-based and one state-space-based, and establish connections to optimal control theory. Our numerical experiments demonstrate that using the proposed formulas can significantly improve optimization convergence compared to traditional approaches.
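As a concrete (hypothetical) illustration of the Monte Carlo setting, the sketch below applies a standard REINFORCE-style estimator to a toy chain environment whose episode length is a random, policy-dependent stopping time. The environment, the tabular softmax policy, and all hyperparameters are assumptions made for this example; it does not implement the paper's corrected stochastic-horizon gradient formulas, only the classical baseline such formulas would be compared against.

```python
# Illustrative sketch only: a classical REINFORCE-style Monte Carlo gradient estimator
# on a toy chain environment whose episode length is a random, policy-dependent
# stopping time (absorption at the right-most state). Environment, policy
# parametrization, and hyperparameters are assumptions for this example.
import numpy as np

rng = np.random.default_rng(0)

N_STATES, N_ACTIONS = 5, 2        # states 0..4; state 4 is absorbing/terminal
GAMMA = 0.99

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def step(s, a):
    """Toy dynamics: action 1 usually moves right (toward termination), action 0 rarely does."""
    p_right = 0.8 if a == 1 else 0.1
    s_next = s + 1 if rng.random() < p_right else s
    done = (s_next == N_STATES - 1)
    reward = 1.0 if done else -0.01   # small per-step cost, bonus on termination
    return s_next, reward, done

def sample_trajectory(theta, max_len=500):
    """Roll out until the (policy-dependent) stopping time, i.e. absorption."""
    s, traj = 0, []
    for _ in range(max_len):
        a = rng.choice(N_ACTIONS, p=softmax(theta[s]))
        s_next, r, done = step(s, a)
        traj.append((s, a, r))
        s = s_next
        if done:
            break
    return traj

def mc_policy_gradient(theta, n_traj=64):
    """Monte Carlo (REINFORCE) gradient estimate over variable-length episodes."""
    grad = np.zeros_like(theta)
    for _ in range(n_traj):
        traj = sample_trajectory(theta)
        # discounted return-to-go at every step of this random-length episode
        G, returns = 0.0, []
        for (_, _, r) in reversed(traj):
            G = r + GAMMA * G
            returns.append(G)
        returns.reverse()
        for t, ((s, a, _), G_t) in enumerate(zip(traj, returns)):
            dlogpi = -softmax(theta[s])        # gradient of log pi(a|s) for a softmax policy
            dlogpi[a] += 1.0
            grad[s] += (GAMMA ** t) * dlogpi * G_t
    return grad / n_traj

# Usage: a few steps of gradient ascent on the toy problem.
theta = np.zeros((N_STATES, N_ACTIONS))
for _ in range(200):
    theta += 0.1 * mc_policy_gradient(theta)
```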