AI Summary
This paper studies the continuous-time scalar linear-quadratic (LQ) reinforcement learning problem with state- and control-coupled volatility. Under a model-free setting and without instantaneous control rewards, we propose a novel actor-critic algorithm featuring the first dynamic adaptive exploration mechanism. We establish theoretical guarantees: the policy parameters converge to the optimal solution at an explicit rate, and the cumulative regret is bounded by $O(N^{3/4}\log N)$, breaking the convergence-rate bottleneck of existing model-based approaches. Numerical simulations verify the tightness of this bound and demonstrate that our method significantly outperforms adapted model-based LQ algorithms under identical settings. Our core contributions are threefold: (i) the first model-free RL framework for LQ problems with state- and control-coupled volatility; (ii) the design of a dynamic exploration mechanism that adapts to learning progress; and (iii) the sharpest regret convergence rate analysis to date for such problems.
Abstract
We study reinforcement learning (RL) for a class of continuous-time linear-quadratic (LQ) control problems for diffusions, where states are scalar-valued and running control rewards are absent but volatilities of the state processes depend on both state and control variables. We apply a model-free approach that relies neither on knowledge of model parameters nor on their estimations, and devise an actor-critic algorithm to learn the optimal policy parameter directly. Our main contributions include the introduction of an exploration schedule and a regret analysis of the proposed algorithm. We provide the convergence rate of the policy parameter to the optimal one, and prove that the algorithm achieves a regret bound of $O(N^{\frac{3}{4}})$ up to a logarithmic factor, where $N$ is the number of learning episodes. We conduct a simulation study to validate the theoretical results and demonstrate the effectiveness and reliability of the proposed algorithm. We also perform numerical comparisons between our method and those of the recent model-based stochastic LQ RL studies adapted to the state- and control-dependent volatility setting, demonstrating a better performance of the former in terms of regret bounds.
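To make the problem class concrete, the following is a minimal sketch of one simulated episode, not the algorithm from the paper. It assumes a generic scalar diffusion of the form dx = (A x + B u) dt + (C x + D u) dW, a quadratic cost in the state only (the negative of a reward with no running control term), and a Gaussian exploratory policy centred on a linear feedback phi * x. The function name `simulate_episode`, every coefficient value, and the Euler-Maruyama discretization are illustrative assumptions that do not come from the source.

```python
import numpy as np

def simulate_episode(phi, explore_std, rng, A=-0.5, B=1.0, C=0.2, D=0.5,
                     Q=1.0, H=1.0, x0=1.0, T=1.0, steps=200):
    """Euler-Maruyama rollout of dx = (A x + B u) dt + (C x + D u) dW.

    All coefficients are hypothetical placeholders. The control is sampled
    from Normal(phi * x, explore_std**2); the volatility depends on both the
    state x and the control u. Returns the realized quadratic cost.
    """
    dt = T / steps
    x = x0
    cost = 0.0
    for _ in range(steps):
        # Exploratory action: linear feedback plus Gaussian noise.
        u = phi * x + explore_std * rng.standard_normal()
        # Running cost penalizes the state only (no control term).
        cost += 0.5 * Q * x * x * dt
        # Brownian increment over one time step.
        dw = np.sqrt(dt) * rng.standard_normal()
        x += (A * x + B * u) * dt + (C * x + D * u) * dw
    # Terminal cost on the final state.
    return cost + 0.5 * H * x * x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    costs = [simulate_episode(phi=-1.0, explore_std=0.3, rng=rng)
             for _ in range(500)]
    print("average episode cost:", np.mean(costs))
```

A model-free learner in this setting would only observe trajectories and costs such as those produced above, never the coefficients A, B, C, D themselves. Cumulative regret over N episodes is then the sum of the gaps between the expected cost of each episode's policy and that of the optimal policy; a bound of order $N^{3/4}$ means the average gap per episode shrinks roughly like $N^{-1/4}$.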