🤖 AI Summary
Continuous-time reinforcement learning (CTRL) suffers from fundamental theoretical bottlenecks in sample and computational efficiency under general function approximation.
Method: We propose the first model-based CTRL algorithm that is both sample- and computationally efficient. Our approach establishes the first finite-sample complexity upper bound for CTRL under general function approximation; introduces structured policy updates and a novel measurement strategy; and integrates optimistic confidence set construction, distributional Eluder dimension analysis, model-based dynamics learning, and structured optimization.
Contribution/Results: Theoretically, we establish a suboptimality bound of Õ(√(d_R + d_F)/√N), where d_R and d_F are problem-dependent dimensions characterizing the complexity of the reward and transition functions. Empirically, our algorithm achieves performance competitive with state-of-the-art baselines on continuous control and diffusion model fine-tuning tasks—using significantly fewer policy updates and trajectory rollouts—thereby enhancing practicality and scalability.
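The Õ(√(d_R + d_F)/√N) guarantee says the suboptimality gap shrinks at the standard statistical rate in the number of measurements N: quadrupling N halves the bound. A minimal numerical sketch of this scaling, using made-up placeholder values for d_R and d_F (the true values are problem-dependent) and ignoring constants and logarithmic factors:

```python
import math

# Hypothetical illustrative values for the distributional Eluder
# dimensions of the reward and transition classes (made up here;
# in the paper they depend on the function classes used).
d_R, d_F = 10, 20

def suboptimality_bound(n_measurements: int, c: float = 1.0) -> float:
    """Rate sqrt(d_R + d_F) / sqrt(N), up to a constant c and log factors."""
    return c * math.sqrt(d_R + d_F) / math.sqrt(n_measurements)

# Quadrupling the number of measurements halves the bound:
for n in (1_000, 4_000, 16_000):
    print(f"N = {n:>6}: bound ≈ {suboptimality_bound(n):.4f}")
```

This is only an illustration of the rate, not of the algorithm itself; the constant `c` and the dimension values are placeholders.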
📝 Abstract
Continuous-time reinforcement learning (CTRL) provides a principled framework for sequential decision-making in environments where interactions evolve continuously over time. Despite its empirical success, the theoretical understanding of CTRL remains limited, especially in settings with general function approximation. In this work, we propose a model-based CTRL algorithm that achieves both sample and computational efficiency. Our approach leverages optimism-based confidence sets to establish the first sample complexity guarantee for CTRL with general function approximation, showing that a near-optimal policy can be learned with a suboptimality gap of $\tilde{O}(\sqrt{d_{\mathcal{R}} + d_{\mathcal{F}}}\,N^{-1/2})$ using $N$ measurements, where $d_{\mathcal{R}}$ and $d_{\mathcal{F}}$ denote the distributional Eluder dimensions of the reward and dynamics functions, respectively, capturing the complexity of general function approximation in reinforcement learning. Moreover, we introduce structured policy updates and an alternative measurement strategy that significantly reduce the number of policy updates and rollouts while maintaining competitive sample efficiency. We evaluate our proposed algorithms on continuous control tasks and diffusion model fine-tuning, demonstrating comparable performance with significantly fewer policy updates and rollouts.