🤖 AI Summary
Existing reinforcement learning (RL) algorithms rely on high-frequency decision-making, limiting their applicability to real-world low-frequency control tasks. To address this, we propose Sequence Reinforcement Learning (SRL), a framework that combines a learned dynamics model with an actor-critic architecture operating at different timescales to generate state-dependent action sequences, enabling low-frequency, computationally efficient control. Key contributions include: (i) a "temporal recall" mechanism in which the critic uses the learned dynamics model to estimate the intermediate states visited within each action sequence, enabling fine-grained credit assignment for every primitive action; (ii) a lightweight training paradigm in which the dynamics model is used only during training, so deployment remains model-free; and (iii) the Frequency-Averaged Score (FAS), a new metric for evaluating control across decision frequencies. Experiments on continuous-control benchmarks show that SRL matches state-of-the-art performance, significantly reduces actor sample complexity, substantially outperforms conventional RL in FAS, rivals model-based online planning, and reproduces the action-chunking behavior observed in the basal ganglia.
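As a rough illustration of the temporal-recall idea (not the paper's implementation), the sketch below shows how a learned dynamics model can fill in the states visited while an open-loop action sequence executes, so the critic can score each primitive action individually. The interfaces `model(s, a) -> s_next` and `critic(s, a) -> value` are assumptions made for this example.

```python
import torch

def temporal_recall_values(model, critic, state, action_seq):
    """Per-action learning signal for an open-loop action sequence.
    `model` and `critic` are assumed callables, used for illustration
    only; the paper defines the actual training interface.
    """
    values, s = [], state
    for a in action_seq:              # primitive actions in the chunk
        values.append(critic(s, a))   # credit assigned to this action
        s = model(s, a)               # imagined intermediate next state
    return torch.stack(values)        # one value per primitive action
```

Because the model is only consulted to produce these training signals, it can be discarded at deployment, leaving a model-free actor that emits the whole sequence in a single forward pass.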
📝 Abstract
Reinforcement learning (RL) is rapidly reaching and surpassing human-level control capabilities. However, state-of-the-art RL algorithms often require timesteps and reaction times significantly faster than human capabilities, which is impractical in real-world settings and typically necessitates specialized hardware. We introduce Sequence Reinforcement Learning (SRL), an RL algorithm designed to produce a sequence of actions for a given input state, enabling effective control at lower decision frequencies. SRL addresses the challenges of learning action sequences by employing both a model and an actor-critic architecture operating at different temporal scales. We propose a "temporal recall" mechanism, where the critic uses the model to estimate intermediate states between primitive actions, providing a learning signal for each individual action within the sequence. Once training is complete, the actor can generate action sequences independently of the model, achieving model-free control at a slower frequency. We evaluate SRL on a suite of continuous control tasks, demonstrating that it achieves performance comparable to state-of-the-art algorithms while significantly reducing actor sample complexity. To better assess performance across varying decision frequencies, we introduce the Frequency-Averaged Score (FAS) metric. Our results show that SRL significantly outperforms traditional RL algorithms in terms of FAS, making it particularly suitable for applications requiring variable decision frequencies. Furthermore, we compare SRL with model-based online planning, showing that SRL achieves comparable FAS while leveraging the same model during training that online planners use for planning. Lastly, we highlight the biological relevance of SRL, showing that it replicates the "action chunking" behavior observed in the basal ganglia, offering insights into brain-inspired control mechanisms.
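To make the FAS metric concrete, here is a minimal sketch assuming FAS is the mean normalized return averaged over a grid of evaluated decision frequencies. The normalization bounds (`random_score`, `expert_score`) and the frequency grid in the usage example are hypothetical; consult the paper for the metric's exact definition.

```python
import numpy as np

def frequency_averaged_score(returns_by_freq, random_score, expert_score):
    """Sketch of a Frequency-Averaged Score (FAS): normalize the mean
    return at each evaluated decision frequency, then average across
    frequencies. Bounds and frequency grid are assumptions, not the
    paper's exact specification.
    """
    normalized = [
        (np.mean(rets) - random_score) / (expert_score - random_score)
        for rets in returns_by_freq.values()
    ]
    return float(np.mean(normalized))

# Usage (hypothetical numbers): map decision frequency (Hz) -> episode returns.
fas = frequency_averaged_score(
    {50: [900.0, 880.0], 10: [780.0, 760.0], 2: [510.0, 495.0]},
    random_score=0.0,
    expert_score=1000.0,
)
```

A single-frequency score can hide how sharply a policy degrades when forced to act less often; averaging over frequencies rewards policies, like SRL's, that remain effective as the decision rate drops.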