🤖 AI Summary
This work addresses credit assignment over long sequences in streaming reinforcement learning under partial observability, where truncated backpropagation through time (TBPTT) collapses to a one-step gradient horizon and exact real-time recurrent learning (RTRL) is normally too expensive. The authors propose a streaming reinforcement learning method built on Recurrent Trace Units (RTUs), a diagonal recurrent architecture that makes exact RTRL tractable, with time and memory cost linear in the parameter count, and that needs no replay buffer or batched updates. The method integrates into existing streaming algorithms for both discrete and continuous control. On a MemoryChain diagnostic with chain lengths from 2 to 128, it sustains performance where streaming TBPTT(1) baselines using feedforward, GRU, and RTU networks collapse. On five POPGym tasks it is competitive with batched PPO, and on partially observable (masked) MuJoCo tasks it recovers a substantial fraction of batched performance.
📝 Abstract
Streaming reinforcement learning has emerged as an online learning paradigm that conforms to the restrictions of natural learning agents, which process data incrementally, i.e., with a batch size of 1 and no replay buffer. While streaming RL has recently been shown to scale with deep function approximation under full observability, partially observable settings have remained out of reach. Truncated backpropagation through time (TBPTT) collapses to a one-step gradient horizon in the streaming setting, and exact real-time recurrent learning (RTRL) is prohibitively expensive. We close this gap using recurrent trace units (RTUs), a diagonal recurrent architecture that enables exact RTRL with time and memory complexity linear in the parameter count, and show that they integrate cleanly into existing streaming algorithms across both discrete and continuous control. On a MemoryChain diagnostic with chain lengths from 2 to 128, our method sustains performance where streaming TBPTT(1) baselines using feedforward, GRU, and RTU networks collapse. The streaming approach is competitive with batched PPO on five POPGym tasks and recovers a substantial fraction of batched performance on partially observable (masked) MuJoCo continuous control, despite using no replay buffer or batched updates.
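To illustrate why a diagonal recurrence makes exact RTRL cheap, the following is a minimal sketch and not the authors' RTU implementation. It assumes a simplified real-valued, linear diagonal recurrence h_t = λ ⊙ h_{t-1} + W x_t. The class name `DiagonalRTRLCell` and the fields `lam`, `W`, `e_lam`, and `e_W` are hypothetical names chosen for this example. Because each hidden unit depends only on its own previous value, the sensitivity of the state to each parameter can be carried forward in a trace with the same shape as that parameter, so the per-step cost is linear in the parameter count and no past inputs need to be stored.

```python
import numpy as np


class DiagonalRTRLCell:
    """Hypothetical diagonal linear recurrence with exact RTRL traces.

    Assumed model (a simplification, not the RTU parameterization):
        h_t = lam * h_{t-1} + W @ x_t
    """

    def __init__(self, n_hidden, n_input, seed=0):
        rng = np.random.default_rng(seed)
        # Per-unit decay kept below 1 so the recurrence stays stable.
        self.lam = rng.uniform(0.5, 0.9, size=n_hidden)
        self.W = rng.normal(0.0, 0.1, size=(n_hidden, n_input))
        self.reset()

    def reset(self):
        self.h = np.zeros_like(self.lam)
        # Traces: e_lam[i] = d h[i] / d lam[i], e_W[i, j] = d h[i] / d W[i, j].
        self.e_lam = np.zeros_like(self.lam)
        self.e_W = np.zeros_like(self.W)

    def step(self, x):
        # Update the traces first, because they depend on h_{t-1}.
        self.e_lam = self.lam * self.e_lam + self.h
        self.e_W = self.lam[:, None] * self.e_W + x[None, :]
        self.h = self.lam * self.h + self.W @ x
        return self.h

    def grads(self, dL_dh):
        # Chain rule: the gradient of a loss at the current step with respect
        # to the parameters, covering the full history since the last reset.
        return dL_dh * self.e_lam, dL_dh[:, None] * self.e_W


if __name__ == "__main__":
    cell = DiagonalRTRLCell(n_hidden=4, n_input=3)
    stream = np.random.default_rng(1).normal(size=(50, 3))
    for x in stream:  # one observation at a time, batch size 1
        h = cell.step(x)
        g_lam, g_W = cell.grads(dL_dh=2.0 * h)  # e.g. loss = sum(h ** 2)
    print(g_lam.shape, g_W.shape)
```

In a dense recurrent network the same sensitivity would be a (hidden × parameters) matrix, which is what makes exact RTRL prohibitively expensive there. Here the traces `e_lam` and `e_W` have exactly the shapes of `lam` and `W`, and a streaming learner can combine them with the current loss gradient at every step without truncation.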