🤖 AI Summary
This work addresses the poor sample efficiency of streaming reinforcement learning (RL), which stems from the absence of experience replay and the difficulty of learning effective representations from highly correlated, non-stationary data. To this end, the authors present the first stable integration of self-supervised representation learning into streaming RL, proposing an orthogonal gradient update mechanism based on Self-Predictive Representations (SPR) combined with a momentum target network. This design decouples representation learning from policy gradient updates, mitigating training instability. The method substantially improves data utilization from single-pass observations, significantly outperforming existing streaming approaches on the Atari, MinAtar, and Octax benchmarks. Notably, the learned representations, validated via t-SNE visualization and effective rank analysis, approach the quality of those obtained by replay-buffer-based algorithms, while requiring only a small number of CPU cores for efficient training.
📝 Abstract
In streaming Reinforcement Learning (RL), transitions are observed and discarded immediately after a single update. While this minimizes resource usage for on-device applications, it makes agents notoriously sample-inefficient, since value-based losses alone struggle to extract meaningful representations from transient data. We propose extending Self-Predictive Representations (SPR) to the streaming pipeline to maximize the utility of every observed frame. However, due to the highly correlated samples induced by the streaming regime, naively applying this auxiliary loss results in training instabilities. Thus, we introduce orthogonal gradient updates relative to the momentum target and resolve gradient conflicts arising from streaming-specific optimizers. Validated across the Atari, MinAtar, and Octax suites, our approach systematically outperforms existing streaming baselines. Latent-space analysis, including t-SNE visualizations and effective-rank measurements, confirms that our method learns significantly richer representations, bridging the performance gap caused by the absence of a replay buffer, while remaining efficient enough to train on just a few CPU cores.
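The two ingredients named in the abstract, orthogonal gradient updates and a momentum target network, can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes the orthogonalization works by projecting out the component of the auxiliary SPR gradient that conflicts with the main value-loss gradient (a PCGrad-style resolution), and all function names are illustrative.

```python
import numpy as np

def project_orthogonal(aux_grad, main_grad, eps=1e-8):
    """Remove the component of the auxiliary (SPR) gradient that lies
    along the main value-loss gradient, so the auxiliary update cannot
    directly oppose the value update. Illustrative sketch only."""
    coeff = np.dot(aux_grad, main_grad) / (np.dot(main_grad, main_grad) + eps)
    return aux_grad - coeff * main_grad

def ema_update(target_params, online_params, tau=0.99):
    """Momentum (exponential moving average) target-network update,
    as commonly used by SPR-style self-predictive objectives."""
    return tau * target_params + (1.0 - tau) * online_params

# Example: the projected auxiliary gradient is orthogonal to the main one.
main = np.array([1.0, 0.0, 0.0])
aux = np.array([1.0, 2.0, 0.0])
proj = project_orthogonal(aux, main)
print(np.dot(proj, main))  # ~0.0
```

Combining the projected auxiliary gradient with the value gradient before the optimizer step is one plausible way to realize the decoupling the abstract describes.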