Squeezing More from the Stream: Learning Representation Online for Streaming Reinforcement Learning

📅 2026-02-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the poor sample efficiency of streaming reinforcement learning (RL), which stems from the absence of experience replay and the difficulty of learning effective representations from highly correlated, non-stationary data. The authors present the first stable integration of self-supervised representation learning into streaming RL: a Self-Predictive Representations (SPR) auxiliary objective trained with an orthogonal gradient update mechanism and a momentum target network. This design decouples representation learning from policy gradient updates, mitigating training instability. The method substantially improves data utilization from single-pass observations, significantly outperforming existing streaming approaches on the Atari, MinAtar, and Octax benchmarks. Notably, the learned representations—validated via t-SNE visualization and effective rank analysis—approach the quality of those obtained by replay-buffer-based algorithms, while requiring only a small number of CPU cores for efficient training.
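The summary above mentions a momentum target network. As a rough illustration only, here is a standard exponential-moving-average (EMA) target update of the kind such networks typically use; the exact update rule and the `tau` value are assumptions, not taken from this page:

```python
import numpy as np

def ema_update(target_params, online_params, tau=0.005):
    """Momentum (EMA) target update: target <- (1 - tau) * target + tau * online.

    `target_params` and `online_params` are parallel lists of parameter arrays;
    a small `tau` makes the target a slowly moving copy of the online network.
    """
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]
```

With `tau=0.005` the target network lags the online network, which is what gives the self-predictive targets their stability.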

📝 Abstract
In streaming Reinforcement Learning (RL), transitions are observed and discarded immediately after a single update. While this minimizes resource usage for on-device applications, it makes agents notoriously sample-inefficient, since value-based losses alone struggle to extract meaningful representations from transient data. We propose extending Self-Predictive Representations (SPR) to the streaming pipeline to maximize the utility of every observed frame. However, due to the highly correlated samples induced by the streaming regime, naively applying this auxiliary loss results in training instabilities. Thus, we introduce orthogonal gradient updates relative to the momentum target and resolve gradient conflicts arising from streaming-specific optimizers. Validated across the Atari, MinAtar, and Octax suites, our approach systematically outperforms existing streaming baselines. Latent-space analysis, including t-SNE visualizations and effective-rank measurements, confirms that our method learns significantly richer representations, bridging the performance gap caused by the absence of a replay buffer, while remaining efficient enough to train on just a few CPU cores.
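The abstract's "orthogonal gradient updates" that resolve conflicts between the auxiliary and value losses can be sketched as a gradient projection. This is an assumption about the general shape of such an update (a PCGrad-style projection); the function name, the conflict test, and the choice of which gradient gets projected are illustrative, not the paper's exact method:

```python
import numpy as np

def project_orthogonal(g_aux, g_rl, eps=1e-12):
    """If the auxiliary (e.g. SPR) gradient conflicts with the RL gradient
    (negative dot product), remove its component along the RL gradient:
    g_aux - (g_aux . g_rl / ||g_rl||^2) * g_rl, leaving them orthogonal."""
    g_aux = np.asarray(g_aux, dtype=float)
    g_rl = np.asarray(g_rl, dtype=float)
    dot = g_aux @ g_rl
    if dot < 0.0:  # only project when the two objectives conflict
        g_aux = g_aux - (dot / (g_rl @ g_rl + eps)) * g_rl
    return g_aux
```

After projection, adding the auxiliary gradient can no longer undo progress along the RL gradient direction, which is one way to keep an auxiliary representation loss from destabilizing the value update.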
Problem

Research questions and friction points this paper is trying to address.

streaming reinforcement learning
sample inefficiency
representation learning
transient data
replay buffer
Innovation

Methods, ideas, or system contributions that make the work stand out.

Streaming Reinforcement Learning
Self-Predictive Representations
Orthogonal Gradient Updates
Sample Efficiency
Representation Learning
Nilaksh
Chandar Research Lab, Mila - Quebec AI Institute, Polytechnique Montréal
Antoine Clavaud
Chandar Research Lab, Mila - Quebec AI Institute, Polytechnique Montréal
Mathieu Reymond
Mila - Quebec AI Institute
reinforcement learning, deep reinforcement learning, multi-objective reinforcement learning
François Rivest
Royal Military College of Canada, Mila - Quebec AI Institute
Sarath Chandar
Associate Professor @ Polytechnique Montreal. Mila. Canada CIFAR AI Chair. Canada Research Chair.
Artificial Intelligence, Machine Learning, Deep Learning, Reinforcement Learning, NLP