Revisiting Adam for Streaming Reinforcement Learning

📅 2026-05-07

🤖 AI Summary

This work addresses the instability of streaming reinforcement learning, where agents learn purely online from each transition as it arrives, without a replay buffer. By analysing how established updates such as those of DQN and C51 interact with the optimiser, and with Adam in particular, the study identifies two properties as essential for stable training: a bounded derivative of the objective and variance-adjusted weight updates. Building on these insights, the authors derive Adaptive Q(λ), a variance-adjusted algorithm based on eligibility traces. Empirically, C51, which has both properties, is competitive with StreamQ across a subset of 55 Atari games, while Adaptive Q(λ) approaches double the human baseline on the same subset and surpasses existing methods on all reported performance metrics.
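For readers unfamiliar with eligibility traces, the following is a minimal sketch of a textbook tabular Q(λ) update with accumulating traces (Watkins's variant), applied one transition at a time with no replay buffer. It is not the paper's Adaptive Q(λ) or StreamQ; the function name `q_lambda_update`, its parameters, and the tabular setting are assumptions made purely for illustration.

```python
from collections import defaultdict


def q_lambda_update(q, trace, s, a, r, s_next, next_actions, next_is_greedy,
                    alpha=0.1, gamma=0.99, lam=0.9):
    """Apply one tabular Watkins-style Q(lambda) update in place.

    q and trace are defaultdict(float) keyed by (state, action).
    next_actions is empty when s_next is terminal.
    Hypothetical helper for illustration, not the paper's algorithm.
    """
    # Greedy bootstrap value of the next state (0 for a terminal state).
    next_value = max((q[(s_next, b)] for b in next_actions), default=0.0)
    delta = r + gamma * next_value - q[(s, a)]

    # Accumulating trace: mark the pair that was just visited.
    trace[(s, a)] += 1.0

    # Every recently visited pair receives a share of the TD error.
    for key in list(trace):
        q[key] += alpha * delta * trace[key]
        trace[key] *= gamma * lam

    # Watkins's variant cuts all traces after an exploratory action.
    if not next_is_greedy:
        trace.clear()
    return delta


if __name__ == "__main__":
    q, trace = defaultdict(float), defaultdict(float)
    # Two consecutive transitions of one episode, processed as they arrive.
    q_lambda_update(q, trace, 0, 0, 1.0, 1, [0, 1], True,
                    alpha=0.5, gamma=0.9, lam=0.5)
    q_lambda_update(q, trace, 1, 0, 2.0, 2, [], True,
                    alpha=0.5, gamma=0.9, lam=0.5)
    print(dict(q))
```

The second transition also updates the value of the first state-action pair through its decayed trace, which is how traces propagate credit backwards without storing past transitions. The caller is expected to clear `trace` at the end of each episode.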

📝 Abstract

Learning from a sequence of interactions, as soon as observations are perceived and acted upon, without explicitly storing them, holds the promise of simpler, more efficient and adaptive algorithms. For over a decade, however, deep reinforcement learning walked the contrary path, augmenting agents with replay buffers or parallel sampling routines, in an effort to tame learning instability. Recently, this topic has been revisited by Elsayed et al. (2024), focusing on update computation through eligibility traces and modifications to the optimisation routine, resulting in the StreamQ algorithm. In this work we take a step back, investigating the efficacy of established updates, such as those implemented by DQN and C51, within this online setting. Not only do we find that they perform well, but through analysing how the optimisation algorithm generally, and Adam in particular, interacts with these updates, we contend that two properties are essential for robust performance: i) the derivative of the objective is to be bounded and ii) weight updates are variance-adjusted. Rigorous and exhaustive experimentation demonstrates that C51, which exhibits both characteristics, is competitive with StreamQ across a subset of 55 Atari games. Using these insights, we derive a variance-adjusted algorithm based on eligibility traces, termed Adaptive Q($\lambda$), which approaches double the human baseline on the same subset, surpassing existing methods by all performance metrics.
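The two properties named in the abstract can be illustrated with a small, self-contained sketch. It rests on two assumptions that are not taken from the paper: first, that "variance-adjusted" can be read as Adam's standard division by the square root of a running second-moment estimate; second, that a softmax cross-entropy objective (the kind C51 uses) is a fair example of a bounded derivative, in contrast to a plain squared TD error. All function names below are hypothetical.

```python
import math


def adam_step(grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One scalar Adam update (standard form); returns (delta, m, v)."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)   # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)   # bias-corrected second moment
    delta = -lr * m_hat / (math.sqrt(v_hat) + eps)
    return delta, m, v


def squared_td_grad(q, target):
    """Derivative of 0.5 * (q - target)**2 w.r.t. q: grows with the TD error."""
    return q - target


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_logit_grad(logits, target_probs):
    """Derivative of cross-entropy(target, softmax(logits)) w.r.t. the logits.

    Each component equals p_i - target_i, so it always lies in [-1, 1].
    """
    probs = softmax(logits)
    return [p - t for p, t in zip(probs, target_probs)]


if __name__ == "__main__":
    # The first Adam step has magnitude close to lr whatever the gradient scale.
    for g in (1e-3, 1.0, 1e3):
        delta, _, _ = adam_step(g, 0.0, 0.0, t=1)
        print(f"grad={g:g}  adam_delta={delta:.6g}")

    # A squared TD error has no cap on its derivative ...
    print(squared_td_grad(0.0, 500.0))
    # ... whereas the cross-entropy derivative w.r.t. logits stays in [-1, 1].
    print(cross_entropy_logit_grad([5.0, -2.0, 0.5], [0.0, 1.0, 0.0]))
```

The sketch only shows the mechanics: normalising by the second-moment estimate makes the step size largely insensitive to gradient scale, and a bounded derivative keeps a single surprising transition from dominating the moment estimates. How the paper formalises either property may differ.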

Problem

Research questions and friction points this paper is trying to address.

streaming reinforcement learning

online learning

learning stability

variance-adjusted updates

bounded gradients

Innovation

Methods, ideas, or system contributions that make the work stand out.

streaming reinforcement learning

variance-adjusted updates

eligibility traces