AI Summary
This work addresses the challenge of deploying deep reinforcement learning (RL) on resource-constrained edge devices, where conventional methods relying on experience replay, batch updates, and target networks are impractical. To this end, we propose two novel streaming algorithms, S2AC and SDAC, which establish the first purely online streaming framework compatible with mainstream batch RL paradigms. Our approach eliminates the need for experience replay and target networks, instead integrating soft and deterministic policy gradients to enable efficient fine-tuning. Designed for Sim2Real scenarios, the framework supports a seamless transition from batch pre-training to online streaming fine-tuning. Experiments demonstrate that our method matches state-of-the-art streaming approaches on standard continuous control benchmarks without requiring complex hyperparameter tuning, making it both effective and practical.
Abstract
State-of-the-art deep reinforcement learning (RL) methods have achieved remarkable performance in continuous control tasks, yet their computational demands are often incompatible with resource-limited hardware, due to their reliance on replay buffers, batch updates, and target networks. The emerging paradigm of streaming deep RL addresses this limitation through purely online updates, achieving strong empirical performance on standard benchmarks. In this work, we propose two novel streaming deep RL algorithms, Streaming Soft Actor-Critic (S2AC) and Streaming Deterministic Actor-Critic (SDAC), explicitly designed to be compatible with state-of-the-art batch RL methods, making them particularly suitable for on-device fine-tuning applications such as Sim2Real transfer. Both algorithms achieve performance comparable to state-of-the-art streaming baselines on standard benchmarks without requiring tedious hyperparameter tuning. Finally, we investigate the practical challenges of transitioning from batch to streaming learning during fine-tuning and propose concrete strategies to tackle them.
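To make the "purely online updates" pattern concrete, here is a minimal sketch of a streaming actor-critic with linear function approximation: one gradient step per environment transition, no replay buffer, no batch updates, and the bootstrap value comes from the current critic rather than a target network. The toy 1-D environment, feature map, and step sizes are illustrative assumptions for this sketch, not the paper's S2AC/SDAC implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def features(s):
    # Simple polynomial features of the scalar state (assumed for the sketch).
    return np.array([1.0, s, s * s])

w = np.zeros(3)        # critic weights: V(s) ~ w . phi(s)
theta = np.zeros(3)    # actor weights: policy mean = theta . phi(s)
sigma = 0.3            # fixed Gaussian exploration noise
alpha_v, alpha_pi, gamma = 0.1, 0.01, 0.99

s = 1.0
returns = []
for t in range(2000):
    phi = features(s)
    mu = theta @ phi
    a = mu + sigma * rng.standard_normal()

    # Toy dynamics: the agent is rewarded for driving the state toward 0.
    s_next = 0.9 * s + 0.1 * a
    r = -(s_next ** 2)

    # One-step TD error; the next-state value is bootstrapped from the
    # *current* critic -- no separate target network is maintained.
    delta = r + gamma * (w @ features(s_next)) - (w @ phi)

    # Purely online updates from this single transition, which is then
    # discarded (no experience replay).
    w += alpha_v * delta * phi
    theta += alpha_pi * delta * (a - mu) / (sigma ** 2) * phi

    returns.append(r)
    s = s_next

early, late = np.mean(returns[:200]), np.mean(returns[-200:])
print(late > early)  # per-step reward should improve over the run
```

Each transition is consumed exactly once, so memory use is constant in the stream length; this is the property that makes streaming methods attractive on edge devices, at the cost of noisier updates than batched learning.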