What Matters for Batch Online Reinforcement Learning in Robotics?

📅 2025-05-12
🤖 AI Summary
Problem: Batch online reinforcement learning for robotics scales poorly because existing methods use autonomously collected data inefficiently and policy improvement stagnates. Method: We systematically study three axes (algorithm class, policy extraction method, and policy expressivity) and identify three critical factors: Q-function guidance, implicit policy extraction, and highly expressive policy classes. From these we derive a general training recipe that combines Q-learning, implicit policy extraction that selects the best action from the policy's action distribution, high-capacity policy networks (e.g., Transformers or MLPs), and temporally correlated noise injection to increase data diversity. Contribution/Results: Evaluated on multi-robot tasks, the recipe significantly outperforms state-of-the-art imitation learning and offline RL baselines in both final policy performance and scaling with the amount of data, reducing reliance on manually collected demonstrations while enabling sustained policy improvement from autonomous data.

📝 Abstract
The ability to learn from large batches of autonomously collected data for policy improvement -- a paradigm we refer to as batch online reinforcement learning -- holds the promise of enabling truly scalable robot learning by significantly reducing the human effort required for data collection while benefiting from self-improvement. Yet, despite the promise of this paradigm, it remains challenging to achieve because existing algorithms cannot learn effectively from the autonomous data. For example, prior works have applied imitation learning and filtered imitation learning methods to the batch online RL problem, but these algorithms often fail to improve efficiently from the autonomously collected data or converge quickly to a suboptimal point. This raises the question of what matters for effective batch online RL in robotics. Motivated by this question, we perform a systematic empirical study of three axes -- (i) algorithm class, (ii) policy extraction methods, and (iii) policy expressivity -- and analyze how these axes affect performance and scaling with the amount of autonomous data. Through our analysis, we make several observations. First, we observe that the use of Q-functions to guide batch online RL significantly improves performance over imitation-based methods. Building on this, we show that an implicit method of policy extraction -- via choosing the best action in the distribution of the policy -- is necessary in place of traditional policy extraction methods from offline RL. Next, we show that an expressive policy class is preferred over less expressive policy classes. Based on this analysis, we propose a general recipe for effective batch online RL. We then show that a simple addition to the recipe, using temporally-correlated noise to obtain more diversity, results in further performance gains. Our recipe obtains significantly better performance and scaling compared to prior methods.
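
The temporally-correlated noise mentioned at the end of the abstract can be illustrated with a minimal sketch. The abstract does not say which noise process is used, so the first-order autoregressive (AR(1)) form below, the function name `correlated_noise`, and the parameters `beta` and `sigma` are all assumptions chosen for illustration, not the authors' implementation.

```python
import math
import random


def correlated_noise(num_steps, beta=0.9, sigma=0.1, rng=None):
    """Generate a 1-D AR(1) noise sequence (an assumed, illustrative form).

    beta  : correlation between consecutive steps (0.0 gives independent noise).
    sigma : stationary standard deviation of the noise.
    """
    if not 0.0 <= beta < 1.0:
        raise ValueError("beta must be in [0, 1)")
    if rng is None:
        rng = random.Random()
    # Scaling the fresh noise by sqrt(1 - beta^2) keeps the variance at sigma^2.
    scale = sigma * math.sqrt(1.0 - beta ** 2)
    noise = rng.gauss(0.0, sigma)  # start from the stationary distribution
    sequence = []
    for _ in range(num_steps):
        sequence.append(noise)
        noise = beta * noise + scale * rng.gauss(0.0, 1.0)
    return sequence


# Hypothetical usage: perturb a constant 1-D action over one short rollout.
perturbed_actions = [0.5 + n for n in correlated_noise(10, beta=0.9, sigma=0.05)]
```

With `beta` close to 1 the perturbation drifts smoothly across a rollout instead of jittering independently at every step, which is one plausible way to make autonomously collected trajectories more diverse.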

Problem

Research questions and friction points this paper is trying to address.

Identifying key factors for effective batch online RL in robotics
Comparing algorithm classes and policy extraction methods for performance
Proposing a recipe for scalable autonomous robot learning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Q-functions to guide batch online RL
Extracts the policy implicitly by choosing the best action from the policy's action distribution
Employs an expressive policy class, which outperforms less expressive policy classes
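
As a rough illustration of Q-guided implicit policy extraction, the sketch below samples several candidate actions from a policy and keeps the one the Q-function scores highest. The names `extract_action`, `toy_policy`, and `toy_q`, the sample count, and the 1-D toy setup are hypothetical; the paper's actual policy and critic are learned networks, which this sketch does not attempt to reproduce.

```python
import random


def extract_action(sample_action, q_value, state, num_samples=16, rng=None):
    """Return the sampled candidate action with the highest Q-value.

    sample_action(state, rng) -> action   draws one action from the policy
    q_value(state, action)    -> float    scores a state-action pair
    """
    if rng is None:
        rng = random.Random()
    candidates = [sample_action(state, rng) for _ in range(num_samples)]
    return max(candidates, key=lambda action: q_value(state, action))


# Toy stand-ins (assumptions): the policy proposes 1-D actions around 0,
# while the critic prefers actions close to 1.
def toy_policy(state, rng):
    return rng.gauss(0.0, 1.0)


def toy_q(state, action):
    return -(action - 1.0) ** 2


best_action = extract_action(toy_policy, toy_q, state=None, num_samples=32)
```

Because the action is chosen only among samples the policy itself proposes, the Q-function steers behavior without any gradient update to the policy, which is what makes the extraction implicit.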