XQCfD: Accelerating Fast Actor-Critic Algorithms with Prior Data and Prior Policies

📅 2026-05-11
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the cost of online exploration in real-world reinforcement learning, where existing methods fail to make effective use of expert demonstrations and pretrained policies. The authors propose XQCfD, an extension of the sample-efficient XQC actor-critic that augments the replay buffer with demonstration data, initializes the policy from a pretrained model, and uses a stationary policy network architecture. This design avoids rapidly unlearning the strong initial policy, and its higher-entropy predictions enable better out-of-distribution policy improvement than standard architectures. The approach achieves state-of-the-art performance on sparse-reward, complex manipulation tasks from the Adroit, Robomimic, and MimicGen benchmarks, with a low update-to-data ratio and no ensemble networks.
📝 Abstract
For reinforcement learning in the real world, online exploration is expensive. A common practice in robotic reinforcement learning is to incorporate additional data to improve sample efficiency. Expert demonstration data is often crucial for solving hard exploration tasks with sparse rewards. While prior data is used to augment experience and pretrain models, we show that the design of existing algorithms fails to achieve the sample efficiency that is possible in this setting, due to a failure to use pretrained policies effectively. We propose XQCfD, which extends the sample-efficient XQC actor-critic to learn from demonstrations using augmented replay buffers, pretrained policies, and stationary policy architectures designed to avoid rapidly unlearning the strong initial policy, as prior works do. We show our stationary network architecture enables policy improvement out-of-distribution better than standard network architectures due to its higher entropy predictions. XQCfD achieves state-of-the-art performance across a range of complex manipulation tasks with sparse rewards from the popular Adroit, Robomimic, and MimicGen benchmarks, notably with a low update-to-data ratio and no ensemble networks.
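The abstract mentions augmented replay buffers without giving details, so the following is only an illustrative sketch of the general idea, not the authors' implementation. It assumes demonstrations are stored permanently alongside a bounded buffer of online transitions and that batches are drawn uniformly from the union. The names `Transition` and `DemoAugmentedReplayBuffer`, the eviction rule, and the sampling scheme are all hypothetical.

```python
import random
from collections import deque
from typing import NamedTuple


class Transition(NamedTuple):
    """One environment step (hypothetical container, not from the paper)."""
    obs: tuple
    action: tuple
    reward: float
    next_obs: tuple
    done: bool


class DemoAugmentedReplayBuffer:
    """Replay buffer seeded with demonstration transitions (illustrative only)."""

    def __init__(self, demos, capacity=100_000, seed=0):
        # Assumption: demonstrations are never evicted; only online data is.
        self.demos = list(demos)
        self.online = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, transition):
        # The oldest online transition is dropped once capacity is reached.
        self.online.append(transition)

    def __len__(self):
        return len(self.demos) + len(self.online)

    def sample(self, batch_size):
        # Assumption: uniform sampling (with replacement) over demo + online data.
        batch = []
        for _ in range(batch_size):
            idx = self.rng.randrange(len(self))
            if idx < len(self.demos):
                batch.append(self.demos[idx])
            else:
                batch.append(self.online[idx - len(self.demos)])
        return batch


# Usage: five demo transitions plus ten online steps in a capacity-3 online buffer.
demos = [Transition((0.0,), (1.0,), 1.0, (1.0,), True) for _ in range(5)]
buffer = DemoAugmentedReplayBuffer(demos, capacity=3)
for t in range(10):
    buffer.add(Transition((float(t),), (0.0,), 0.0, (float(t + 1),), False))
batch = buffer.sample(4)
```

In this sketch the demonstrations stay available for the whole run while online data cycles through a fixed-size window. Other mixing schemes, such as a fixed demo-to-online ratio per batch, would be equally plausible readings of the abstract.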
Problem

Research questions and friction points this paper is trying to address.

reinforcement learning
sample efficiency
expert demonstrations
sparse rewards
pretrained policies
Innovation

Methods, ideas, or system contributions that make the work stand out.

sample efficiency
pretrained policies
stationary policy architecture
out-of-distribution generalization
replay buffer augmentation