Residual Off-Policy RL for Finetuning Behavior Cloning Policies

📅 2025-09-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
High-degree-of-freedom robots face low sample efficiency, difficulty optimizing under sparse rewards, and insufficient safety guarantees during real-world reinforcement learning (RL) training. Method: We propose a residual off-policy RL fine-tuning framework: a behavior cloning (BC) policy serves as a fixed black-box base, and only lightweight per-step residual corrections are learned via sample-efficient off-policy RL, requiring only sparse binary rewards rather than dense reward signals. Contribution/Results: To the best of the authors' knowledge, this is the first work to achieve successful real-world RL training on a humanoid robot with dexterous hands, significantly alleviating bottlenecks in sample efficiency and long-horizon task learning. The method attains state-of-the-art performance on both simulated and real-world visuomotor control tasks, demonstrating its effectiveness in high-dimensional systems and feasibility for practical deployment.

📝 Abstract
Recent advances in behavior cloning (BC) have enabled impressive visuomotor control policies. However, these approaches are limited by the quality of human demonstrations, the manual effort required for data collection, and the diminishing returns from increasing offline data. In comparison, reinforcement learning (RL) trains an agent through autonomous interaction with the environment and has shown remarkable success in various domains. Still, training RL policies directly on real-world robots remains challenging due to sample inefficiency, safety concerns, and the difficulty of learning from sparse rewards for long-horizon tasks, especially for high-degree-of-freedom (DoF) systems. We present a recipe that combines the benefits of BC and RL through a residual learning framework. Our approach leverages BC policies as black-box bases and learns lightweight per-step residual corrections via sample-efficient off-policy RL. We demonstrate that our method requires only sparse binary reward signals and can effectively improve manipulation policies on high-degree-of-freedom (DoF) systems in both simulation and the real world. In particular, we demonstrate, to the best of our knowledge, the first successful real-world RL training on a humanoid robot with dexterous hands. Our results demonstrate state-of-the-art performance in various vision-based tasks, pointing towards a practical pathway for deploying RL in the real world. Project website: https://residual-offpolicy-rl.github.io
Problem

Research questions and friction points this paper is trying to address.

Improving behavior cloning policies limited by human demonstration quality
Overcoming RL challenges like sample inefficiency and safety concerns
Enabling effective RL training on high-degree-of-freedom real-world systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Residual learning framework combining BC and RL
Leverages BC policies as black-box bases
Learns lightweight per-step residual corrections
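
The residual composition described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the class name, the `scale` parameter, and the `tanh` bounding are assumptions chosen to make the additive-correction idea concrete.

```python
import numpy as np

class ResidualController:
    """Illustrative sketch: a frozen BC policy provides the base action,
    and a lightweight residual policy (trained via off-policy RL) adds
    a small per-step correction on top of it."""

    def __init__(self, bc_policy, residual_policy, scale=0.1):
        self.bc_policy = bc_policy            # black-box base (frozen)
        self.residual_policy = residual_policy  # learned correction head
        self.scale = scale                    # assumed bound on correction size

    def act(self, obs):
        base = self.bc_policy(obs)               # base action from BC
        delta = self.residual_policy(obs, base)  # residual conditioned on obs and base action
        # Bounded additive correction keeps the executed action close
        # to the BC policy's output (a common residual-RL design choice).
        return base + self.scale * np.tanh(delta)
```

Keeping the residual small at initialization means the combined policy starts at the BC policy's performance and can only improve from there, which is one reason residual schemes suit safety-sensitive real-world training.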