End-to-end RL Improves Dexterous Grasping Policies

📅 2025-09-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the scalability bottleneck in image-based end-to-end dexterous grasping policy training, caused by the tight coupling between visual simulation and reinforcement learning, this paper proposes a decoupled architecture that separates high-cost visual simulation from RL training, enabling cross-GPU asynchronous parallel simulation and large-batch PPO optimization. The method combines depth-aware policy distillation, experience-buffer resampling, and a stereo RGB network design, achieving, for the first time, efficient knowledge transfer from depth-conditioned policies to purely vision-based ones. On identical hardware, the number of simulated environments doubles and training throughput increases significantly. On both simulated and real robotic platforms, the approach achieves higher grasping success rates than existing vision-based RL methods, demonstrating emergent active-vision behavior and strong generalization.

📝 Abstract
This work explores techniques to scale up image-based end-to-end learning for dexterous grasping with an arm + hand system. Unlike state-based RL, vision-based RL is far less memory efficient, forcing relatively small batch sizes, which is not amenable to algorithms like PPO. It remains an attractive method nonetheless: unlike the more common approach of distilling state-based policies into vision networks, end-to-end RL allows emergent active-vision behaviors. We identify that a key bottleneck in training these policies is the way most existing simulators scale to multiple GPUs using traditional data parallelism. We propose a new method that disaggregates the simulator and RL (both training and experience buffers) onto separate GPUs. On a node with four GPUs, the simulator runs on three of them and PPO on the fourth. We show that with the same number of GPUs, we can double the number of environments compared to the previous baseline of standard data parallelism. This allows us to train vision-based policies end-to-end with depth, which previously performed far worse under the baseline. We train and distill both depth- and state-based policies into stereo RGB networks and show that depth distillation leads to better results, both in simulation and in reality. This improvement is likely due to the observability gap between state and vision policies, which does not exist when distilling depth policies into stereo RGB. We further show that the larger batch size enabled by disaggregated simulation also improves real-world performance. When deploying in the real world, our end-to-end policies improve upon the previous state-of-the-art vision-based results.
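The disaggregation described above (simulator on three GPUs, PPO learner on the fourth, connected by experience buffers) can be sketched as a producer/consumer pattern. This is a minimal single-process illustration, not the paper's implementation: Python threads and a shared queue stand in for separate GPUs, and all names and sizes are hypothetical.

```python
import queue
import random
import threading

def simulator_worker(worker_id, rollout_queue, num_rollouts):
    """Stands in for one simulator GPU: steps environments and
    pushes fixed-length rollouts into a shared experience buffer."""
    for _ in range(num_rollouts):
        # A real simulator would render depth/stereo RGB observations here.
        rollout = {
            "worker": worker_id,
            "obs": [random.random() for _ in range(8)],
            "actions": [random.random() for _ in range(8)],
        }
        rollout_queue.put(rollout)

def learner(rollout_queue, total_rollouts, batch_size):
    """Stands in for the PPO GPU: drains the buffer asynchronously and
    assembles large batches from all simulator workers combined."""
    consumed, batches, batch = 0, 0, []
    while consumed < total_rollouts:
        batch.append(rollout_queue.get())
        consumed += 1
        if len(batch) == batch_size:
            batches += 1  # a real learner would run a PPO update here
            batch = []
    return batches

rollout_queue = queue.Queue()
workers = [
    threading.Thread(target=simulator_worker, args=(i, rollout_queue, 4))
    for i in range(3)  # three "simulator GPUs"
]
for w in workers:
    w.start()
num_batches = learner(rollout_queue, total_rollouts=12, batch_size=6)
for w in workers:
    w.join()
print(num_batches)
```

The key property mirrored here is that the learner's batch size is decoupled from any single simulator's throughput: batches are assembled from all producers, which is what lets the learner see larger batches than data-parallel training on the same hardware.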
Problem

Research questions and friction points this paper is trying to address.

Memory inefficiency in vision-based RL limits batch sizes for dexterous grasping
Traditional data parallelism bottlenecks multi-GPU training for simulators and RL
Performance gap exists between state-based and vision policies for robotic grasping
Innovation

Methods, ideas, or system contributions that make the work stand out.

Disaggregated simulator and RL onto separate GPUs
Distilled depth policies into stereo RGB networks
Increased batch size via multi-GPU environment scaling
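The second innovation, distilling a depth-conditioned teacher into a stereo RGB student, reduces at its core to a behavior-cloning regression onto the teacher's actions. The sketch below shows that loss with NumPy arrays standing in for network outputs; the shapes, names, and loss choice are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: a batch of 32 states, 22-dim arm + hand actions.
batch, action_dim = 32, 22

# Stand-ins for network outputs: the frozen depth-conditioned teacher's
# actions, and the stereo RGB student's predictions for the same states.
teacher_actions = rng.normal(size=(batch, action_dim))
student_actions = teacher_actions + 0.1 * rng.normal(size=(batch, action_dim))

def distillation_loss(student, teacher):
    """Mean squared error between student and teacher actions,
    averaged over both the batch and action dimensions."""
    return float(np.mean((student - teacher) ** 2))

loss = distillation_loss(student_actions, teacher_actions)
print(round(loss, 4))
```

Because teacher and student here observe the same scene through cameras (depth vs. stereo RGB), the targets are achievable by the student; that is the observability argument the abstract gives for why depth distillation outperforms distilling from a state-based teacher.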